staticizer 0.0.2
- checksums.yaml +7 -0
- data/.gitignore +17 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +22 -0
- data/README.md +110 -0
- data/Rakefile +1 -0
- data/bin/staticizer +11 -0
- data/lib/staticizer/command.rb +76 -0
- data/lib/staticizer/crawler.rb +213 -0
- data/lib/staticizer/version.rb +3 -0
- data/lib/staticizer.rb +9 -0
- data/staticizer.gemspec +26 -0
- data/tests/crawler_test.rb +15 -0
- metadata +114 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA1:
  metadata.gz: 60dbe34b0e69038956585a2c25804422be99bf3e
  data.tar.gz: 539fb871a2cbbb99d81b5365f942f8a04c2232bf
SHA512:
  metadata.gz: cbb0db4aa4864d060534f971a1c16d8d66d26967ae9bc04c7fe6d155339ba80b9c4fe00c30d11fef4c7a696e8f9a8441428b0fee9ae3d865671690257480141d
  data.tar.gz: 613cb6ac0a9109443f199fa294da8c0a0106b543555402738cabe3a996b12c033d27a8414c52f79f5daa92ecd3a0a9bb0c816bf3558638e712d64909147b1474
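These digests can be checked against a downloaded copy of the gem. The snippet below is not part of the gem; it is a minimal sketch, assuming a local staticizer-0.0.2.gem and the standard gem layout in which metadata.gz and data.tar.gz sit at the top level of the .gem tar archive (the same approach works for the SHA512 values via Digest::SHA512).

require 'digest'
require 'rubygems/package'

# Print the SHA1 digest of each top-level entry so it can be compared
# with the values in checksums.yaml above.
File.open("staticizer-0.0.2.gem", "rb") do |io|
  Gem::Package::TarReader.new(io).each do |entry|
    next unless %w[metadata.gz data.tar.gz].include?(entry.full_name)
    puts "#{entry.full_name}: #{Digest::SHA1.hexdigest(entry.read)}"
  end
end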
data/.gitignore
ADDED
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,22 @@
Copyright (c) 2014 Conor Hunt

MIT License

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,110 @@
# Staticizer

A tool to create a static version of a website for hosting on S3.

## Rationale

One of our clients needed a reliable emergency backup for a
website. If the website goes down this backup would be available
with reduced functionality.

S3 and Route 53 provide a great way to host a static emergency backup for a website.
See this article: http://aws.typepad.com/aws/2013/02/create-a-backup-website-using-route-53-dns-failover-and-s3-website-hosting.html
In our experience it works very well and is incredibly cheap at less than US$1 a month (depending on the size of the website).

We tried using existing tools (httrack, wget) to crawl and create a static version
of the site to upload to S3, but we found that they did not work well with S3 hosting.
We wanted the site uploaded to S3 to respond to the *exact* same URLs (where possible) as
the existing site. This way, when the site goes down, incoming links from Google search
results etc. will still work.

## Installation

Add this line to your application's Gemfile:

    gem 'staticizer'

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install staticizer

## Command line usage

The tool can be used either via the `staticizer` command line tool or by requiring the library.

### Crawl a website and write to disk

    staticizer http://squaremill.com --output-dir=/tmp/crawl

### Crawl a website and upload to AWS

    staticizer http://squaremill.com --aws-s3-bucket=squaremill.com --aws-access-key=HJFJS5gSJHMDZDFFSSDQQ --aws-secret-key=HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s

### Crawl a website and allow several domains to be crawled

    staticizer http://squaremill.com --valid-domains=squaremill.com,www.squaremill.com,img.squaremill.com

## Code Usage

For all these examples you must first:

    require 'staticizer'

### Crawl a website and upload to AWS

This will only crawl URLs in the domain squaremill.com.

    s = Staticizer::Crawler.new("http://squaremill.com",
      :aws => {
        :bucket_name => "www.squaremill.com",
        :secret_access_key => "HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s",
        :access_key_id => "HJFJS5gSJHMDZDFFSSDQQ"
      }
    )
    s.crawl

### Crawl a website and write to disk

    s = Staticizer::Crawler.new("http://squaremill.com", :output_dir => "/tmp/crawl")
    s.crawl

### Crawl a website and rewrite all non-www URLs to www

    s = Staticizer::Crawler.new("http://squaremill.com",
      :aws => {
        :bucket_name => "www.squaremill.com",
        :secret_access_key => "HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s",
        :access_key_id => "HJFJS5gSJHMDZDFFSSDQQ"
      },
      :filter_url => lambda do |url, info|
        # Only crawl the URL if it matches squaremill.com or www.squaremill.com
        if url =~ %r{https?://(www\.)?squaremill\.com}
          # Rewrite non-www URLs to www
          return url.gsub(%r{https?://(www\.)?squaremill\.com}, "http://www.squaremill.com")
        end
        # Returning nil here prevents the URL from being crawled
      end
    )
    s.crawl

## Crawler Options

* :aws - Hash of connection options passed to the aws-sdk gem
* :filter_url - Proc called to decide whether a discovered URL should be crawled; return nil to skip a URL, or return the URL (possibly modified) to crawl it
* :output_dir - If writing a site to disk, the directory to write to; it will be created if it does not exist
* :logger - A logger object responding to the usual Ruby Logger methods
* :log_level - Log level; defaults to Logger::INFO
* :valid_domains - Array of domains that should be crawled; domains not in this list will be ignored

Several of these options can be combined, as the sketch below shows.
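A minimal sketch, not from the original README; the domain, paths, and log file below are placeholders:

    require 'logger'
    require 'staticizer'

    # Crawl squaremill.com plus two related domains, write the result to disk,
    # and send debug-level log output to a file instead of STDOUT.
    s = Staticizer::Crawler.new("http://squaremill.com",
      :valid_domains => ["squaremill.com", "www.squaremill.com", "img.squaremill.com"],
      :output_dir    => "/tmp/squaremill-backup",
      :logger        => Logger.new("/tmp/staticizer.log"),
      :log_level     => Logger::DEBUG
    )
    s.crawl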

## Contributing

1. Fork it
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create new Pull Request
data/Rakefile
ADDED
@@ -0,0 +1 @@
require "bundler/gem_tasks"
data/bin/staticizer
ADDED
@@ -0,0 +1,11 @@
#!/usr/bin/env ruby

lib = File.expand_path(File.dirname(__FILE__) + '/../lib')
$LOAD_PATH.unshift(lib) if File.directory?(lib) && !$LOAD_PATH.include?(lib)

require 'staticizer'
require 'staticizer/command'

options, initial_page = Staticizer::Command.parse(ARGV)
s = Staticizer::Crawler.new(initial_page, options)
s.crawl
data/lib/staticizer/command.rb
ADDED
@@ -0,0 +1,76 @@
require 'optparse'

module Staticizer
  class Command
    # Parse command line arguments and print out any errors
    def Command.parse(args)
      options = {}
      initial_page = nil

      parser = OptionParser.new do |opts|
        opts.banner = "Usage: staticizer initial_url [options]\nExample: staticizer http://squaremill.com --output-dir=/tmp/crawl"

        opts.separator ""
        opts.separator "Specific options:"

        opts.on("--aws-s3-bucket [STRING]", "Name of S3 bucket to write to") do |v|
          options[:aws] ||= {}
          options[:aws][:bucket_name] = v
        end

        opts.on("--aws-access-key [STRING]", "AWS Access Key ID") do |v|
          options[:aws] ||= {}
          options[:aws][:access_key_id] = v
        end

        opts.on("--aws-secret-key [STRING]", "AWS Secret Access Key") do |v|
          options[:aws] ||= {}
          options[:aws][:secret_access_key] = v
        end

        opts.on("-d", "--output-dir [DIRECTORY]", "Write crawl to disk in this directory, will be created if it does not exist") do |v|
          options[:output_dir] = v
        end

        opts.on("-v", "--verbose", "Run verbosely (sets log level to Logger::DEBUG)") do |v|
          options[:log_level] = Logger::DEBUG
        end

        opts.on("--log-level [NUMBER]", "Set log level 0 = most verbose to 4 = least verbose") do |v|
          options[:log_level] = v.to_i
        end

        opts.on("--log-file [PATH]", "Log file to write to") do |v|
          options[:logger] = Logger.new(v)
        end

        opts.on("--valid-domains x,y,z", Array, "Comma separated list of domains that should be crawled, other domains will be ignored") do |v|
          options[:valid_domains] = v
        end

        opts.on_tail("-h", "--help", "Show this message") do
          puts opts
          exit
        end
      end

      begin
        parser.parse!(args)
        initial_page = ARGV.pop
        raise ArgumentError, "Need to specify an initial URL to start the crawl" unless initial_page
      rescue StandardError => e
        puts e
        puts parser
        exit(1)
      end

      return options, initial_page
    end
  end
end
data/lib/staticizer/crawler.rb
ADDED
@@ -0,0 +1,213 @@
require 'net/http'
require 'fileutils'
require 'nokogiri'
require 'aws-sdk'
require 'logger'

module Staticizer
  class Crawler
    def initialize(initial_page, opts = {})
      if initial_page.nil?
        raise ArgumentError, "Initial page required"
      end

      @opts = opts.dup
      @url_queue = []
      @processed_urls = []
      @opts[:output_dir] ||= File.expand_path("crawl/")
      @log = @opts[:logger] || Logger.new(STDOUT)
      @log.level = @opts[:log_level] || Logger::INFO

      if @opts[:aws]
        bucket_name = @opts[:aws].delete(:bucket_name)
        AWS.config(opts[:aws])
        @s3_bucket = AWS::S3.new.buckets[bucket_name]
        @s3_bucket.acl = :public_read
      end

      if @opts[:valid_domains].nil?
        uri = URI.parse(initial_page)
        @opts[:valid_domains] ||= [uri.host]
      end
      add_url(initial_page)
    end

    def crawl
      @log.info("Starting crawl")
      while(@url_queue.length > 0)
        url, info = @url_queue.shift
        @processed_urls << url
        process_url(url, info)
      end
      @log.info("Finished crawl")
    end

    def extract_hrefs(doc, base_uri)
      doc.xpath("//a/@href").map {|href| make_absolute(base_uri, href) }
    end

    def extract_images(doc, base_uri)
      doc.xpath("//img/@src").map {|src| make_absolute(base_uri, src) }
    end

    def extract_links(doc, base_uri)
      doc.xpath("//link/@href").map {|href| make_absolute(base_uri, href) }
    end

    def extract_scripts(doc, base_uri)
      doc.xpath("//script/@src").map {|src| make_absolute(base_uri, src) }
    end

    def extract_css_urls(css, base_uri)
      css.scan(/url\(([^)]+)\)/).map {|src| make_absolute(base_uri, src[0]) }
    end

    def add_urls(urls, info = {})
      urls.compact.uniq.each {|url| add_url(url, info.dup) }
    end

    def make_absolute(base_uri, href)
      URI::join(base_uri, href).to_s
    rescue StandardError => e
      @log.error "Could not make absolute #{base_uri} - #{href}"
    end

    def add_url(url, info = {})
      if @opts[:filter_url]
        url = @opts[:filter_url].call(url, info)
        return if url.nil?
      else
        regex = "(#{@opts[:valid_domains].join(")|(")})"
        return if url !~ %r{^https?://#{regex}}
      end

      url = url.sub(/#.*$/,'') # strip off any fragments
      return if @url_queue.index {|u| u[0] == url } || @processed_urls.include?(url)
      @url_queue << [url, info]
    end

    def save_page(response, uri)
      if @opts[:aws]
        save_page_to_aws(response, uri)
      else
        save_page_to_disk(response, uri)
      end
    end

    def save_page_to_disk(response, uri)
      path = uri.path
      path += "?#{uri.query}" if uri.query

      path_segments = path.scan(%r{[^/]*/})
      filename = path.include?("/") ? path[path.rindex("/")+1..-1] : path

      current = @opts[:output_dir]
      FileUtils.mkdir_p(current) unless File.exist?(current)

      # Create all the directories necessary for this file
      path_segments.each do |segment|
        current = File.join(current, "#{segment}").sub(%r{/$},'')
        if File.file?(current)
          # If we are trying to create a directory and there already is a file
          # with the same name add a .d to the file since we can't create
          # a directory and file with the same name in the file system
          dirfile = current + ".d"
          FileUtils.mv(current, dirfile)
          FileUtils.mkdir(current)
          FileUtils.cp(dirfile, File.join(current, "/index.html"))
        elsif !File.exists?(current)
          FileUtils.mkdir(current)
        end
      end

      body = response.respond_to?(:read_body) ? response.read_body : response
      outfile = File.join(current, "/#{filename}")
      if filename == ""
        indexfile = File.join(outfile, "/index.html")
        @log.info "Saving #{indexfile}"
        File.open(indexfile, "wb") {|f| f << body }
      elsif File.directory?(outfile)
        dirfile = outfile + ".d"
        @log.info "Saving #{dirfile}"
        File.open(dirfile, "wb") {|f| f << body }
        FileUtils.cp(dirfile, File.join(outfile, "/index.html"))
      else
        @log.info "Saving #{outfile}"
        File.open(outfile, "wb") {|f| f << body }
      end
    end

    def save_page_to_aws(response, uri)
      key = uri.path
      key += "?#{uri.query}" if uri.query
      key = key.gsub(%r{^/},"")
      key = "index.html" if key == ""
      # Upload this file directly to AWS::S3
      opts = {:acl => :public_read}
      opts[:content_type] = response['content-type'] rescue "text/html"
      @log.info "Uploading #{key} to s3 with content type #{opts[:content_type]}"
      if response.respond_to?(:read_body)
        @s3_bucket.objects[key].write(response.read_body, opts)
      else
        @s3_bucket.objects[key].write(response, opts)
      end
    end

    def process_success(response, parsed_uri)
      url = parsed_uri.to_s
      case response['content-type']
      when /css/
        save_page(response, parsed_uri)
        add_urls(extract_css_urls(response.body, url), {:type_hint => "css_url"})
      when /html/
        save_page(response, parsed_uri)
        doc = Nokogiri::HTML(response.body)
        add_urls(extract_links(doc, url), {:type_hint => "link"})
        add_urls(extract_scripts(doc, url), {:type_hint => "script"})
        add_urls(extract_images(doc, url), {:type_hint => "image"})
        add_urls(extract_hrefs(doc, url), {:type_hint => "href"})
      else
        save_page(response, parsed_uri)
      end
    end

    # If we hit a redirect we save the redirect as a meta refresh page
    # TODO: for AWS S3 hosting we could instead create a redirect?
    def process_redirect(url, destination_url)
      body = "<html><head><META http-equiv='refresh' content='0;URL=\"#{destination_url}\"'></head><body>You are being redirected to <a href='#{destination_url}'>#{destination_url}</a>.</body></html>"
      save_page_to_aws(body, url)
    end

    # Fetch a URI and save it to disk
    def process_url(url, info)
      @http_connections ||= {}
      parsed_uri = URI(url)

      @log.debug "Fetching #{parsed_uri}"

      # Attempt to use an already open Net::HTTP connection
      key = parsed_uri.host + parsed_uri.port.to_s
      connection = @http_connections[key]
      if connection.nil?
        connection = Net::HTTP.new(parsed_uri.host, parsed_uri.port)
        @http_connections[key] = connection
      end

      request = Net::HTTP::Get.new(parsed_uri.request_uri)
      connection.request(request) do |response|
        case response
        when Net::HTTPSuccess
          process_success(response, parsed_uri)
        when Net::HTTPRedirection
          redirect_url = response['location']
          @log.debug "Processing redirect to #{redirect_url}"
          process_redirect(parsed_uri, redirect_url)
          add_url(redirect_url)
        else
          @log.error "Error #{response.code}:#{response.message} fetching url #{url}"
        end
      end
    end

  end
end
data/lib/staticizer.rb
ADDED
data/staticizer.gemspec
ADDED
@@ -0,0 +1,26 @@
# coding: utf-8
lib = File.expand_path('../lib', __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require 'staticizer/version'

Gem::Specification.new do |spec|
  spec.name          = "staticizer"
  spec.version       = Staticizer::VERSION
  spec.authors       = ["Conor Hunt"]
  spec.email         = ["conor.hunt+git@gmail.com"]
  spec.description   = %q{A tool to create a static version of a website for hosting on S3. Can be used to create a cheap emergency backup version of a dynamic website.}
  spec.summary       = %q{A tool to create a static version of a website for hosting on S3.}
  spec.homepage      = "https://github.com/SquareMill/staticizer"
  spec.license       = "MIT"

  spec.files         = `git ls-files`.split($/)
  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
  spec.require_paths = ["lib"]

  spec.add_development_dependency "bundler", "~> 1.3"
  spec.add_development_dependency "rake"

  spec.add_runtime_dependency 'nokogiri'
  spec.add_runtime_dependency 'aws-sdk'
end
data/tests/crawler_test.rb
ADDED
@@ -0,0 +1,15 @@
require 'minitest/autorun'

# TODO!

class TestFilePaths < MiniTest::Unit::TestCase
  # Expected mapping from a crawled URL path to the file path(s) written to disk
  tests = {
    "" => "index.html",
    "/" => "index.html",
    "/asdfdf/dfdf" => "/asdfdf/dfdf",
    "/asdfdf/dfdf/" => ["/asdfdf/dfdf", "/asdfdf/dfdf/index.html"],
    "/asdfad/asdffd.test" => "/asdfad/asdffd.test",
    "/?asdfsd=12312" => "/?asdfsd=12312",
    "/asdfad/asdffd.test?123=sdff" => "/asdfad/asdffd.test?123=sdff"
  }
end
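The test file above only lists the expected mappings; no assertions ship with the gem. The snippet below is not part of the gem: it is a minimal sketch of how one of those mappings could be asserted against Crawler#save_page_to_disk, assuming the gem is on the load path and relying on the fact that save_page_to_disk accepts a plain String body when the response object has no #read_body.

require 'minitest/autorun'
require 'tmpdir'
require 'uri'
require 'staticizer'

class TestSavePageToDisk < MiniTest::Unit::TestCase
  def test_root_path_is_saved_as_index_html
    Dir.mktmpdir do |dir|
      crawler = Staticizer::Crawler.new("http://example.com", :output_dir => dir)
      # "/" should map to index.html in the output directory
      crawler.save_page_to_disk("<html></html>", URI.parse("http://example.com/"))
      assert File.exist?(File.join(dir, "index.html"))
    end
  end
end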
metadata
ADDED
@@ -0,0 +1,114 @@
--- !ruby/object:Gem::Specification
name: staticizer
version: !ruby/object:Gem::Version
  version: 0.0.2
platform: ruby
authors:
- Conor Hunt
autorequire:
bindir: bin
cert_chain: []
date: 2014-01-14 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  name: bundler
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: '1.3'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: '1.3'
- !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
- !ruby/object:Gem::Dependency
  name: nokogiri
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
- !ruby/object:Gem::Dependency
  name: aws-sdk
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
description: A tool to create a static version of a website for hosting on S3. Can
  be used to create a cheap emergency backup version of a dynamic website.
email:
- conor.hunt+git@gmail.com
executables:
- staticizer
extensions: []
extra_rdoc_files: []
files:
- .gitignore
- Gemfile
- LICENSE.txt
- README.md
- Rakefile
- bin/staticizer
- lib/staticizer.rb
- lib/staticizer/command.rb
- lib/staticizer/crawler.rb
- lib/staticizer/version.rb
- staticizer.gemspec
- tests/crawler_test.rb
homepage: https://github.com/SquareMill/staticizer
licenses:
- MIT
metadata: {}
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
    - !ruby/object:Gem::Version
      version: '0'
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
    - !ruby/object:Gem::Version
      version: '0'
requirements: []
rubyforge_project:
rubygems_version: 2.1.9
signing_key:
specification_version: 4
summary: A tool to create a static version of a website for hosting on S3.
test_files: []