snapcrawl 0.4.4 → 0.5.0.rc1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 99d66eca6d9f9ef3b952591ab2ca355094cd0dcea07669ae76474693d5b4caf4
4
- data.tar.gz: 04b4d1a41d8e3b550519f87c9f837ef5ad0ff5b9af7528e317a1dd0943326e12
3
+ metadata.gz: ced7afea220ea7c23c7207037cb32d02625fc3278e8e2347c0c9327fc0f0e509
4
+ data.tar.gz: 12e7a758a10cba960027ce2152187aed99cfec0c0ea2a434431a34a11a1e2f04
5
5
  SHA512:
6
- metadata.gz: 07bb6174d3681559d18cb73ca3c7b18f37c60a13f483ccab530b31845db7019f2413680e4e61907cadf8aeccdf6343bee484bafd444c321978de279ca3bbcda6
7
- data.tar.gz: 70bbb5cf8508417b5a98c3af830efb8251cc7216c73ef66b1c31495d573e8289d1b1662779305ff42a6aaafedb7f621ba7942b6b469f17c6513617e98dbe8432
6
+ metadata.gz: 117c0157a09a7e040c3c487c6f0d51fa20ad9c9a6be965cb8083eb32c6201effa406d0cbbd428190e1ffc41b1097347113e3ce03a88eae273a5d4d6fd2a8c85d
7
+ data.tar.gz: 5261d94ef0a0a2223963b70fd0bd8cc6c822e31a693d5bbcc8f452e51f92ef519df20decea90351c3e85f3bfaf30e725be1d6a4d76b4d2748663de44a7772e88
data/README.md CHANGED
@@ -1,5 +1,4 @@
1
- Snapcrawl - crawl a website and take screenshots
2
- ==================================================
1
+ # Snapcrawl - crawl a website and take screenshots
3
2
 
4
3
  [![Gem Version](https://badge.fury.io/rb/snapcrawl.svg)](http://badge.fury.io/rb/snapcrawl)
5
4
  [![Build Status](https://github.com/DannyBen/snapcrawl/workflows/Test/badge.svg)](https://github.com/DannyBen/snapcrawl/actions?query=workflow%3ATest)
@@ -11,8 +10,7 @@ Snapcrawl is a command line utility for crawling a website and saving
11
10
  screenshots.
12
11
 
13
12
 
14
- Features
15
- --------------------------------------------------
13
+ ## Features
16
14
 
17
15
  - Crawls a website to any given depth and saves screenshots
18
16
  - Can capture the full length of the page
@@ -21,100 +19,109 @@ Features
21
19
  - Uses local caching to avoid expensive crawl operations if not needed
22
20
  - Reports broken links
23
21
 
22
+ ## Install
24
23
 
25
- Prerequisites
26
- --------------------------------------------------
27
-
28
- Snapcrawl requires [PhantomJS][1] and [ImageMagick][2].
29
-
30
-
31
- Docker Image
32
- --------------------------------------------------
24
+ **Using Docker**
33
25
 
34
26
  You can run Snapcrawl by using this docker image (which contains all the
35
27
  necessary prerequisites):
36
28
 
37
- ```
38
- $ docker pull dannyben/snapcrawl
29
+ ```shell
30
+ $ alias snapcrawl="docker run --rm -it --volume $PWD:/app dannyben/snapcrawl"
39
31
  ```
40
32
 
41
- Then you can use it like this:
33
+ For more information on the Docker image, refer to the [docker-snapcrawl][3] repository.
42
34
 
43
- ```
44
- $ docker run --rm -it dannyben/snapcrawl --help
35
+ **Using Ruby**
36
+
37
+ ```shell
38
+ $ gem install snapcrawl
45
39
  ```
46
40
 
47
- For more information refer to the [docker-snapcrawl][3] repository.
41
+ Note that Snapcrawl requires [PhantomJS][1] and [ImageMagick][2].
48
42
 
43
+ ## Usage
49
44
 
50
- Install
51
- --------------------------------------------------
45
+ Snapcrawl can be configured either through a configuration file (YAML), or by specifying options in the command line.
52
46
 
47
+ ```shell
48
+ $ snapcrawl
49
+ Usage:
50
+ snapcrawl URL [--config FILE] [SETTINGS...]
51
+ snapcrawl -h | --help
52
+ snapcrawl -v | --version
53
53
  ```
54
- $ gem install snapcrawl
54
+
55
+ The default configuration filename is `snapcrawl.yml`.
56
+
57
+ Using the `--config` flag will create a template configuration file if it is not present:
58
+
59
+ ```shell
60
+ $ snapcrawl example.com --config snapcrawl
55
61
  ```
56
62
 
63
+ ### Specifying options in the command line
57
64
 
58
- Usage
59
- --------------------------------------------------
65
+ All configuration options can be specified in the command line as `key=value` pairs:
60
66
 
67
+ ```shell
68
+ $ snapcrawl example.com log_level=0 depth=2 width=1024
61
69
  ```
62
- $ snapcrawl --help
63
70
 
64
- Snapcrawl
71
+ ### Sample configuration file
65
72
 
66
- Usage:
67
- snapcrawl URL [options]
68
- snapcrawl -h | --help
69
- snapcrawl -v | --version
73
+ ```yaml
74
+ # All values below are the default values
75
+
76
+ # log level (0-4) 0=DEBUG 1=INFO 2=WARN 3=ERROR 4=FATAL
77
+ log_level: 1
70
78
 
71
- Options:
72
- -f, --folder PATH
73
- Where to save screenshots [default: snaps]
79
+ # log_color (yes, no, auto)
80
+ # yes = always show log color
81
+ # no = never use colors
82
+ # auto = only use colors when running in an interactive terminal
83
+ log_color: auto
74
84
 
75
- -n, --name TEMPLATE
76
- Filename template. Include the string '%{url}' anywhere in the name to
77
- use the captured URL in the filename [default: %{url}]
85
+ # number of levels to crawl, 0 means capture only the root URL
86
+ depth: 1
78
87
 
79
- -a, --age SECONDS
80
- Number of seconds to consider screenshots fresh [default: 86400]
88
+ # screenshot width in pixels
89
+ width: 1280
81
90
 
82
- -d, --depth LEVELS
83
- Number of levels to crawl [default: 1]
91
+ # screenshot height in pixels, 0 means the entire height
92
+ height: 0
84
93
 
85
- -W, --width PIXELS
86
- Screen width in pixels [default: 1280]
94
+ # number of seconds to consider the page cache and its screenshot fresh
95
+ cache_life: 86400
87
96
 
88
- -H, --height PIXELS
89
- Screen height in pixels. Use 0 to capture the full page [default: 0]
97
+ # where to store the HTML page cache
98
+ cache_dir: cache
90
99
 
91
- -s, --selector SELECTOR
92
- CSS selector to capture
100
+ # where to store screenshots
101
+ snaps_dir: snaps
93
102
 
94
- -o, --only REGEX
95
- Include only URLs that match REGEX
103
+ # screenshot filename template, where '%{url}' will be replaced with a
104
+ # slug version of the URL (no need to include the .png extension)
105
+ name_template: '%{url}'
96
106
 
97
- -h, --help
98
- Show this screen
107
+ # urls not matching this regular expression will be ignored
108
+ url_whitelist:
99
109
 
100
- -v, --version
101
- Show version number
110
+ # urls matching this regular expression will be ignored
111
+ url_blacklist:
102
112
 
103
- Examples:
104
- snapcrawl example.com
105
- snapcrawl example.com -d2 -fscreens
106
- snapcrawl example.com -d2 > out.txt 2> err.txt &
107
- snapcrawl example.com -W360 -H480
108
- snapcrawl example.com --selector "#main-content"
109
- snapcrawl example.com --only "products|collections"
110
- snapcrawl example.com --name "screenshot-%{url}"
111
- snapcrawl example.com --name "`date +%Y%m%d`_%{url}"
113
+ # take a screenshot of this CSS selector only
114
+ css_selector:
112
115
  ```
113
116
 
117
+ ## Contributing / Support
118
+ If you experience any issue, have a question or a suggestion, or if you wish
119
+ to contribute, feel free to [open an issue][issues].
120
+
114
121
  ---
115
122
 
116
123
  [1]: http://phantomjs.org/download.html
117
124
  [2]: https://imagemagick.org/script/download.php
118
125
  [3]: https://github.com/DannyBen/docker-snapcrawl
119
-
126
+ [issues]: https://github.com/DannyBen/snapcrawl/issues
120
127
 
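As described in the new README above, the screenshot filename comes from `name_template`, where `%{url}` is replaced with a slugified URL. A minimal sketch of that expansion using plain Ruby format strings; the `slug` helper, folder name, and URL are illustrative stand-ins for the gem's `String#to_slug` refinement and `snaps_dir` setting:

```ruby
# Illustrative sketch only: 'slug' stands in for the gem's String#to_slug
# refinement, and 'snaps' for the snaps_dir setting.
def slug(url)
  url.downcase.gsub(/[^a-z0-9]+/, '-')
end

name_template = 'screenshot-%{url}'   # e.g. from snapcrawl.yml
outfile = "snaps/#{name_template}.png" % { url: slug('http://example.com/about') }
puts outfile   # => snaps/screenshot-http-example-com-about.png
```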
data/bin/snapcrawl CHANGED
@@ -1,22 +1,30 @@
1
1
  #!/usr/bin/env ruby
2
2
 
3
3
  require 'snapcrawl'
4
+ require 'colsole'
5
+
4
6
  trap(:INT) { abort "\r\nGoodbye" }
7
+
5
8
  include Snapcrawl
9
+ include Colsole
6
10
 
7
11
  begin
8
- Crawler.instance.handle ARGV
12
+ CLI.new.call ARGV
13
+
9
14
  rescue MissingPhantomJS => e
10
15
  message = "Cannot find phantomjs executable in the path, please install it first."
11
16
  say! "\n\n!undred!#{e.class}!txtrst!\n#{message}"
12
17
  exit 2
18
+
13
19
  rescue MissingImageMagick=> e
14
20
  message = "Cannot find convert (ImageMagick) executable in the path, please install it first."
15
21
  say! "\n\n!undred!#{e.class}!txtrst!\n#{message}"
16
22
  exit 3
23
+
17
24
  rescue => e
18
25
  puts e.backtrace.reverse if ENV['DEBUG']
19
- say! "\n\n!undred!#{e.class}!txtrst!\n#{e.message}"
26
+ say! "\n!undred!#{e.class}!txtrst!\n#{e.message}"
20
27
  exit 1
28
+
21
29
  end
22
30
 
data/lib/snapcrawl.rb CHANGED
@@ -1,6 +1,20 @@
1
1
  require 'snapcrawl/version'
2
2
  require 'snapcrawl/exceptions'
3
+ require 'snapcrawl/refinements/pair_split'
4
+ require 'snapcrawl/refinements/string_refinements'
5
+ require 'snapcrawl/log_helpers'
6
+ require 'snapcrawl/pretty_logger'
7
+ require 'snapcrawl/dependencies'
8
+ require 'snapcrawl/config'
9
+ require 'snapcrawl/screenshot'
10
+ require 'snapcrawl/page'
3
11
  require 'snapcrawl/crawler'
12
+ require 'snapcrawl/cli'
4
13
 
5
- require 'byebug' if ENV['BYEBUG']
14
+ if ENV['BYEBUG']
15
+ require 'byebug'
16
+ require 'lp'
17
+ end
6
18
 
19
+ Snapcrawl::Config.load
20
+ $logger = Snapcrawl::PrettyLogger.new
data/lib/snapcrawl/cli.rb ADDED
@@ -0,0 +1,55 @@
1
+ require 'colsole'
2
+ require 'docopt'
3
+ require 'fileutils'
4
+
5
+ module Snapcrawl
6
+ class CLI
7
+ include Colsole
8
+ using StringRefinements
9
+ using PairSplit
10
+
11
+ def call(args = [])
12
+ begin
13
+ execute Docopt::docopt(docopt, version: VERSION, argv: args)
14
+ rescue Docopt::Exit => e
15
+ puts e.message
16
+ end
17
+ end
18
+
19
+ private
20
+
21
+ def execute(args)
22
+ status = Config.load args['--config']
23
+ $logger.debug 'config file created' if status == :created
24
+
25
+ tweaks = args['SETTINGS'].pair_split
26
+ apply_tweaks tweaks if tweaks
27
+
28
+ Dependencies.verify
29
+
30
+ $logger.debug 'initializing cli'
31
+ FileUtils.mkdir_p Config.snaps_dir
32
+
33
+ url = args['URL'].protocolize
34
+ crawler = Crawler.new url
35
+
36
+ crawler.crawl
37
+ end
38
+
39
+ def docopt
40
+ @doc ||= File.read docopt_path
41
+ end
42
+
43
+ def docopt_path
44
+ File.expand_path "templates/docopt.txt", __dir__
45
+ end
46
+
47
+ def apply_tweaks(tweaks)
48
+ tweaks.each do |key, value|
49
+ Config.settings[key] = value
50
+ $logger.level = value if key == 'log_level'
51
+ end
52
+ end
53
+
54
+ end
55
+ end
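Since `bin/snapcrawl` (above) now just delegates to this class with `CLI.new.call ARGV`, the same entry point can be driven from Ruby. A minimal sketch; the URL, config name, and setting are illustrative, and the argument order mirrors the usage pattern in the docopt template further down:

```ruby
require 'snapcrawl'

# Roughly equivalent to: snapcrawl example.com --config mysite depth=2
Snapcrawl::CLI.new.call ['example.com', '--config', 'mysite', 'depth=2']
```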
data/lib/snapcrawl/config.rb ADDED
@@ -0,0 +1,54 @@
1
+ require 'sting'
2
+ require 'fileutils'
3
+
4
+ module Snapcrawl
5
+ class Config < Sting
6
+ class << self
7
+ def load(file = nil)
8
+ reset!
9
+ push defaults
10
+
11
+ return unless file
12
+
13
+ file = "#{file}.yml" unless file =~ /\.ya?ml$/
14
+
15
+ if File.exist? file
16
+ push file
17
+ else
18
+ create_config file
19
+ end
20
+ end
21
+
22
+ private
23
+
24
+ def defaults
25
+ {
26
+ depth: 1,
27
+ width: 1280,
28
+ height: 0,
29
+ cache_life: 86400,
30
+ cache_dir: 'cache',
31
+ snaps_dir: 'snaps',
32
+ name_template: '%{url}',
33
+ url_whitelist: nil,
34
+ css_selector: nil,
35
+ log_level: 1,
36
+ log_color: 'auto',
37
+ }
38
+ end
39
+
40
+ def create_config(file)
41
+ $logger.debug "creating config file %{green}#{file}%{reset}"
42
+ content = File.read config_template
43
+ dir = File.dirname file
44
+ FileUtils.mkdir_p dir
45
+ File.write file, content
46
+ end
47
+
48
+ def config_template
49
+ File.expand_path 'templates/config.yml', __dir__
50
+ end
51
+
52
+ end
53
+ end
54
+ end
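`Config` is a thin subclass of the Sting gem's settings object. A minimal sketch of direct use, based only on calls that appear elsewhere in this diff; the file name and override value are illustrative:

```ruby
require 'snapcrawl'

Snapcrawl::Config.load 'snapcrawl'        # reads snapcrawl.yml, or creates it from the bundled template
Snapcrawl::Config.depth                   # => 1, assuming the defaults
Snapcrawl::Config.settings['depth'] = 2   # runtime override, as CLI#apply_tweaks does
```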
data/lib/snapcrawl/crawler.rb CHANGED
@@ -1,267 +1,93 @@
1
- require 'colsole'
2
- require 'docopt'
3
1
  require 'fileutils'
4
- require 'httparty'
5
- require 'nokogiri'
6
- require 'ostruct'
7
- require 'pstore'
8
- require 'addressable/uri'
9
- require 'webshot'
10
2
 
11
3
  module Snapcrawl
12
- include Colsole
13
-
14
4
  class Crawler
15
- include Singleton
16
-
17
- def initialize
18
- @storefile = "snapcrawl.pstore"
19
- @store = PStore.new(@storefile)
20
- end
5
+ using StringRefinements
21
6
 
22
- def handle(args)
23
- @done = []
24
- begin
25
- execute Docopt::docopt(doc, version: VERSION, argv: args)
26
- rescue Docopt::Exit => e
27
- puts e.message
28
- end
29
- end
7
+ attr_reader :url
30
8
 
31
- def execute(args)
32
- raise MissingPhantomJS unless command_exist? "phantomjs"
33
- raise MissingImageMagick unless command_exist? "convert"
34
- crawl args['URL'].dup, opts_from_args(args)
9
+ def initialize(url)
10
+ $logger.debug "initializing crawler with %{green}#{url}%{reset}"
11
+
12
+ config_for_display = Config.settings.dup
13
+ config_for_display['name_template'] = '%%{url}'
14
+
15
+ $logger.debug "config #{config_for_display}"
16
+ @url = url
35
17
  end
36
18
 
37
- def clear_cache
38
- FileUtils.rm @storefile if File.exist? @storefile
19
+ def crawl
20
+ Dependencies.verify
21
+ todo[url] = Page.new url
22
+ process_todo while todo.any?
39
23
  end
40
24
 
41
25
  private
42
26
 
43
- def crawl(url, opts={})
44
- url = protocolize url
45
- defaults = {
46
- width: 1280,
47
- height: 0,
48
- depth: 1,
49
- age: 86400,
50
- folder: 'snaps',
51
- name: '%{url}',
52
- base: url,
53
- }
54
- urls = [url]
55
-
56
- @opts = OpenStruct.new defaults.merge(opts)
27
+ def process_todo
28
+ $logger.debug "processing queue: %{green}#{todo.count} remaining%{reset}"
57
29
 
58
- make_screenshot_dir @opts.folder
30
+ url, page = todo.shift
31
+ done.push url
59
32
 
60
- @opts.depth.times do
61
- urls = crawl_and_snap urls
33
+ if process_page page
34
+ register_sub_pages page.pages if page.depth < Config.depth
62
35
  end
63
36
  end
64
37
 
65
- def crawl_and_snap(urls)
66
- new_urls = []
67
- urls.each do |url|
68
- next if @done.include? url
69
- @done << url
70
- say "\n!txtgrn!-----> Visit: #{url}"
71
- if @opts.only and url !~ /#{@opts.only}/
72
- say " Snap: Skipping. Does not match regex"
73
- else
74
- snap url
38
+ def register_sub_pages(pages)
39
+ pages.each do |sub_page|
40
+ next if todo.has_key?(sub_page) or done.include?(sub_page)
41
+
42
+ if Config.url_whitelist and sub_page.path !~ /#{Config.url_whitelist}/
43
+ $logger.debug "ignoring %{purple}%{underlined}#{sub_page.url}%{reset}, reason: whitelist"
44
+ next
75
45
  end
76
- new_urls += extract_urls_from url
77
- end
78
- new_urls
79
- end
80
46
 
81
- # Take a screenshot of a URL, unless we already did so recently
82
- def snap(url)
83
- file = image_path_for(url)
84
- if file_fresh? file
85
- say " Snap: Skipping. File exists and seems fresh"
86
- else
87
- snap!(url)
88
- end
89
- end
90
-
91
- # Take a screenshot of the URL, even if file exists
92
- def snap!(url)
93
- say " !txtblu!Snap!!txtrst! Snapping picture... "
94
- image_path = image_path_for url
47
+ if Config.url_blacklist and sub_page.path =~ /#{Config.url_blacklist}/
48
+ $logger.debug "ignoring %{purple}%{underlined}#{sub_page.url}%{reset}, reason: blacklist"
49
+ next
50
+ end
95
51
 
96
- fetch_opts = { allowed_status_codes: [404, 401, 403] }
97
- if @opts.selector
98
- fetch_opts[:selector] = @opts.selector
99
- fetch_opts[:full] = false
52
+ todo[sub_page.url] = sub_page
100
53
  end
101
-
102
- webshot_capture url, image_path, fetch_opts
103
- say "done"
104
54
  end
105
55
 
106
- def webshot_capture(url, image_path, fetch_opts)
107
- webshot_capture! url, image_path, fetch_opts
108
- rescue => e
109
- say "!txtred!FAILED"
110
- say "!txtred! ! #{e.class}: #{e.message.strip}"
111
- end
56
+ def process_page(page)
57
+ outfile = "#{Config.snaps_dir}/#{Config.name_template}.png" % { url: page.url.to_slug }
112
58
 
113
- def webshot_capture!(url, image_path, fetch_opts)
114
- hide_output do
115
- webshot.capture url, image_path, fetch_opts do |magick|
116
- magick.combine_options do |c|
117
- c.background "white"
118
- c.gravity 'north'
119
- c.quality 100
120
- c.extent @opts.height > 0 ? "#{@opts.width}x#{@opts.height}" : "#{@opts.width}x"
121
- end
122
- end
123
- end
124
- end
59
+ $logger.info "processing %{purple}%{underlined}#{page.url}%{reset}, depth: #{page.depth}"
125
60
 
126
- def extract_urls_from(url)
127
- cached = nil
128
- @store.transaction { cached = @store[url] }
129
- if cached
130
- say " Crawl: Page was cached. Reading subsequent URLs from cache"
131
- return cached
132
- else
133
- return extract_urls_from! url
61
+ if !page.valid?
62
+ $logger.debug "page #{page.path} is invalid, aborting process"
63
+ return false
134
64
  end
135
- end
136
-
137
- def extract_urls_from!(url)
138
- say " !txtblu!Crawl!!txtrst! Extracting links... "
139
65
 
140
- begin
141
- response = HTTParty.get url
142
- if response.success?
143
- doc = Nokogiri::HTML response.body
144
- links = doc.css('a')
145
- links, warnings = normalize_links links
146
- @store.transaction { @store[url] = links }
147
- say "done"
148
- warnings.each do |warning|
149
- say "!txtylw! Warn: #{warning[:link]}"
150
- say word_wrap " #{warning[:message]}"
151
- end
152
- else
153
- links = []
154
- say "!txtred!FAILED"
155
- say "!txtred! ! HTTP Error: #{response.code} #{response.message.strip} at #{url}"
156
- end
66
+ if file_fresh? outfile
67
+ $logger.info "screenshot for #{page.path} already exists"
68
+ else
69
+ $logger.info "%{bold}capturing screenshot for #{page.path}%{reset}"
70
+ page.save_screenshot outfile
157
71
  end
158
- links
159
- end
160
-
161
- # mkdir the screenshots folder, if needed
162
- def make_screenshot_dir(dir)
163
- Dir.exist? dir or FileUtils.mkdir_p dir
164
- end
165
72
 
166
- # Convert any string to a proper handle
167
- def handelize(str)
168
- str.downcase.gsub(/[^a-z0-9]+/, '-')
73
+ true
169
74
  end
170
75
 
171
- # Return proper image path for a UR
172
- def image_path_for(url)
173
- "#{@opts.folder}/#{@opts.name}.png" % { url: handelize(url) }
174
- end
175
-
176
- # Add protocol to a URL if neeed
177
- def protocolize(url)
178
- url =~ /^http/ ? url : "http://#{url}"
179
- end
180
-
181
- # Return true if the file exists and is not too old
182
76
  def file_fresh?(file)
183
- @opts.age > 0 and File.exist?(file) and file_age(file) < @opts.age
77
+ Config.cache_life > 0 and File.exist?(file) and file_age(file) < Config.cache_life
184
78
  end
185
79
 
186
- # Return file age in seconds
187
80
  def file_age(file)
188
81
  (Time.now - File.stat(file).mtime).to_i
189
82
  end
190
83
 
191
- # Process an array of links and return a better one
192
- def normalize_links(links)
193
- extensions = "png|gif|jpg|pdf|zip"
194
- beginnings = "mailto|tel"
195
-
196
- links_array = []
197
- warnings = []
198
-
199
- links.each do |link|
200
- link = link.attribute('href').to_s.dup
201
-
202
- # Remove #hash
203
- link.gsub!(/#.+$/, '')
204
- next if link.empty?
205
-
206
- # Remove links to specific extensions and protocols
207
- next if link =~ /\.(#{extensions})(\?.*)?$/
208
- next if link =~ /^(#{beginnings})/
209
-
210
- # Strip spaces
211
- link.strip!
212
-
213
- # Convert relative links to absolute
214
- begin
215
- link = Addressable::URI.join( @opts.base, link ).to_s.dup
216
- rescue => e
217
- warnings << { link: link, message: "#{e.class} #{e.message}" }
218
- next
219
- end
220
-
221
- # Keep only links in our base domain
222
- next unless link.include? @opts.base
223
-
224
- links_array << link
225
- end
226
-
227
- [links_array.uniq, warnings]
228
- end
229
-
230
- def doc
231
- @doc ||= File.read docopt
84
+ def todo
85
+ @todo ||= {}
232
86
  end
233
87
 
234
- def docopt
235
- File.expand_path "docopt.txt", __dir__
88
+ def done
89
+ @done ||= []
236
90
  end
237
91
 
238
- def opts_from_args(args)
239
- opts = {}
240
- %w[folder name selector only].each do |opt|
241
- opts[opt.to_sym] = args["--#{opt}"] if args["--#{opt}"]
242
- end
243
-
244
- %w[age depth width height].each do |opt|
245
- opts[opt.to_sym] = args["--#{opt}"].to_i if args["--#{opt}"]
246
- end
247
-
248
- opts
249
- end
250
-
251
- def webshot
252
- @webshot ||= Webshot::Screenshot.instance
253
- end
254
-
255
- # The webshot gem messes with stdout/stderr streams so we keep it in
256
- # check by using this method. Also, in some sites (e.g. uown.co) it
257
- # prints some output to stdout, this is why we override $stdout for
258
- # the duration of the run.
259
- def hide_output
260
- keep_stdout, keep_stderr = $stdout, $stderr
261
- $stdout, $stderr = StringIO.new, StringIO.new
262
- yield
263
- ensure
264
- $stdout, $stderr = keep_stdout, keep_stderr
265
- end
266
92
  end
267
93
  end
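The crawler no longer parses options or checks dependencies on its own; it expects configuration to be loaded and a URL that already carries a protocol. A minimal sketch mirroring what `CLI#execute` does (the URL is illustrative):

```ruby
require 'fileutils'
require 'snapcrawl'   # loads the default config and sets up $logger

FileUtils.mkdir_p Snapcrawl::Config.snaps_dir        # ensure the screenshots folder exists
Snapcrawl::Crawler.new('http://example.com').crawl   # URL must already include the protocol
```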
data/lib/snapcrawl/dependencies.rb ADDED
@@ -0,0 +1,21 @@
1
+ require 'colsole'
2
+
3
+ module Snapcrawl
4
+ class Dependencies
5
+ class << self
6
+ include Colsole
7
+
8
+ def verify
9
+ return if @verified
10
+
11
+ $logger.debug 'verifying %{green}phantomjs%{reset} is present'
12
+ raise MissingPhantomJS unless command_exist? "phantomjs"
13
+
14
+ $logger.debug 'verifying %{green}imagemagick%{reset} is present'
15
+ raise MissingImageMagick unless command_exist? "convert"
16
+
17
+ @verified = true
18
+ end
19
+ end
20
+ end
21
+ end
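The check is memoized and raises the exceptions that `bin/snapcrawl` rescues. A minimal sketch of calling it directly:

```ruby
require 'snapcrawl'

begin
  Snapcrawl::Dependencies.verify   # looks for phantomjs and ImageMagick's convert in the PATH
rescue Snapcrawl::MissingPhantomJS, Snapcrawl::MissingImageMagick => e
  warn "missing dependency: #{e.class}"
end
```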
data/lib/snapcrawl/exceptions.rb CHANGED
@@ -1,4 +1,5 @@
1
1
  module Snapcrawl
2
2
  class MissingPhantomJS < StandardError; end
3
3
  class MissingImageMagick < StandardError; end
4
+ class ScreenshotError < StandardError; end
4
5
  end
data/lib/snapcrawl/log_helpers.rb ADDED
@@ -0,0 +1,57 @@
1
+ module Snapcrawl
2
+ module LogHelpers
3
+ SEVERITY_COLORS = {
4
+ 'INFO' => :blue,
5
+ 'WARN' => :yellow,
6
+ 'ERROR' => :red,
7
+ 'FATAL' => :red,
8
+ 'DEBUG' => :cyan
9
+ }
10
+
11
+ def log_formatter
12
+ proc do |severity, _time, _prog, message|
13
+ severity_color = SEVERITY_COLORS[severity]
14
+
15
+ "%{#{severity_color}}#{severity.rjust 5}%{reset} : #{message}\n" % log_colors
16
+ end
17
+ end
18
+
19
+ def log_colors
20
+ @log_colors ||= log_colors!
21
+ end
22
+
23
+ def log_colors!
24
+ colors? ? actual_colors : empty_colors
25
+ end
26
+
27
+ def actual_colors
28
+ {
29
+ red: "\e[31m", green: "\e[32m", yellow: "\e[33m",
30
+ blue: "\e[34m", purple: "\e[35m", cyan: "\e[36m",
31
+ underlined: "\e[4m", bold: "\e[1m",
32
+ none: "", reset: "\e[0m"
33
+ }
34
+ end
35
+
36
+ def empty_colors
37
+ {
38
+ red: "", green: "", yellow: "",
39
+ blue: "", purple: "", cyan: "",
40
+ underlined: "", bold: "",
41
+ none: "", reset: ""
42
+ }
43
+ end
44
+
45
+ def colors?
46
+ if Config.log_color == 'auto'
47
+ tty?
48
+ else
49
+ Config.log_color
50
+ end
51
+ end
52
+
53
+ def tty?
54
+ ENV['TTY'] == 'on' ? true : ENV['TTY'] == 'off' ? false : $stdout.tty?
55
+ end
56
+ end
57
+ end
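Log messages throughout this release embed `%{color}` tokens which the formatter expands with Ruby format strings against this color map. A minimal sketch of the mechanism, using a subset of `actual_colors`:

```ruby
colors  = { green: "\e[32m", reset: "\e[0m" }   # subset of actual_colors
message = 'config file %{green}snapcrawl.yml%{reset} created'
puts message % colors   # prints the filename in green on an ANSI terminal
```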
data/lib/snapcrawl/page.rb ADDED
@@ -0,0 +1,111 @@
1
+ require 'addressable/uri'
2
+ require 'fileutils'
3
+ require 'httparty'
4
+ require 'lightly'
5
+ require 'nokogiri'
6
+
7
+ module Snapcrawl
8
+ class Page
9
+ using StringRefinements
10
+
11
+ attr_reader :url, :depth
12
+
13
+ EXTENSION_BLACKLIST = "png|gif|jpg|pdf|zip"
14
+ PROTOCOL_BLACKLIST = "mailto|tel"
15
+
16
+ def initialize(url, depth: 0)
17
+ @url, @depth = url.protocolize, depth
18
+ end
19
+
20
+ def valid?
21
+ http_response&.success?
22
+ end
23
+
24
+ def site
25
+ @site ||= Addressable::URI.parse(url).site
26
+ end
27
+
28
+ def path
29
+ @path ||= Addressable::URI.parse(url).request_uri
30
+ end
31
+
32
+ def links
33
+ return nil unless valid?
34
+ doc = Nokogiri::HTML http_response.body
35
+ normalize_links doc.css('a')
36
+ end
37
+
38
+ def pages
39
+ return nil unless valid?
40
+ links.map { |link| Page.new link, depth: depth+1 }
41
+ end
42
+
43
+ def save_screenshot(outfile)
44
+ return false unless valid?
45
+ Screenshot.new(url).save "#{outfile}"
46
+ end
47
+
48
+ private
49
+
50
+ def http_response
51
+ @http_response ||= http_response!
52
+ end
53
+
54
+ def http_response!
55
+ response = cache.get(url) { HTTParty.get url }
56
+
57
+ if !response.success?
58
+ $logger.warn "http error on %{purple}%{underlined}#{url}%{reset}, code: %{yellow}#{response.code}%{reset}, message: #{response.message.strip}"
59
+ end
60
+
61
+ response
62
+
63
+ rescue => e
64
+ $logger.error "http error on %{purple}%{underlined}#{url}%{reset} - %{red}#{e.class}%{reset}: #{e.message}"
65
+ nil
66
+
67
+ end
68
+
69
+ def normalize_links(links)
70
+ result = []
71
+
72
+ links.each do |link|
73
+ valid_link = normalize_link link
74
+ result << valid_link if valid_link
75
+ end
76
+
77
+ result.uniq
78
+ end
79
+
80
+ def normalize_link(link)
81
+ link = link.attribute('href').to_s.dup
82
+
83
+ # Remove #hash
84
+ link.gsub!(/#.+$/, '')
85
+ return nil if link.empty?
86
+
87
+ # Remove links to specific extensions and protocols
88
+ return nil if link =~ /\.(#{EXTENSION_BLACKLIST})(\?.*)?$/
89
+ return nil if link =~ /^(#{PROTOCOL_BLACKLIST}):/
90
+
91
+ # Strip spaces
92
+ link.strip!
93
+
94
+ # Convert relative links to absolute
95
+ begin
96
+ link = Addressable::URI.join(url, link).to_s.dup
97
+ rescue => e
98
+ $logger.warn "%{red}#{e.class}%{reset}: #{e.message} on #{path} (link: #{link})"
99
+ return nil
100
+ end
101
+
102
+ # Keep only links in our base domain
103
+ return nil unless link.include? site
104
+ link
105
+ end
106
+
107
+ def cache
108
+ Lightly.new life: Config.cache_life
109
+ end
110
+ end
111
+ end
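A minimal sketch of the new `Page` object, based on its public interface above; the URL and output filename are illustrative:

```ruby
require 'snapcrawl'

page = Snapcrawl::Page.new 'example.com'   # protocolized to http://example.com
if page.valid?                             # true when the (Lightly-cached) HTTP fetch succeeds
  page.links                               # absolute, same-site links found on the page
  page.pages                               # the same links wrapped as Page objects at depth + 1
  page.save_screenshot 'example.png'
end
```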
data/lib/snapcrawl/pretty_logger.rb ADDED
@@ -0,0 +1,11 @@
1
+ require 'logger'
2
+
3
+ module Snapcrawl
4
+ class PrettyLogger
5
+ extend LogHelpers
6
+
7
+ def self.new
8
+ Logger.new(STDOUT, formatter: log_formatter, level: Config.log_level)
9
+ end
10
+ end
11
+ end
data/lib/snapcrawl/refinements/pair_split.rb ADDED
@@ -0,0 +1,23 @@
1
+ module Snapcrawl
2
+ module PairSplit
3
+ refine Array do
4
+ def pair_split
5
+ map do |pair|
6
+ key, value = pair.split '='
7
+
8
+ value = if value =~ /^\d+$/
9
+ value.to_i
10
+ elsif ['no', 'false'].include? value
11
+ false
12
+ elsif ['yes', 'true'].include? value
13
+ true
14
+ else
15
+ value
16
+ end
17
+
18
+ [key, value]
19
+ end.to_h
20
+ end
21
+ end
22
+ end
23
+ end
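This refinement implements the `key=value` settings syntax documented in the README and the docopt template. A minimal sketch of what it returns (the settings are illustrative):

```ruby
require 'snapcrawl'
using Snapcrawl::PairSplit

['depth=2', 'log_color=no', 'name_template=shot-%{url}'].pair_split
# => { "depth" => 2, "log_color" => false, "name_template" => "shot-%{url}" }
```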
data/lib/snapcrawl/refinements/string_refinements.rb ADDED
@@ -0,0 +1,13 @@
1
+ module Snapcrawl
2
+ module StringRefinements
3
+ refine String do
4
+ def to_slug
5
+ downcase.gsub(/[^a-z0-9]+/, '-')
6
+ end
7
+
8
+ def protocolize
9
+ self =~ /^http/ ? self : "http://#{self}"
10
+ end
11
+ end
12
+ end
13
+ end
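A minimal sketch of the two string refinements, which back the `%{url}` slugs and the protocol handling seen earlier (the inputs are illustrative):

```ruby
require 'snapcrawl'
using Snapcrawl::StringRefinements

'example.com'.protocolize              # => "http://example.com"
'http://Example.com/About Us'.to_slug  # => "http-example-com-about-us"
```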
data/lib/snapcrawl/screenshot.rb ADDED
@@ -0,0 +1,62 @@
1
+ require 'webshot'
2
+
3
+ module Snapcrawl
4
+ class Screenshot
5
+ using StringRefinements
6
+
7
+ attr_reader :url
8
+
9
+ def initialize(url)
10
+ @url = url
11
+ end
12
+
13
+ def save(outfile = nil)
14
+ outfile ||= "#{url.to_slug}.png"
15
+
16
+ fetch_opts = { allowed_status_codes: [404, 401, 403] }
17
+ if Config.selector
18
+ fetch_opts[:selector] = Config.selector
19
+ fetch_opts[:full] = false
20
+ end
21
+
22
+ webshot_capture url, outfile, fetch_opts
23
+ end
24
+
25
+ private
26
+
27
+ def webshot_capture(url, image_path, fetch_opts)
28
+ webshot_capture! url, image_path, fetch_opts
29
+ rescue => e
30
+ raise ScreenshotError, "#{e.class} #{e.message}"
31
+ end
32
+
33
+ def webshot_capture!(url, image_path, fetch_opts)
34
+ hide_output do
35
+ webshot.capture url, image_path, fetch_opts do |magick|
36
+ magick.combine_options do |c|
37
+ c.background "white"
38
+ c.gravity 'north'
39
+ c.quality 100
40
+ c.extent Config.height > 0 ? "#{Config.width}x#{Config.height}" : "#{Config.width}x"
41
+ end
42
+ end
43
+ end
44
+ end
45
+
46
+ def webshot
47
+ @webshot ||= Webshot::Screenshot.instance
48
+ end
49
+
50
+ # The webshot gem messes with stdout/stderr streams so we keep it in
51
+ # check by using this method. Also, in some sites (e.g. uown.co) it
52
+ # prints some output to stdout, this is why we override $stdout for
53
+ # the duration of the run.
54
+ def hide_output
55
+ keep_stdout, keep_stderr = $stdout, $stderr
56
+ $stdout, $stderr = StringIO.new, StringIO.new
57
+ yield
58
+ ensure
59
+ $stdout, $stderr = keep_stdout, keep_stderr
60
+ end
61
+ end
62
+ end
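A minimal sketch of the extracted `Screenshot` class; PhantomJS and ImageMagick must be installed for the capture to actually run, and the URL and filename are illustrative:

```ruby
require 'snapcrawl'

shot = Snapcrawl::Screenshot.new 'http://example.com'
shot.save 'example.png'   # defaults to '<url-slug>.png' when no filename is given
```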
data/lib/snapcrawl/templates/config.yml ADDED
@@ -0,0 +1,41 @@
1
+ # All values below are the default values
2
+
3
+ # log level (0-4) 0=DEBUG 1=INFO 2=WARN 3=ERROR 4=FATAL
4
+ log_level: 1
5
+
6
+ # log_color (yes, no, auto)
7
+ # yes = always show log color
8
+ # no = never use colors
9
+ # auto = only use colors when running in an interactive terminal
10
+ log_color: auto
11
+
12
+ # number of levels to crawl, 0 means capture only the root URL
13
+ depth: 1
14
+
15
+ # screenshot width in pixels
16
+ width: 1280
17
+
18
+ # screenshot height in pixels, 0 means the entire height
19
+ height: 0
20
+
21
+ # number of seconds to consider the page cache and its screenshot fresh
22
+ cache_life: 86400
23
+
24
+ # where to store the HTML page cache
25
+ cache_dir: cache
26
+
27
+ # where to store screenshots
28
+ snaps_dir: snaps
29
+
30
+ # screenshot filename template, where '%{url}' will be replaced with a
31
+ # slug version of the URL (no need to include the .png extension)
32
+ name_template: '%{url}'
33
+
34
+ # urls not matching this regular expression will be ignored
35
+ url_whitelist:
36
+
37
+ # urls matching this regular expression will be ignored
38
+ url_blacklist:
39
+
40
+ # take a screenshot of this CSS selector only
41
+ css_selector:
data/lib/snapcrawl/templates/docopt.txt ADDED
@@ -0,0 +1,26 @@
1
+ Snapcrawl
2
+
3
+ Usage:
4
+ snapcrawl URL [--config FILE] [SETTINGS...]
5
+ snapcrawl -h | --help
6
+ snapcrawl -v | --version
7
+
8
+ Options:
9
+ -c, --config FILE
10
+ Path to config file, with or without the .yml extension
11
+ A sample file will be created if not found
12
+ [default: snapcrawl.yml]
13
+
14
+ -h, --help
15
+ Show this screen
16
+
17
+ -v, --version
18
+ Show version number
19
+
20
+ Settings:
21
+ You may provide any of the options available in the config as 'key=value'.
22
+
23
+ Examples:
24
+ snapcrawl example.com
25
+ snapcrawl example.com --config simple
26
+ snapcrawl example.com depth=1 log_level=2
data/lib/snapcrawl/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module Snapcrawl
2
- VERSION = "0.4.4"
2
+ VERSION = "0.5.0.rc1"
3
3
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: snapcrawl
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.4.4
4
+ version: 0.5.0.rc1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Danny Ben Shitrit
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2020-03-12 00:00:00.000000000 Z
11
+ date: 2020-03-14 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: colsole
@@ -16,48 +16,42 @@ dependencies:
16
16
  requirements:
17
17
  - - "~>"
18
18
  - !ruby/object:Gem::Version
19
- version: '0.5'
20
- - - ">="
21
- - !ruby/object:Gem::Version
22
- version: 0.5.4
19
+ version: '0.7'
23
20
  type: :runtime
24
21
  prerelease: false
25
22
  version_requirements: !ruby/object:Gem::Requirement
26
23
  requirements:
27
24
  - - "~>"
28
25
  - !ruby/object:Gem::Version
29
- version: '0.5'
30
- - - ">="
31
- - !ruby/object:Gem::Version
32
- version: 0.5.4
26
+ version: '0.7'
33
27
  - !ruby/object:Gem::Dependency
34
28
  name: docopt
35
29
  requirement: !ruby/object:Gem::Requirement
36
30
  requirements:
37
31
  - - "~>"
38
32
  - !ruby/object:Gem::Version
39
- version: '0.5'
33
+ version: '0.6'
40
34
  type: :runtime
41
35
  prerelease: false
42
36
  version_requirements: !ruby/object:Gem::Requirement
43
37
  requirements:
44
38
  - - "~>"
45
39
  - !ruby/object:Gem::Version
46
- version: '0.5'
40
+ version: '0.6'
47
41
  - !ruby/object:Gem::Dependency
48
42
  name: nokogiri
49
43
  requirement: !ruby/object:Gem::Requirement
50
44
  requirements:
51
45
  - - "~>"
52
46
  - !ruby/object:Gem::Version
53
- version: '1.6'
47
+ version: '1.10'
54
48
  type: :runtime
55
49
  prerelease: false
56
50
  version_requirements: !ruby/object:Gem::Requirement
57
51
  requirements:
58
52
  - - "~>"
59
53
  - !ruby/object:Gem::Version
60
- version: '1.6'
54
+ version: '1.10'
61
55
  - !ruby/object:Gem::Dependency
62
56
  name: webshot
63
57
  requirement: !ruby/object:Gem::Requirement
@@ -78,14 +72,14 @@ dependencies:
78
72
  requirements:
79
73
  - - "~>"
80
74
  - !ruby/object:Gem::Version
81
- version: '0.17'
75
+ version: '0.18'
82
76
  type: :runtime
83
77
  prerelease: false
84
78
  version_requirements: !ruby/object:Gem::Requirement
85
79
  requirements:
86
80
  - - "~>"
87
81
  - !ruby/object:Gem::Version
88
- version: '0.17'
82
+ version: '0.18'
89
83
  - !ruby/object:Gem::Dependency
90
84
  name: addressable
91
85
  requirement: !ruby/object:Gem::Requirement
@@ -100,6 +94,34 @@ dependencies:
100
94
  - - "~>"
101
95
  - !ruby/object:Gem::Version
102
96
  version: '2.7'
97
+ - !ruby/object:Gem::Dependency
98
+ name: lightly
99
+ requirement: !ruby/object:Gem::Requirement
100
+ requirements:
101
+ - - "~>"
102
+ - !ruby/object:Gem::Version
103
+ version: '0.3'
104
+ type: :runtime
105
+ prerelease: false
106
+ version_requirements: !ruby/object:Gem::Requirement
107
+ requirements:
108
+ - - "~>"
109
+ - !ruby/object:Gem::Version
110
+ version: '0.3'
111
+ - !ruby/object:Gem::Dependency
112
+ name: sting
113
+ requirement: !ruby/object:Gem::Requirement
114
+ requirements:
115
+ - - "~>"
116
+ - !ruby/object:Gem::Version
117
+ version: '0.4'
118
+ type: :runtime
119
+ prerelease: false
120
+ version_requirements: !ruby/object:Gem::Requirement
121
+ requirements:
122
+ - - "~>"
123
+ - !ruby/object:Gem::Version
124
+ version: '0.4'
103
125
  description: Snapcrawl is a command line utility for crawling a website and saving
104
126
  screenshots.
105
127
  email: db@dannyben.com
@@ -111,9 +133,19 @@ files:
111
133
  - README.md
112
134
  - bin/snapcrawl
113
135
  - lib/snapcrawl.rb
136
+ - lib/snapcrawl/cli.rb
137
+ - lib/snapcrawl/config.rb
114
138
  - lib/snapcrawl/crawler.rb
115
- - lib/snapcrawl/docopt.txt
139
+ - lib/snapcrawl/dependencies.rb
116
140
  - lib/snapcrawl/exceptions.rb
141
+ - lib/snapcrawl/log_helpers.rb
142
+ - lib/snapcrawl/page.rb
143
+ - lib/snapcrawl/pretty_logger.rb
144
+ - lib/snapcrawl/refinements/pair_split.rb
145
+ - lib/snapcrawl/refinements/string_refinements.rb
146
+ - lib/snapcrawl/screenshot.rb
147
+ - lib/snapcrawl/templates/config.yml
148
+ - lib/snapcrawl/templates/docopt.txt
117
149
  - lib/snapcrawl/version.rb
118
150
  homepage: https://github.com/DannyBen/snapcrawl
119
151
  licenses:
@@ -130,9 +162,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
130
162
  version: '2.3'
131
163
  required_rubygems_version: !ruby/object:Gem::Requirement
132
164
  requirements:
133
- - - ">="
165
+ - - ">"
134
166
  - !ruby/object:Gem::Version
135
- version: '0'
167
+ version: 1.3.1
136
168
  requirements: []
137
169
  rubygems_version: 3.0.3
138
170
  signing_key:
data/lib/snapcrawl/docopt.txt DELETED
@@ -1,48 +0,0 @@
1
- Snapcrawl
2
-
3
- Usage:
4
- snapcrawl URL [options]
5
- snapcrawl -h | --help
6
- snapcrawl -v | --version
7
-
8
- Options:
9
- -f, --folder PATH
10
- Where to save screenshots [default: snaps]
11
-
12
- -n, --name TEMPLATE
13
- Filename template. Include the string '%{url}' anywhere in the name to
14
- use the captured URL in the filename [default: %{url}]
15
-
16
- -a, --age SECONDS
17
- Number of seconds to consider screenshots fresh [default: 86400]
18
-
19
- -d, --depth LEVELS
20
- Number of levels to crawl [default: 1]
21
-
22
- -W, --width PIXELS
23
- Screen width in pixels [default: 1280]
24
-
25
- -H, --height PIXELS
26
- Screen height in pixels. Use 0 to capture the full page [default: 0]
27
-
28
- -s, --selector SELECTOR
29
- CSS selector to capture
30
-
31
- -o, --only REGEX
32
- Include only URLs that match REGEX
33
-
34
- -h, --help
35
- Show this screen
36
-
37
- -v, --version
38
- Show version number
39
-
40
- Examples:
41
- snapcrawl example.com
42
- snapcrawl example.com -d2 -fscreens
43
- snapcrawl example.com -d2 > out.txt 2> err.txt &
44
- snapcrawl example.com -W360 -H480
45
- snapcrawl example.com --selector "#main-content"
46
- snapcrawl example.com --only "products|collections"
47
- snapcrawl example.com --name "screenshot-%{url}"
48
- snapcrawl example.com --name "`date +%Y%m%d`_%{url}"