snapcrawl 0.4.3 → 0.5.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: bce8153d8387fd4b9487b5b807921a7b4e257fae4c1e8ac715d0188e6383211a
-  data.tar.gz: '0397a7325d6f9b47acf8c5774d31d6045440e725a3f798a1a40bb945168f1e8f'
+  metadata.gz: 62a293da259afce5690315f27f2bbcd881e495a3d1b5344eb9ed9e2c46bd4a4d
+  data.tar.gz: d600fdbcd2344e5a19f853cbea67a0d8ad0c365a38d00aa4de8d02dd6e52e5b0
 SHA512:
-  metadata.gz: a626ca6aa678dcbda8b88df754dba68366b53e38daa1ffb23f6bebbe33d272cf653bbfdbe02c4134e1a2780b0e9e06c784b50dc8ac7db973b48170792d30fa6a
-  data.tar.gz: 02ceff3026e12416e056fa7ad417b2f134c75f534829a83f045651247f57f298bf054549df7b11c5767801b45a491f6cfd76a204df6eef1743adeb8d3a1ce5d5
+  metadata.gz: 3ebdb2355480bacd7f7a6faba264a31086e68c1864c692607fdb6fbc11df210eee17af936ab63305484ee46ac473d50b4033be11e995b51b9050b359c81dd906
+  data.tar.gz: 42a0a9f048fe9b5b1b04426d444710a256ccc8e9a914e3277f062c4ebf760d50a018c1f189e7b0cebced1c236f5d13ca56ab4abbf808a5ec4812bf9a754a9343
data/README.md CHANGED
@@ -1,8 +1,7 @@
-Snapcrawl - crawl a website and take screenshots
-==================================================
+# Snapcrawl - crawl a website and take screenshots
 
-[![Build Status](https://travis-ci.com/DannyBen/snapcrawl.svg?branch=master)](https://travis-ci.com/DannyBen/snapcrawl)
 [![Gem Version](https://badge.fury.io/rb/snapcrawl.svg)](http://badge.fury.io/rb/snapcrawl)
+[![Build Status](https://github.com/DannyBen/snapcrawl/workflows/Test/badge.svg)](https://github.com/DannyBen/snapcrawl/actions?query=workflow%3ATest)
 [![Code Climate](https://codeclimate.com/github/DannyBen/snapcrawl/badges/gpa.svg)](https://codeclimate.com/github/DannyBen/snapcrawl)
 
 ---
@@ -11,8 +10,7 @@ Snapcrawl is a command line utility for crawling a website and saving
 screenshots.
 
 
-Features
---------------------------------------------------
+## Features
 
 - Crawls a website to any given depth and saves screenshots
 - Can capture the full length of the page
@@ -21,100 +19,109 @@ Features
 - Uses local caching to avoid expensive crawl operations if not needed
 - Reports broken links
 
+## Install
 
-Prerequisites
---------------------------------------------------
-
-Snapcrawl requires [PhantomJS][1] and [ImageMagick][2].
-
-
-Docker Image
---------------------------------------------------
+**Using Docker**
 
 You can run Snapcrawl by using this docker image (which contains all the
 necessary prerequisites):
 
-```
-$ docker pull dannyben/snapcrawl
+```shell
+$ alias snapcrawl='docker run --rm -it --network host --volume "$PWD:/app" dannyben/snapcrawl'
 ```
 
-Then you can use it like this:
+For more information on the Docker image, refer to the [docker-snapcrawl][3] repository.
 
-```
-$ docker run --rm -it dannyben/snapcrawl --help
+**Using Ruby**
+
+```shell
+$ gem install snapcrawl
 ```
 
-For more information refer to the [docker-snapcrawl][3] repository.
+Note that Snapcrawl requires [PhantomJS][1] and [ImageMagick][2].
 
+## Usage
 
-Install
---------------------------------------------------
+Snapcrawl can be configured either through a configuration file (YAML), or by specifying options in the command line.
 
+```shell
+$ snapcrawl
+Usage:
+  snapcrawl URL [--config FILE] [SETTINGS...]
+  snapcrawl -h | --help
+  snapcrawl -v | --version
 ```
-$ gem install snapcrawl
+
+The default configuration filename is `snapcrawl.yml`.
+
+Using the `--config` flag will create a template configuration file if it is not present:
+
+```shell
+$ snapcrawl example.com --config snapcrawl
 ```
 
+### Specifying options in the command line
 
-Usage
---------------------------------------------------
+All configuration options can be specified in the command line as `key=value` pairs:
 
+```shell
+$ snapcrawl example.com log_level=0 depth=2 width=1024
 ```
-$ snapcrawl --help
 
-Snapcrawl
+### Sample configuration file
 
-Usage:
-  snapcrawl URL [options]
-  snapcrawl -h | --help
-  snapcrawl -v | --version
+```yaml
+# All values below are the default values
+
+# log level (0-4) 0=DEBUG 1=INFO 2=WARN 3=ERROR 4=FATAL
+log_level: 1
 
-Options:
-  -f, --folder PATH
-    Where to save screenshots [default: snaps]
+# log_color (yes, no, auto)
+#   yes  = always show log color
+#   no   = never use colors
+#   auto = only use colors when running in an interactive terminal
+log_color: auto
 
-  -n, --name TEMPLATE
-    Filename template. Include the string '%{url}' anywhere in the name to
-    use the captured URL in the filename [default: %{url}]
+# number of levels to crawl, 0 means capture only the root URL
+depth: 1
 
-  -a, --age SECONDS
-    Number of seconds to consider screenshots fresh [default: 86400]
+# screenshot width in pixels
+width: 1280
 
-  -d, --depth LEVELS
-    Number of levels to crawl [default: 1]
+# screenshot height in pixels, 0 means the entire height
+height: 0
 
-  -W, --width PIXELS
-    Screen width in pixels [default: 1280]
+# number of seconds to consider the page cache and its screenshot fresh
+cache_life: 86400
 
-  -H, --height PIXELS
-    Screen height in pixels. Use 0 to capture the full page [default: 0]
+# where to store the HTML page cache
+cache_dir: cache
 
-  -s, --selector SELECTOR
-    CSS selector to capture
+# where to store screenshots
+snaps_dir: snaps
 
-  -o, --only REGEX
-    Include only URLs that match REGEX
+# screenshot filename template, where '%{url}' will be replaced with a
+# slug version of the URL (no need to include the .png extension)
+name_template: '%{url}'
 
-  -h, --help
-    Show this screen
+# urls not matching this regular expression will be ignored
+url_whitelist:
 
-  -v, --version
-    Show version number
+# urls matching this regular expression will be ignored
+url_blacklist:
 
-Examples:
-  snapcrawl example.com
-  snapcrawl example.com -d2 -fscreens
-  snapcrawl example.com -d2 > out.txt 2> err.txt &
-  snapcrawl example.com -W360 -H480
-  snapcrawl example.com --selector "#main-content"
-  snapcrawl example.com --only "products|collections"
-  snapcrawl example.com --name "screenshot-%{url}"
-  snapcrawl example.com --name "`date +%Y%m%d`_%{url}"
+# take a screenshot of this CSS selector only
+css_selector:
 ```
 
+## Contributing / Support
+If you experience any issue, have a question or a suggestion, or if you wish
+to contribute, feel free to [open an issue][issues].
+
 ---
 
 [1]: http://phantomjs.org/download.html
 [2]: https://imagemagick.org/script/download.php
 [3]: https://github.com/DannyBen/docker-snapcrawl
-
+[issues]: https://github.com/DannyBen/snapcrawl/issues
 
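The new README shows each mechanism in isolation, so for review purposes here is a hypothetical run combining both: settings are loaded from the config file first, and `key=value` arguments are applied on top of them (see `CLI#apply_tweaks` below). The filename `production.yml` is an assumed example:

```shell
# Loads production.yml (creating it from the template if missing),
# then overrides two of its values for this run only
$ snapcrawl example.com --config production depth=3 log_color=no
```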
data/bin/snapcrawl CHANGED
@@ -1,22 +1,30 @@
 #!/usr/bin/env ruby
 
 require 'snapcrawl'
+require 'colsole'
+
 trap(:INT) { abort "\r\nGoodbye" }
+
 include Snapcrawl
+include Colsole
 
 begin
-  Crawler.instance.handle ARGV
+  CLI.new.call ARGV
+
 rescue MissingPhantomJS => e
   message = "Cannot find phantomjs executable in the path, please install it first."
   say! "\n\n!undred!#{e.class}!txtrst!\n#{message}"
   exit 2
+
 rescue MissingImageMagick => e
   message = "Cannot find convert (ImageMagick) executable in the path, please install it first."
   say! "\n\n!undred!#{e.class}!txtrst!\n#{message}"
   exit 3
+
 rescue => e
   puts e.backtrace.reverse if ENV['DEBUG']
-  say! "\n\n!undred!#{e.class}!txtrst!\n#{e.message}"
+  say! "\n!undred!#{e.class}!txtrst!\n#{e.message}"
   exit 1
+
 end
 
data/lib/snapcrawl.rb CHANGED
@@ -1,6 +1,20 @@
 require 'snapcrawl/version'
 require 'snapcrawl/exceptions'
+require 'snapcrawl/refinements/pair_split'
+require 'snapcrawl/refinements/string_refinements'
+require 'snapcrawl/log_helpers'
+require 'snapcrawl/pretty_logger'
+require 'snapcrawl/dependencies'
+require 'snapcrawl/config'
+require 'snapcrawl/screenshot'
+require 'snapcrawl/page'
 require 'snapcrawl/crawler'
+require 'snapcrawl/cli'
 
-require 'byebug' if ENV['BYEBUG']
+if ENV['BYEBUG']
+  require 'byebug'
+  require 'lp'
+end
 
+Snapcrawl::Config.load
+$logger = Snapcrawl::PrettyLogger.new
data/lib/snapcrawl/cli.rb ADDED
@@ -0,0 +1,55 @@
+require 'colsole'
+require 'docopt'
+require 'fileutils'
+
+module Snapcrawl
+  class CLI
+    include Colsole
+    using StringRefinements
+    using PairSplit
+
+    def call(args = [])
+      begin
+        execute Docopt::docopt(docopt, version: VERSION, argv: args)
+      rescue Docopt::Exit => e
+        puts e.message
+      end
+    end
+
+    private
+
+    def execute(args)
+      config_file = args['--config']
+      Config.load config_file if config_file
+
+      tweaks = args['SETTINGS'].pair_split
+      apply_tweaks tweaks if tweaks
+
+      Dependencies.verify
+
+      $logger.debug 'initializing cli'
+      FileUtils.mkdir_p Config.snaps_dir
+
+      url = args['URL'].protocolize
+      crawler = Crawler.new url
+
+      crawler.crawl
+    end
+
+    def docopt
+      @doc ||= File.read docopt_path
+    end
+
+    def docopt_path
+      File.expand_path "templates/docopt.txt", __dir__
+    end
+
+    def apply_tweaks(tweaks)
+      tweaks.each do |key, value|
+        Config.settings[key] = value
+        $logger.level = value if key == 'log_level'
+      end
+    end
+
+  end
+end
data/lib/snapcrawl/config.rb ADDED
@@ -0,0 +1,59 @@
+require 'sting'
+require 'fileutils'
+
+module Snapcrawl
+  class Config < Sting
+    class << self
+      def load(file = nil)
+        reset!
+        push defaults
+
+        return unless file
+
+        file = "#{file}.yml" unless file =~ /\.ya?ml$/
+
+        # FIXME: Cannot use logger here due to the "chicken and egg" with
+        #        Config. The $logger is available, but it was not yet fully
+        #        configured with log_level etc.
+        if File.exist? file
+          # $logger.debug "loading config file !txtgrn!#{file}"
+          push file
+        else
+          # $logger.debug "creating config file !txtgrn!#{file}"
+          create_config file
+        end
+      end
+
+      private
+
+      def defaults
+        {
+          depth: 1,
+          width: 1280,
+          height: 0,
+          cache_life: 86400,
+          cache_dir: 'cache',
+          snaps_dir: 'snaps',
+          name_template: '%{url}',
+          url_whitelist: nil,
+          url_blacklist: nil,
+          css_selector: nil,
+          log_level: 1,
+          log_color: 'auto',
+        }
+      end
+
+      def create_config(file)
+        content = File.read config_template
+        dir = File.dirname file
+        FileUtils.mkdir_p dir
+        File.write file, content
+      end
+
+      def config_template
+        File.expand_path 'templates/config.yml', __dir__
+      end
+
+    end
+  end
+end
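For orientation, a minimal sketch of how this Sting-backed `Config` behaves, using only the defaults and key names defined above; the session itself is hypothetical:

```ruby
require 'snapcrawl'

# Defaults only (no file argument)
Snapcrawl::Config.load
Snapcrawl::Config.depth        # => 1

# Defaults first, then a YAML file layered on top; '.yml' is appended
# automatically, and a template file is created when none exists
Snapcrawl::Config.load 'snapcrawl'
Snapcrawl::Config.cache_life   # => 86400, unless snapcrawl.yml overrides it
```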
data/lib/snapcrawl/crawler.rb CHANGED
@@ -1,257 +1,99 @@
-require 'colsole'
-require 'docopt'
 require 'fileutils'
-require 'httparty'
-require 'nokogiri'
-require 'ostruct'
-require 'pstore'
-require 'addressable/uri'
-require 'webshot'
 
 module Snapcrawl
-  include Colsole
-
   class Crawler
-    include Singleton
-
-    def initialize
-      @storefile = "snapcrawl.pstore"
-      @store = PStore.new(@storefile)
-    end
+    using StringRefinements
 
-    def handle(args)
-      @done = []
-      begin
-        execute Docopt::docopt(doc, version: VERSION, argv: args)
-      rescue Docopt::Exit => e
-        puts e.message
-      end
-    end
+    attr_reader :url
 
-    def execute(args)
-      raise MissingPhantomJS unless command_exist? "phantomjs"
-      raise MissingImageMagick unless command_exist? "convert"
-      crawl args['URL'].dup, opts_from_args(args)
+    def initialize(url)
+      $logger.debug "initializing crawler with !txtgrn!#{url}"
+
+      config_for_display = Config.settings.dup
+      config_for_display['name_template'] = '%%{url}'
+
+      $logger.debug "config #{config_for_display}"
+      @url = url
     end
 
-    def clear_cache
-      FileUtils.rm @storefile if File.exist? @storefile
+    def crawl
+      Dependencies.verify
+      todo[url] = Page.new url
+      process_todo while todo.any?
     end
 
    private
 
-    def crawl(url, opts={})
-      url = protocolize url
-      defaults = {
-        width: 1280,
-        height: 0,
-        depth: 1,
-        age: 86400,
-        folder: 'snaps',
-        name: '%{url}',
-        base: url,
-      }
-      urls = [url]
-
-      @opts = OpenStruct.new defaults.merge(opts)
+    def process_todo
+      $logger.debug "processing queue: !txtgrn!#{todo.count} remaining"
 
-      make_screenshot_dir @opts.folder
+      url, page = todo.shift
+      done.push url
 
-      @opts.depth.times do
-        urls = crawl_and_snap urls
+      if process_page page
+        register_sub_pages page.pages if page.depth < Config.depth
       end
     end
 
-    def crawl_and_snap(urls)
-      new_urls = []
-      urls.each do |url|
-        next if @done.include? url
-        @done << url
-        say "\n!txtgrn!----->  Visit: #{url}"
-        if @opts.only and url !~ /#{@opts.only}/
-          say "       Snap: Skipping. Does not match regex"
-        else
-          snap url
+    def register_sub_pages(pages)
+      pages.each do |sub_page|
+        next if todo.has_key?(sub_page) or done.include?(sub_page)
+
+        if Config.url_whitelist and sub_page.path !~ /#{Config.url_whitelist}/
+          $logger.debug "ignoring !undpur!#{sub_page.url}!txtrst!, reason: whitelist"
+          next
         end
-        new_urls += extract_urls_from url
-      end
-      new_urls
-    end
-
-    # Take a screenshot of a URL, unless we already did so recently
-    def snap(url)
-      file = image_path_for(url)
-      if file_fresh? file
-        say "       Snap: Skipping. File exists and seems fresh"
-      else
-        snap!(url)
-      end
-    end
 
-    # Take a screenshot of the URL, even if file exists
-    def snap!(url)
-      say "       !txtblu!Snap!!txtrst! Snapping picture... "
-      image_path = image_path_for url
-
-      fetch_opts = { allowed_status_codes: [404, 401, 403] }
-      if @opts.selector
-        fetch_opts[:selector] = @opts.selector
-        fetch_opts[:full] = false
-      end
-
-      hide_output do
-        webshot.capture url, image_path, fetch_opts do |magick|
-          magick.combine_options do |c|
-            c.background "white"
-            c.gravity 'north'
-            c.quality 100
-            c.extent @opts.height > 0 ? "#{@opts.width}x#{@opts.height}" : "#{@opts.width}x"
-          end
+        if Config.url_blacklist and sub_page.path =~ /#{Config.url_blacklist}/
+          $logger.debug "ignoring !undpur!#{sub_page.url}!txtrst!, reason: blacklist"
+          next
         end
-      end
 
-      say "done"
-    end
-
-    def extract_urls_from(url)
-      cached = nil
-      @store.transaction { cached = @store[url] }
-      if cached
-        say "      Crawl: Page was cached. Reading subsequent URLs from cache"
-        return cached
-      else
-        return extract_urls_from! url
+        todo[sub_page.url] = sub_page
       end
     end
 
-    def extract_urls_from!(url)
-      say "      !txtblu!Crawl!!txtrst! Extracting links... "
+    def process_page(page)
+      outfile = "#{Config.snaps_dir}/#{Config.name_template}.png" % { url: page.url.to_slug }
 
-      begin
-        response = HTTParty.get url
-        if response.success?
-          doc = Nokogiri::HTML response.body
-          links = doc.css('a')
-          links, warnings = normalize_links links
-          @store.transaction { @store[url] = links }
-          say "done"
-          warnings.each do |warning|
-            say "!txtylw!       Warn: #{warning[:link]}"
-            say word_wrap "             #{warning[:message]}"
-          end
-        else
-          links = []
-          say "!txtred!FAILED"
-          say "!txtred!        ! HTTP Error: #{response.code} #{response.message.strip} at #{url}"
-        end
-      end
-      links
-    end
+      $logger.info "processing !undpur!#{page.url}!txtrst!, depth: #{page.depth}"
 
-    # mkdir the screenshots folder, if needed
-    def make_screenshot_dir(dir)
-      Dir.exist? dir or FileUtils.mkdir_p dir
-    end
+      if !page.valid?
+        $logger.debug "page #{page.path} is invalid, aborting process"
+        return false
+      end
 
-    # Convert any string to a proper handle
-    def handelize(str)
-      str.downcase.gsub(/[^a-z0-9]+/, '-')
-    end
+      if file_fresh? outfile
+        $logger.info "screenshot for #{page.path} already exists"
+      else
+        $logger.info "!bldgrn!capturing screenshot for #{page.path}"
+        save_screenshot page, outfile
+      end
 
-    # Return proper image path for a URL
-    def image_path_for(url)
-      "#{@opts.folder}/#{@opts.name}.png" % { url: handelize(url) }
+      true
     end
 
-    # Add protocol to a URL if needed
-    def protocolize(url)
-      url =~ /^http/ ? url : "http://#{url}"
+    def save_screenshot(page, outfile)
+      page.save_screenshot outfile
+    rescue => e
+      $logger.error "screenshot error on !undpur!#{page.path}!txtrst! - !txtred!#{e.class}!txtrst!: #{e.message}"
     end
 
-    # Return true if the file exists and is not too old
     def file_fresh?(file)
-      @opts.age > 0 and File.exist?(file) and file_age(file) < @opts.age
+      Config.cache_life > 0 and File.exist?(file) and file_age(file) < Config.cache_life
     end
 
-    # Return file age in seconds
     def file_age(file)
       (Time.now - File.stat(file).mtime).to_i
     end
 
-    # Process an array of links and return a better one
-    def normalize_links(links)
-      extensions = "png|gif|jpg|pdf|zip"
-      beginnings = "mailto|tel"
-
-      links_array = []
-      warnings = []
-
-      links.each do |link|
-        link = link.attribute('href').to_s.dup
-
-        # Remove #hash
-        link.gsub!(/#.+$/, '')
-        next if link.empty?
-
-        # Remove links to specific extensions and protocols
-        next if link =~ /\.(#{extensions})(\?.*)?$/
-        next if link =~ /^(#{beginnings})/
-
-        # Strip spaces
-        link.strip!
-
-        # Convert relative links to absolute
-        begin
-          link = Addressable::URI.join( @opts.base, link ).to_s.dup
-        rescue => e
-          warnings << { link: link, message: "#{e.class} #{e.message}" }
-          next
-        end
-
-        # Keep only links in our base domain
-        next unless link.include? @opts.base
-
-        links_array << link
-      end
-
-      [links_array.uniq, warnings]
-    end
-
-    def doc
-      @doc ||= File.read docopt
-    end
-
-    def docopt
-      File.expand_path "docopt.txt", __dir__
-    end
-
-    def opts_from_args(args)
-      opts = {}
-      %w[folder name selector only].each do |opt|
-        opts[opt.to_sym] = args["--#{opt}"] if args["--#{opt}"]
-      end
-
-      %w[age depth width height].each do |opt|
-        opts[opt.to_sym] = args["--#{opt}"].to_i if args["--#{opt}"]
-      end
-
-      opts
+    def todo
+      @todo ||= {}
     end
 
-    def webshot
-      @webshot ||= Webshot::Screenshot.instance
+    def done
+      @done ||= []
     end
 
-    # The webshot gem messes with stdout/stderr streams so we keep it in
-    # check by using this method. Also, on some sites (e.g. uown.co) it
-    # prints some output to stdout, which is why we override $stdout for
-    # the duration of the run.
-    def hide_output
-      keep_stdout, keep_stderr = $stdout, $stderr
-      $stdout, $stderr = StringIO.new, StringIO.new
-      yield
-    ensure
-      $stdout, $stderr = keep_stdout, keep_stderr
-    end
   end
 end
data/lib/snapcrawl/dependencies.rb ADDED
@@ -0,0 +1,21 @@
+require 'colsole'
+
+module Snapcrawl
+  class Dependencies
+    class << self
+      include Colsole
+
+      def verify
+        return if @verified
+
+        $logger.debug 'verifying !txtgrn!phantomjs!txtrst! is present'
+        raise MissingPhantomJS unless command_exist? "phantomjs"
+
+        $logger.debug 'verifying !txtgrn!imagemagick!txtrst! is present'
+        raise MissingImageMagick unless command_exist? "convert"
+
+        @verified = true
+      end
+    end
+  end
+end
data/lib/snapcrawl/exceptions.rb CHANGED
@@ -1,4 +1,5 @@
 module Snapcrawl
   class MissingPhantomJS < StandardError; end
   class MissingImageMagick < StandardError; end
+  class ScreenshotError < StandardError; end
 end
data/lib/snapcrawl/log_helpers.rb ADDED
@@ -0,0 +1,35 @@
+require 'colsole'
+
+module Snapcrawl
+  module LogHelpers
+    include Colsole
+
+    SEVERITY_COLORS = {
+      'INFO'  => :txtblu,
+      'WARN'  => :txtylw,
+      'ERROR' => :txtred,
+      'FATAL' => :txtred,
+      'DEBUG' => :txtcyn
+    }
+
+    def log_formatter
+      proc do |severity, _time, _prog, message|
+        severity_color = SEVERITY_COLORS[severity]
+        line = "!#{severity_color}!#{severity.rjust 5}!txtrst! : #{message}\n"
+        use_colors? ? colorize(line) : strip_color_markers(line)
+      end
+    end
+
+    def use_colors?
+      @use_colors ||= (Config.log_color == 'auto' ? tty? : Config.log_color)
+    end
+
+    def tty?
+      ENV['TTY'] == 'on' ? true : ENV['TTY'] == 'off' ? false : $stdout.tty?
+    end
+
+    def strip_color_markers(text)
+      text.gsub(/\!([a-z]{6})\!/, '')
+    end
+  end
+end
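The formatter's output shape is easy to verify in isolation. This hypothetical IRB check uses only the `rjust 5` padding and the marker-stripping regex defined above:

```ruby
# Markers like !txtblu! and !txtrst! are removed when colors are disabled
line = "!txtblu!#{'INFO'.rjust 5}!txtrst! : processing page\n"
line.gsub(/\!([a-z]{6})\!/, '')   # => " INFO : processing page\n"
```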
data/lib/snapcrawl/page.rb ADDED
@@ -0,0 +1,111 @@
+require 'addressable/uri'
+require 'fileutils'
+require 'httparty'
+require 'lightly'
+require 'nokogiri'
+
+module Snapcrawl
+  class Page
+    using StringRefinements
+
+    attr_reader :url, :depth
+
+    EXTENSION_BLACKLIST = "png|gif|jpg|pdf|zip"
+    PROTOCOL_BLACKLIST  = "mailto|tel"
+
+    def initialize(url, depth: 0)
+      @url, @depth = url.protocolize, depth
+    end
+
+    def valid?
+      http_response&.success?
+    end
+
+    def site
+      @site ||= Addressable::URI.parse(url).site
+    end
+
+    def path
+      @path ||= Addressable::URI.parse(url).request_uri
+    end
+
+    def links
+      return nil unless valid?
+      doc = Nokogiri::HTML http_response.body
+      normalize_links doc.css('a')
+    end
+
+    def pages
+      return nil unless valid?
+      links.map { |link| Page.new link, depth: depth+1 }
+    end
+
+    def save_screenshot(outfile)
+      return false unless valid?
+      Screenshot.new(url).save "#{outfile}"
+    end
+
+    private
+
+    def http_response
+      @http_response ||= http_response!
+    end
+
+    def http_response!
+      response = cache.get(url) { HTTParty.get url }
+
+      if !response.success?
+        $logger.warn "http error on !undpur!#{url}!txtrst!, code: !txtylw!#{response.code}!txtrst!, message: #{response.message.strip}"
+      end
+
+      response
+
+    rescue => e
+      $logger.error "http error on !undpur!#{url}!txtrst! - !txtred!#{e.class}!txtrst!: #{e.message}"
+      nil
+    end
+
+    def normalize_links(links)
+      result = []
+
+      links.each do |link|
+        valid_link = normalize_link link
+        result << valid_link if valid_link
+      end
+
+      result.uniq
+    end
+
+    def normalize_link(link)
+      link = link.attribute('href').to_s.dup
+
+      # Remove #hash
+      link.gsub!(/#.+$/, '')
+      return nil if link.empty?
+
+      # Remove links to specific extensions and protocols
+      return nil if link =~ /\.(#{EXTENSION_BLACKLIST})(\?.*)?$/
+      return nil if link =~ /^(#{PROTOCOL_BLACKLIST}):/
+
+      # Strip spaces
+      link.strip!
+
+      # Convert relative links to absolute
+      begin
+        link = Addressable::URI.join(url, link).to_s.dup
+      rescue => e
+        $logger.warn "!txtred!#{e.class}!txtrst!: #{e.message} on #{path} (link: #{link})"
+        return nil
+      end
+
+      # Keep only links in our base domain
+      return nil unless link.include? site
+      link
+    end
+
+    def cache
+      Lightly.new life: Config.cache_life
+    end
+  end
+end
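`Page` is usable on its own, which makes the crawler loop above easier to follow. A hypothetical session, assuming network access and a loaded `Config`:

```ruby
require 'snapcrawl'

page = Snapcrawl::Page.new 'example.com'  # protocolize prepends http://
page.valid?        # => true when the (cached) HTTP GET succeeded
page.path          # => "/"
page.links         # => absolute, same-site link URLs found on the page
page.pages         # => Page objects for those links, at depth + 1
page.save_screenshot 'example.png'
```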
data/lib/snapcrawl/pretty_logger.rb ADDED
@@ -0,0 +1,11 @@
+require 'logger'
+
+module Snapcrawl
+  class PrettyLogger
+    extend LogHelpers
+
+    def self.new
+      Logger.new($stdout, formatter: log_formatter, level: Config.log_level)
+    end
+  end
+end
data/lib/snapcrawl/refinements/pair_split.rb ADDED
@@ -0,0 +1,23 @@
+module Snapcrawl
+  module PairSplit
+    refine Array do
+      def pair_split
+        map do |pair|
+          key, value = pair.split '='
+
+          value = if value =~ /^\d+$/
+            value.to_i
+          elsif ['no', 'false'].include? value
+            false
+          elsif ['yes', 'true'].include? value
+            true
+          else
+            value
+          end
+
+          [key, value]
+        end.to_h
+      end
+    end
+  end
+end
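The coercion rules above are straightforward to trace; given the refinement, a hypothetical call looks like this:

```ruby
using Snapcrawl::PairSplit

# Integers and yes/no/true/false strings are coerced, the rest pass through
['depth=2', 'log_color=no', 'cache_dir=tmp'].pair_split
# => { "depth" => 2, "log_color" => false, "cache_dir" => "tmp" }
```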
data/lib/snapcrawl/refinements/string_refinements.rb ADDED
@@ -0,0 +1,13 @@
+module Snapcrawl
+  module StringRefinements
+    refine String do
+      def to_slug
+        downcase.gsub(/[^a-z0-9]+/, '-')
+      end
+
+      def protocolize
+        self =~ /^http/ ? self : "http://#{self}"
+      end
+    end
+  end
+end
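And the two string helpers, in a hypothetical session:

```ruby
using Snapcrawl::StringRefinements

'Example.com/About Us'.to_slug    # => "example-com-about-us"
'example.com'.protocolize         # => "http://example.com"
'https://example.com'.protocolize # => "https://example.com" (unchanged)
```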
data/lib/snapcrawl/screenshot.rb ADDED
@@ -0,0 +1,62 @@
+require 'webshot'
+
+module Snapcrawl
+  class Screenshot
+    using StringRefinements
+
+    attr_reader :url
+
+    def initialize(url)
+      @url = url
+    end
+
+    def save(outfile = nil)
+      outfile ||= "#{url.to_slug}.png"
+
+      fetch_opts = { allowed_status_codes: [404, 401, 403] }
+      if Config.css_selector
+        fetch_opts[:selector] = Config.css_selector
+        fetch_opts[:full] = false
+      end
+
+      webshot_capture url, outfile, fetch_opts
+    end
+
+    private
+
+    def webshot_capture(url, image_path, fetch_opts)
+      webshot_capture! url, image_path, fetch_opts
+    rescue => e
+      raise ScreenshotError, "#{e.class} #{e.message}"
+    end
+
+    def webshot_capture!(url, image_path, fetch_opts)
+      hide_output do
+        webshot.capture url, image_path, fetch_opts do |magick|
+          magick.combine_options do |c|
+            c.background "white"
+            c.gravity 'north'
+            c.quality 100
+            c.extent Config.height > 0 ? "#{Config.width}x#{Config.height}" : "#{Config.width}x"
+          end
+        end
+      end
+    end
+
+    def webshot
+      @webshot ||= Webshot::Screenshot.instance
+    end
+
+    # The webshot gem messes with stdout/stderr streams so we keep it in
+    # check by using this method. Also, on some sites (e.g. uown.co) it
+    # prints some output to stdout, which is why we override $stdout for
+    # the duration of the run.
+    def hide_output
+      keep_stdout, keep_stderr = $stdout, $stderr
+      $stdout, $stderr = StringIO.new, StringIO.new
+      yield
+    ensure
+      $stdout, $stderr = keep_stdout, keep_stderr
+    end
+  end
+end
data/lib/snapcrawl/templates/config.yml ADDED
@@ -0,0 +1,41 @@
+# All values below are the default values
+
+# log level (0-4) 0=DEBUG 1=INFO 2=WARN 3=ERROR 4=FATAL
+log_level: 1
+
+# log_color (yes, no, auto)
+#   yes  = always show log color
+#   no   = never use colors
+#   auto = only use colors when running in an interactive terminal
+log_color: auto
+
+# number of levels to crawl, 0 means capture only the root URL
+depth: 1
+
+# screenshot width in pixels
+width: 1280
+
+# screenshot height in pixels, 0 means the entire height
+height: 0
+
+# number of seconds to consider the page cache and its screenshot fresh
+cache_life: 86400
+
+# where to store the HTML page cache
+cache_dir: cache
+
+# where to store screenshots
+snaps_dir: snaps
+
+# screenshot filename template, where '%{url}' will be replaced with a
+# slug version of the URL (no need to include the .png extension)
+name_template: '%{url}'
+
+# urls not matching this regular expression will be ignored
+url_whitelist:
+
+# urls matching this regular expression will be ignored
+url_blacklist:
+
+# take a screenshot of this CSS selector only
+css_selector:
data/lib/snapcrawl/templates/docopt.txt ADDED
@@ -0,0 +1,26 @@
+Snapcrawl
+
+Usage:
+  snapcrawl URL [--config FILE] [SETTINGS...]
+  snapcrawl -h | --help
+  snapcrawl -v | --version
+
+Options:
+  -c, --config FILE
+    Path to config file, with or without the .yml extension.
+    A sample file will be created if not found.
+    The default filename is 'snapcrawl.yml'.
+
+  -h, --help
+    Show this screen
+
+  -v, --version
+    Show version number
+
+Settings:
+  Provide any of the options available in the config as 'key=value'.
+
+Examples:
+  snapcrawl example.com
+  snapcrawl example.com --config simple
+  snapcrawl example.com depth=1 log_level=2 width=768
data/lib/snapcrawl/version.rb CHANGED
@@ -1,3 +1,3 @@
 module Snapcrawl
-  VERSION = "0.4.3"
+  VERSION = "0.5.2"
 end
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: snapcrawl
 version: !ruby/object:Gem::Version
-  version: 0.4.3
+  version: 0.5.2
 platform: ruby
 authors:
 - Danny Ben Shitrit
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-01-09 00:00:00.000000000 Z
+date: 2021-02-25 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: colsole
@@ -16,48 +16,42 @@ dependencies:
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '0.5'
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: 0.5.4
+        version: '0.7'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
    - - "~>"
      - !ruby/object:Gem::Version
-        version: '0.5'
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: 0.5.4
+        version: '0.7'
 - !ruby/object:Gem::Dependency
   name: docopt
   requirement: !ruby/object:Gem::Requirement
     requirements:
    - - "~>"
      - !ruby/object:Gem::Version
-        version: '0.5'
+        version: '0.6'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
    - - "~>"
      - !ruby/object:Gem::Version
-        version: '0.5'
+        version: '0.6'
 - !ruby/object:Gem::Dependency
   name: nokogiri
   requirement: !ruby/object:Gem::Requirement
     requirements:
    - - "~>"
      - !ruby/object:Gem::Version
-        version: '1.6'
+        version: '1.10'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
    - - "~>"
      - !ruby/object:Gem::Version
-        version: '1.6'
+        version: '1.10'
 - !ruby/object:Gem::Dependency
   name: webshot
   requirement: !ruby/object:Gem::Requirement
@@ -78,14 +72,14 @@ dependencies:
     requirements:
    - - "~>"
      - !ruby/object:Gem::Version
-        version: '0.17'
+        version: '0.18'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
    - - "~>"
      - !ruby/object:Gem::Version
-        version: '0.17'
+        version: '0.18'
 - !ruby/object:Gem::Dependency
   name: addressable
   requirement: !ruby/object:Gem::Requirement
@@ -100,6 +94,34 @@ dependencies:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '2.7'
+- !ruby/object:Gem::Dependency
+  name: lightly
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.3'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.3'
+- !ruby/object:Gem::Dependency
+  name: sting
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.4'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.4'
 description: Snapcrawl is a command line utility for crawling a website and saving
   screenshots.
 email: db@dannyben.com
@@ -111,9 +133,19 @@ files:
 - README.md
 - bin/snapcrawl
 - lib/snapcrawl.rb
+- lib/snapcrawl/cli.rb
+- lib/snapcrawl/config.rb
 - lib/snapcrawl/crawler.rb
-- lib/snapcrawl/docopt.txt
+- lib/snapcrawl/dependencies.rb
 - lib/snapcrawl/exceptions.rb
+- lib/snapcrawl/log_helpers.rb
+- lib/snapcrawl/page.rb
+- lib/snapcrawl/pretty_logger.rb
+- lib/snapcrawl/refinements/pair_split.rb
+- lib/snapcrawl/refinements/string_refinements.rb
+- lib/snapcrawl/screenshot.rb
+- lib/snapcrawl/templates/config.yml
+- lib/snapcrawl/templates/docopt.txt
 - lib/snapcrawl/version.rb
 homepage: https://github.com/DannyBen/snapcrawl
 licenses:
@@ -134,7 +166,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
   - !ruby/object:Gem::Version
     version: '0'
 requirements: []
-rubygems_version: 3.0.3
+rubygems_version: 3.2.3
 signing_key:
 specification_version: 4
 summary: Crawl a website and take screenshots (CLI + Library)
data/lib/snapcrawl/docopt.txt DELETED
@@ -1,48 +0,0 @@
-Snapcrawl
-
-Usage:
-  snapcrawl URL [options]
-  snapcrawl -h | --help
-  snapcrawl -v | --version
-
-Options:
-  -f, --folder PATH
-    Where to save screenshots [default: snaps]
-
-  -n, --name TEMPLATE
-    Filename template. Include the string '%{url}' anywhere in the name to
-    use the captured URL in the filename [default: %{url}]
-
-  -a, --age SECONDS
-    Number of seconds to consider screenshots fresh [default: 86400]
-
-  -d, --depth LEVELS
-    Number of levels to crawl [default: 1]
-
-  -W, --width PIXELS
-    Screen width in pixels [default: 1280]
-
-  -H, --height PIXELS
-    Screen height in pixels. Use 0 to capture the full page [default: 0]
-
-  -s, --selector SELECTOR
-    CSS selector to capture
-
-  -o, --only REGEX
-    Include only URLs that match REGEX
-
-  -h, --help
-    Show this screen
-
-  -v, --version
-    Show version number
-
-Examples:
-  snapcrawl example.com
-  snapcrawl example.com -d2 -fscreens
-  snapcrawl example.com -d2 > out.txt 2> err.txt &
-  snapcrawl example.com -W360 -H480
-  snapcrawl example.com --selector "#main-content"
-  snapcrawl example.com --only "products|collections"
-  snapcrawl example.com --name "screenshot-%{url}"
-  snapcrawl example.com --name "`date +%Y%m%d`_%{url}"