snapcrawl 0.4.4 → 0.5.0.rc1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 99d66eca6d9f9ef3b952591ab2ca355094cd0dcea07669ae76474693d5b4caf4
- data.tar.gz: 04b4d1a41d8e3b550519f87c9f837ef5ad0ff5b9af7528e317a1dd0943326e12
+ metadata.gz: ced7afea220ea7c23c7207037cb32d02625fc3278e8e2347c0c9327fc0f0e509
+ data.tar.gz: 12e7a758a10cba960027ce2152187aed99cfec0c0ea2a434431a34a11a1e2f04
  SHA512:
- metadata.gz: 07bb6174d3681559d18cb73ca3c7b18f37c60a13f483ccab530b31845db7019f2413680e4e61907cadf8aeccdf6343bee484bafd444c321978de279ca3bbcda6
- data.tar.gz: 70bbb5cf8508417b5a98c3af830efb8251cc7216c73ef66b1c31495d573e8289d1b1662779305ff42a6aaafedb7f621ba7942b6b469f17c6513617e98dbe8432
+ metadata.gz: 117c0157a09a7e040c3c487c6f0d51fa20ad9c9a6be965cb8083eb32c6201effa406d0cbbd428190e1ffc41b1097347113e3ce03a88eae273a5d4d6fd2a8c85d
+ data.tar.gz: 5261d94ef0a0a2223963b70fd0bd8cc6c822e31a693d5bbcc8f452e51f92ef519df20decea90351c3e85f3bfaf30e725be1d6a4d76b4d2748663de44a7772e88
data/README.md CHANGED
@@ -1,5 +1,4 @@
- Snapcrawl - crawl a website and take screenshots
- ==================================================
+ # Snapcrawl - crawl a website and take screenshots
 
  [![Gem Version](https://badge.fury.io/rb/snapcrawl.svg)](http://badge.fury.io/rb/snapcrawl)
  [![Build Status](https://github.com/DannyBen/snapcrawl/workflows/Test/badge.svg)](https://github.com/DannyBen/snapcrawl/actions?query=workflow%3ATest)
@@ -11,8 +10,7 @@ Snapcrawl is a command line utility for crawling a website and saving
  screenshots.
 
 
- Features
- --------------------------------------------------
+ ## Features
 
  - Crawls a website to any given depth and saves screenshots
  - Can capture the full length of the page
@@ -21,100 +19,109 @@ Features
  - Uses local caching to avoid expensive crawl operations if not needed
  - Reports broken links
 
+ ## Install
 
- Prerequisites
- --------------------------------------------------
-
- Snapcrawl requires [PhantomJS][1] and [ImageMagick][2].
-
-
- Docker Image
- --------------------------------------------------
+ **Using Docker**
 
  You can run Snapcrawl by using this docker image (which contains all the
  necessary prerequisites):
 
- ```
- $ docker pull dannyben/snapcrawl
+ ```shell
+ $ alias snapcrawl="docker run --rm -it --volume $PWD:/app dannyben/snapcrawl"
  ```
 
- Then you can use it like this:
+ For more information on the Docker image, refer to the [docker-snapcrawl][3] repository.
 
- ```
- $ docker run --rm -it dannyben/snapcrawl --help
+ **Using Ruby**
+
+ ```shell
+ $ gem install snapcrawl
  ```
 
- For more information refer to the [docker-snapcrawl][3] repository.
+ Note that Snapcrawl requires [PhantomJS][1] and [ImageMagick][2].
 
+ ## Usage
 
- Install
- --------------------------------------------------
+ Snapcrawl can be configured either through a configuration file (YAML), or by specifying options in the command line.
 
+ ```shell
+ $ snapcrawl
+ Usage:
+   snapcrawl URL [--config FILE] [SETTINGS...]
+   snapcrawl -h | --help
+   snapcrawl -v | --version
  ```
- $ gem install snapcrawl
+
+ The default configuration filename is `snapcrawl.yml`.
+
+ Using the `--config` flag will create a template configuration file if it is not present:
+
+ ```shell
+ $ snapcrawl example.com --config snapcrawl
  ```
 
+ ### Specifying options in the command line
 
- Usage
- --------------------------------------------------
+ All configuration options can be specified in the command line as `key=value` pairs:
 
+ ```shell
+ $ snapcrawl example.com log_level=0 depth=2 width=1024
  ```
- $ snapcrawl --help
 
- Snapcrawl
+ ### Sample configuration file
 
- Usage:
-   snapcrawl URL [options]
-   snapcrawl -h | --help
-   snapcrawl -v | --version
+ ```yaml
+ # All values below are the default values
+
+ # log level (0-4) 0=DEBUG 1=INFO 2=WARN 3=ERROR 4=FATAL
+ log_level: 1
 
- Options:
-   -f, --folder PATH
-     Where to save screenshots [default: snaps]
+ # log_color (yes, no, auto)
+ # yes = always show log color
+ # no = never use colors
+ # auto = only use colors when running in an interactive terminal
+ log_color: auto
 
-   -n, --name TEMPLATE
-     Filename template. Include the string '%{url}' anywhere in the name to
-     use the captured URL in the filename [default: %{url}]
+ # number of levels to crawl, 0 means capture only the root URL
+ depth: 1
 
-   -a, --age SECONDS
-     Number of seconds to consider screenshots fresh [default: 86400]
+ # screenshot width in pixels
+ width: 1280
 
-   -d, --depth LEVELS
-     Number of levels to crawl [default: 1]
+ # screenshot height in pixels, 0 means the entire height
+ height: 0
 
-   -W, --width PIXELS
-     Screen width in pixels [default: 1280]
+ # number of seconds to consider the page cache and its screenshot fresh
+ cache_life: 86400
 
-   -H, --height PIXELS
-     Screen height in pixels. Use 0 to capture the full page [default: 0]
+ # where to store the HTML page cache
+ cache_dir: cache
 
-   -s, --selector SELECTOR
-     CSS selector to capture
+ # where to store screenshots
+ snaps_dir: snaps
 
-   -o, --only REGEX
-     Include only URLs that match REGEX
+ # screenshot filename template, where '%{url}' will be replaced with a
+ # slug version of the URL (no need to include the .png extension)
+ name_template: '%{url}'
 
-   -h, --help
-     Show this screen
+ # urls not matching this regular expression will be ignored
+ url_whitelist:
 
-   -v, --version
-     Show version number
+ # urls matching this regular expression will be ignored
+ url_blacklist:
 
- Examples:
-   snapcrawl example.com
-   snapcrawl example.com -d2 -fscreens
-   snapcrawl example.com -d2 > out.txt 2> err.txt &
-   snapcrawl example.com -W360 -H480
-   snapcrawl example.com --selector "#main-content"
-   snapcrawl example.com --only "products|collections"
-   snapcrawl example.com --name "screenshot-%{url}"
-   snapcrawl example.com --name "`date +%Y%m%d`_%{url}"
+ # take a screenshot of this CSS selector only
+ css_selector:
  ```
 
+ ## Contributing / Support
+ If you experience any issue, have a question or a suggestion, or if you wish
+ to contribute, feel free to [open an issue][issues].
+
  ---
 
  [1]: http://phantomjs.org/download.html
  [2]: https://imagemagick.org/script/download.php
  [3]: https://github.com/DannyBen/docker-snapcrawl
-
+ [issues]: https://github.com/DannyBen/snapcrawl/issues
 
data/bin/snapcrawl CHANGED
@@ -1,22 +1,30 @@
  #!/usr/bin/env ruby
 
  require 'snapcrawl'
+ require 'colsole'
+
  trap(:INT) { abort "\r\nGoodbye" }
+
  include Snapcrawl
+ include Colsole
 
  begin
-   Crawler.instance.handle ARGV
+   CLI.new.call ARGV
+
  rescue MissingPhantomJS => e
    message = "Cannot find phantomjs executable in the path, please install it first."
    say! "\n\n!undred!#{e.class}!txtrst!\n#{message}"
    exit 2
+
  rescue MissingImageMagick=> e
    message = "Cannot find convert (ImageMagick) executable in the path, please install it first."
    say! "\n\n!undred!#{e.class}!txtrst!\n#{message}"
    exit 3
+
  rescue => e
    puts e.backtrace.reverse if ENV['DEBUG']
-   say! "\n\n!undred!#{e.class}!txtrst!\n#{e.message}"
+   say! "\n!undred!#{e.class}!txtrst!\n#{e.message}"
    exit 1
+
  end
 
data/lib/snapcrawl.rb CHANGED
@@ -1,6 +1,20 @@
  require 'snapcrawl/version'
  require 'snapcrawl/exceptions'
+ require 'snapcrawl/refinements/pair_split'
+ require 'snapcrawl/refinements/string_refinements'
+ require 'snapcrawl/log_helpers'
+ require 'snapcrawl/pretty_logger'
+ require 'snapcrawl/dependencies'
+ require 'snapcrawl/config'
+ require 'snapcrawl/screenshot'
+ require 'snapcrawl/page'
  require 'snapcrawl/crawler'
+ require 'snapcrawl/cli'
 
- require 'byebug' if ENV['BYEBUG']
+ if ENV['BYEBUG']
+   require 'byebug'
+   require 'lp'
+ end
 
+ Snapcrawl::Config.load
+ $logger = Snapcrawl::PrettyLogger.new
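
With this entry point, `require 'snapcrawl'` now has two side effects: it loads the default configuration and installs a global `$logger`. A minimal sketch of driving the library without the CLI (the URL is a placeholder, and PhantomJS/ImageMagick must be installed):

```ruby
require 'fileutils'
require 'snapcrawl'

# Config defaults and the global logger are ready right after the require
$logger.info "crawling to depth #{Snapcrawl::Config.depth}"

# Mirror what the CLI does before crawling: create the screenshots folder
FileUtils.mkdir_p Snapcrawl::Config.snaps_dir

Snapcrawl::Crawler.new('http://example.com').crawl
```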
data/lib/snapcrawl/cli.rb ADDED
@@ -0,0 +1,55 @@
+ require 'colsole'
+ require 'docopt'
+ require 'fileutils'
+
+ module Snapcrawl
+   class CLI
+     include Colsole
+     using StringRefinements
+     using PairSplit
+
+     def call(args = [])
+       begin
+         execute Docopt::docopt(docopt, version: VERSION, argv: args)
+       rescue Docopt::Exit => e
+         puts e.message
+       end
+     end
+
+     private
+
+     def execute(args)
+       status = Config.load args['--config']
+       $logger.debug 'config file created' if status == :created
+
+       tweaks = args['SETTINGS'].pair_split
+       apply_tweaks tweaks if tweaks
+
+       Dependencies.verify
+
+       $logger.debug 'initializing cli'
+       FileUtils.mkdir_p Config.snaps_dir
+
+       url = args['URL'].protocolize
+       crawler = Crawler.new url
+
+       crawler.crawl
+     end
+
+     def docopt
+       @doc ||= File.read docopt_path
+     end
+
+     def docopt_path
+       File.expand_path "templates/docopt.txt", __dir__
+     end
+
+     def apply_tweaks(tweaks)
+       tweaks.each do |key, value|
+         Config.settings[key] = value
+         $logger.level = value if key == 'log_level'
+       end
+     end
+
+   end
+ end
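
For reference, the executable drives this class with `CLI.new.call ARGV` (see `bin/snapcrawl` above), so the same entry point can be exercised directly, for example from a test. The arguments below are illustrative:

```ruby
require 'snapcrawl'

# Equivalent to running: snapcrawl example.com depth=2 log_level=0
Snapcrawl::CLI.new.call ['example.com', 'depth=2', 'log_level=0']
```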
data/lib/snapcrawl/config.rb ADDED
@@ -0,0 +1,54 @@
+ require 'sting'
+ require 'fileutils'
+
+ module Snapcrawl
+   class Config < Sting
+     class << self
+       def load(file = nil)
+         reset!
+         push defaults
+
+         return unless file
+
+         file = "#{file}.yml" unless file =~ /\.ya?ml$/
+
+         if File.exist? file
+           push file
+         else
+           create_config file
+         end
+       end
+
+       private
+
+       def defaults
+         {
+           depth: 1,
+           width: 1280,
+           height: 0,
+           cache_life: 86400,
+           cache_dir: 'cache',
+           snaps_dir: 'snaps',
+           name_template: '%{url}',
+           url_whitelist: nil,
+           css_selector: nil,
+           log_level: 1,
+           log_color: 'auto',
+         }
+       end
+
+       def create_config(file)
+         $logger.debug "creating config file %{green}#{file}%{reset}"
+         content = File.read config_template
+         dir = File.dirname file
+         FileUtils.mkdir_p dir
+         File.write file, content
+       end
+
+       def config_template
+         File.expand_path 'templates/config.yml', __dir__
+       end
+
+     end
+   end
+ end
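
`Config` builds on the Sting settings gem: `push` layers the defaults hash first and the YAML file on top, and each key becomes readable as a class method (the library itself calls `Config.depth`, `Config.snaps_dir`, and so on). A short sketch of the resulting behavior, assuming that documented Sting API:

```ruby
require 'snapcrawl'

Snapcrawl::Config.load            # defaults only
Snapcrawl::Config.depth           # => 1

# Layers my-settings.yml over the defaults; the file is created from the
# bundled template if it does not exist yet
Snapcrawl::Config.load 'my-settings'
```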
data/lib/snapcrawl/crawler.rb CHANGED
@@ -1,267 +1,93 @@
- require 'colsole'
- require 'docopt'
  require 'fileutils'
- require 'httparty'
- require 'nokogiri'
- require 'ostruct'
- require 'pstore'
- require 'addressable/uri'
- require 'webshot'
 
  module Snapcrawl
-   include Colsole
-
    class Crawler
-     include Singleton
-
-     def initialize
-       @storefile = "snapcrawl.pstore"
-       @store = PStore.new(@storefile)
-     end
+     using StringRefinements
 
-     def handle(args)
-       @done = []
-       begin
-         execute Docopt::docopt(doc, version: VERSION, argv: args)
-       rescue Docopt::Exit => e
-         puts e.message
-       end
-     end
+     attr_reader :url
 
-     def execute(args)
-       raise MissingPhantomJS unless command_exist? "phantomjs"
-       raise MissingImageMagick unless command_exist? "convert"
-       crawl args['URL'].dup, opts_from_args(args)
+     def initialize(url)
+       $logger.debug "initializing crawler with %{green}#{url}%{reset}"
+
+       config_for_display = Config.settings.dup
+       config_for_display['name_template'] = '%%{url}'
+
+       $logger.debug "config #{config_for_display}"
+       @url = url
      end
 
-     def clear_cache
-       FileUtils.rm @storefile if File.exist? @storefile
+     def crawl
+       Dependencies.verify
+       todo[url] = Page.new url
+       process_todo while todo.any?
      end
 
      private
 
-     def crawl(url, opts={})
-       url = protocolize url
-       defaults = {
-         width: 1280,
-         height: 0,
-         depth: 1,
-         age: 86400,
-         folder: 'snaps',
-         name: '%{url}',
-         base: url,
-       }
-       urls = [url]
-
-       @opts = OpenStruct.new defaults.merge(opts)
+     def process_todo
+       $logger.debug "processing queue: %{green}#{todo.count} remaining%{reset}"
 
-       make_screenshot_dir @opts.folder
+       url, page = todo.shift
+       done.push url
 
-       @opts.depth.times do
-         urls = crawl_and_snap urls
+       if process_page page
+         register_sub_pages page.pages if page.depth < Config.depth
        end
      end
 
-     def crawl_and_snap(urls)
-       new_urls = []
-       urls.each do |url|
-         next if @done.include? url
-         @done << url
-         say "\n!txtgrn!-----> Visit: #{url}"
-         if @opts.only and url !~ /#{@opts.only}/
-           say " Snap: Skipping. Does not match regex"
-         else
-           snap url
+     def register_sub_pages(pages)
+       pages.each do |sub_page|
+         next if todo.has_key?(sub_page) or done.include?(sub_page)
+
+         if Config.url_whitelist and sub_page.path !~ /#{Config.url_whitelist}/
+           $logger.debug "ignoring %{purple}%{underlined}#{sub_page.url}%{reset}, reason: whitelist"
+           next
          end
-         new_urls += extract_urls_from url
-       end
-       new_urls
-     end
 
-     # Take a screenshot of a URL, unless we already did so recently
-     def snap(url)
-       file = image_path_for(url)
-       if file_fresh? file
-         say " Snap: Skipping. File exists and seems fresh"
-       else
-         snap!(url)
-       end
-     end
-
-     # Take a screenshot of the URL, even if file exists
-     def snap!(url)
-       say " !txtblu!Snap!!txtrst! Snapping picture... "
-       image_path = image_path_for url
+         if Config.url_blacklist and sub_page.path =~ /#{Config.url_blacklist}/
+           $logger.debug "ignoring %{purple}%{underlined}#{sub_page.url}%{reset}, reason: blacklist"
+           next
+         end
 
-       fetch_opts = { allowed_status_codes: [404, 401, 403] }
-       if @opts.selector
-         fetch_opts[:selector] = @opts.selector
-         fetch_opts[:full] = false
+         todo[sub_page.url] = sub_page
        end
-
-       webshot_capture url, image_path, fetch_opts
-       say "done"
      end
 
-     def webshot_capture(url, image_path, fetch_opts)
-       webshot_capture! url, image_path, fetch_opts
-     rescue => e
-       say "!txtred!FAILED"
-       say "!txtred! ! #{e.class}: #{e.message.strip}"
-     end
+     def process_page(page)
+       outfile = "#{Config.snaps_dir}/#{Config.name_template}.png" % { url: page.url.to_slug }
 
-     def webshot_capture!(url, image_path, fetch_opts)
-       hide_output do
-         webshot.capture url, image_path, fetch_opts do |magick|
-           magick.combine_options do |c|
-             c.background "white"
-             c.gravity 'north'
-             c.quality 100
-             c.extent @opts.height > 0 ? "#{@opts.width}x#{@opts.height}" : "#{@opts.width}x"
-           end
-         end
-       end
-     end
+       $logger.info "processing %{purple}%{underlined}#{page.url}%{reset}, depth: #{page.depth}"
 
-     def extract_urls_from(url)
-       cached = nil
-       @store.transaction { cached = @store[url] }
-       if cached
-         say " Crawl: Page was cached. Reading subsequent URLs from cache"
-         return cached
-       else
-         return extract_urls_from! url
+       if !page.valid?
+         $logger.debug "page #{page.path} is invalid, aborting process"
+         return false
        end
-     end
-
-     def extract_urls_from!(url)
-       say " !txtblu!Crawl!!txtrst! Extracting links... "
 
-       begin
-         response = HTTParty.get url
-         if response.success?
-           doc = Nokogiri::HTML response.body
-           links = doc.css('a')
-           links, warnings = normalize_links links
-           @store.transaction { @store[url] = links }
-           say "done"
-           warnings.each do |warning|
-             say "!txtylw! Warn: #{warning[:link]}"
-             say word_wrap " #{warning[:message]}"
-           end
-         else
-           links = []
-           say "!txtred!FAILED"
-           say "!txtred! ! HTTP Error: #{response.code} #{response.message.strip} at #{url}"
-         end
+       if file_fresh? outfile
+         $logger.info "screenshot for #{page.path} already exists"
+       else
+         $logger.info "%{bold}capturing screenshot for #{page.path}%{reset}"
+         page.save_screenshot outfile
        end
-       links
-     end
-
-     # mkdir the screenshots folder, if needed
-     def make_screenshot_dir(dir)
-       Dir.exist? dir or FileUtils.mkdir_p dir
-     end
 
-     # Convert any string to a proper handle
-     def handelize(str)
-       str.downcase.gsub(/[^a-z0-9]+/, '-')
+       true
      end
 
-     # Return proper image path for a UR
-     def image_path_for(url)
-       "#{@opts.folder}/#{@opts.name}.png" % { url: handelize(url) }
-     end
-
-     # Add protocol to a URL if neeed
-     def protocolize(url)
-       url =~ /^http/ ? url : "http://#{url}"
-     end
-
-     # Return true if the file exists and is not too old
      def file_fresh?(file)
-       @opts.age > 0 and File.exist?(file) and file_age(file) < @opts.age
+       Config.cache_life > 0 and File.exist?(file) and file_age(file) < Config.cache_life
      end
 
-     # Return file age in seconds
      def file_age(file)
        (Time.now - File.stat(file).mtime).to_i
      end
 
-     # Process an array of links and return a better one
-     def normalize_links(links)
-       extensions = "png|gif|jpg|pdf|zip"
-       beginnings = "mailto|tel"
-
-       links_array = []
-       warnings = []
-
-       links.each do |link|
-         link = link.attribute('href').to_s.dup
-
-         # Remove #hash
-         link.gsub!(/#.+$/, '')
-         next if link.empty?
-
-         # Remove links to specific extensions and protocols
-         next if link =~ /\.(#{extensions})(\?.*)?$/
-         next if link =~ /^(#{beginnings})/
-
-         # Strip spaces
-         link.strip!
-
-         # Convert relative links to absolute
-         begin
-           link = Addressable::URI.join( @opts.base, link ).to_s.dup
-         rescue => e
-           warnings << { link: link, message: "#{e.class} #{e.message}" }
-           next
-         end
-
-         # Keep only links in our base domain
-         next unless link.include? @opts.base
-
-         links_array << link
-       end
-
-       [links_array.uniq, warnings]
-     end
-
-     def doc
-       @doc ||= File.read docopt
+     def todo
+       @todo ||= {}
      end
 
-     def docopt
-       File.expand_path "docopt.txt", __dir__
+     def done
+       @done ||= []
      end
 
-     def opts_from_args(args)
-       opts = {}
-       %w[folder name selector only].each do |opt|
-         opts[opt.to_sym] = args["--#{opt}"] if args["--#{opt}"]
-       end
-
-       %w[age depth width height].each do |opt|
-         opts[opt.to_sym] = args["--#{opt}"].to_i if args["--#{opt}"]
-       end
-
-       opts
-     end
-
-     def webshot
-       @webshot ||= Webshot::Screenshot.instance
-     end
-
-     # The webshot gem messes with stdout/stderr streams so we keep it in
-     # check by using this method. Also, in some sites (e.g. uown.co) it
-     # prints some output to stdout, this is why we override $stdout for
-     # the duration of the run.
-     def hide_output
-       keep_stdout, keep_stderr = $stdout, $stderr
-       $stdout, $stderr = StringIO.new, StringIO.new
-       yield
-     ensure
-       $stdout, $stderr = keep_stdout, keep_stderr
-     end
    end
  end
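
The rewritten crawler drops the old depth-times loop in favor of a work queue: `todo` maps URLs to `Page` objects, `done` records visited URLs, and `process_todo` shifts one entry at a time until the queue drains, so each page is processed once no matter how many pages link to it. A generic sketch of the same pattern, with `neighbors` standing in for `page.pages` (names here are illustrative):

```ruby
# Hash-backed BFS over a link graph
def walk(start, neighbors)
  todo = { start => true }
  done = []

  while todo.any?
    node, _ = todo.shift   # Hash#shift removes the oldest (insertion-ordered) pair
    done.push node

    neighbors.call(node).each do |n|
      todo[n] = true unless todo.key?(n) || done.include?(n)
    end
  end

  done
end

# e.g. walk('a', ->(n) { graph.fetch(n, []) })
```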
data/lib/snapcrawl/dependencies.rb ADDED
@@ -0,0 +1,21 @@
+ require 'colsole'
+
+ module Snapcrawl
+   class Dependencies
+     class << self
+       include Colsole
+
+       def verify
+         return if @verified
+
+         $logger.debug 'verifying %{green}phantomjs%{reset} is present'
+         raise MissingPhantomJS unless command_exist? "phantomjs"
+
+         $logger.debug 'verifying %{green}imagemagick%{reset} is present'
+         raise MissingImageMagick unless command_exist? "convert"
+
+         @verified = true
+       end
+     end
+   end
+ end
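
`Dependencies.verify` memoizes the check, so repeated calls (the CLI and the crawler both call it) probe the PATH only once, and it raises the typed errors that `bin/snapcrawl` rescues. Code embedding the library can do the same:

```ruby
require 'snapcrawl'

begin
  Snapcrawl::Dependencies.verify
rescue Snapcrawl::MissingPhantomJS, Snapcrawl::MissingImageMagick => e
  abort "Missing prerequisite: #{e.class}"
end
```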
data/lib/snapcrawl/exceptions.rb CHANGED
@@ -1,4 +1,5 @@
  module Snapcrawl
    class MissingPhantomJS < StandardError; end
    class MissingImageMagick < StandardError; end
+   class ScreenshotError < StandardError; end
  end
data/lib/snapcrawl/log_helpers.rb ADDED
@@ -0,0 +1,57 @@
+ module Snapcrawl
+   module LogHelpers
+     SEVERITY_COLORS = {
+       'INFO' => :blue,
+       'WARN' => :yellow,
+       'ERROR' => :red,
+       'FATAL' => :red,
+       'DEBUG' => :cyan
+     }
+
+     def log_formatter
+       proc do |severity, _time, _prog, message|
+         severity_color = SEVERITY_COLORS[severity]
+
+         "%{#{severity_color}}#{severity.rjust 5}%{reset} : #{message}\n" % log_colors
+       end
+     end
+
+     def log_colors
+       @log_colors ||= log_colors!
+     end
+
+     def log_colors!
+       colors? ? actual_colors : empty_colors
+     end
+
+     def actual_colors
+       {
+         red: "\e[31m", green: "\e[32m", yellow: "\e[33m",
+         blue: "\e[34m", purple: "\e[35m", cyan: "\e[36m",
+         underlined: "\e[4m", bold: "\e[1m",
+         none: "", reset: "\e[0m"
+       }
+     end
+
+     def empty_colors
+       {
+         red: "", green: "", yellow: "",
+         blue: "", purple: "", cyan: "",
+         underlined: "", bold: "",
+         none: "", reset: ""
+       }
+     end
+
+     def colors?
+       if Config.log_color == 'auto'
+         tty?
+       else
+         Config.log_color
+       end
+     end
+
+     def tty?
+       ENV['TTY'] == 'on' ? true : ENV['TTY'] == 'off' ? false : $stdout.tty?
+     end
+   end
+ end
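
The `%{...}` tokens seen throughout the log messages are resolved by plain Ruby format-string interpolation: the formatter applies `% log_colors` to the final line, so tokens inside the message are substituted too, and when colors are off they are stripped because `empty_colors` maps every token to `""`. The mechanism in isolation:

```ruby
colors = { green: "\e[32m", reset: "\e[0m" }
plain  = { green: "", reset: "" }

# String#% with a hash fills %{...} placeholders, as in log_formatter above
puts "%{green}done%{reset}" % colors   # "done" in green
puts "%{green}done%{reset}" % plain    # plain "done", tokens stripped
```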
data/lib/snapcrawl/page.rb ADDED
@@ -0,0 +1,111 @@
+ require 'addressable/uri'
+ require 'fileutils'
+ require 'httparty'
+ require 'lightly'
+ require 'nokogiri'
+
+ module Snapcrawl
+   class Page
+     using StringRefinements
+
+     attr_reader :url, :depth
+
+     EXTENSION_BLACKLIST = "png|gif|jpg|pdf|zip"
+     PROTOCOL_BLACKLIST = "mailto|tel"
+
+     def initialize(url, depth: 0)
+       @url, @depth = url.protocolize, depth
+     end
+
+     def valid?
+       http_response&.success?
+     end
+
+     def site
+       @site ||= Addressable::URI.parse(url).site
+     end
+
+     def path
+       @path ||= Addressable::URI.parse(url).request_uri
+     end
+
+     def links
+       return nil unless valid?
+       doc = Nokogiri::HTML http_response.body
+       normalize_links doc.css('a')
+     end
+
+     def pages
+       return nil unless valid?
+       links.map { |link| Page.new link, depth: depth+1 }
+     end
+
+     def save_screenshot(outfile)
+       return false unless valid?
+       Screenshot.new(url).save "#{outfile}"
+     end
+
+     private
+
+     def http_response
+       @http_response ||= http_response!
+     end
+
+     def http_response!
+       response = cache.get(url) { HTTParty.get url }
+
+       if !response.success?
+         $logger.warn "http error on %{purple}%{underlined}#{url}%{reset}, code: %{yellow}#{response.code}%{reset}, message: #{response.message.strip}"
+       end
+
+       response
+
+     rescue => e
+       $logger.error "http error on %{purple}%{underlined}#{url}%{reset} - %{red}#{e.class}%{reset}: #{e.message}"
+       nil
+
+     end
+
+     def normalize_links(links)
+       result = []
+
+       links.each do |link|
+         valid_link = normalize_link link
+         result << valid_link if valid_link
+       end
+
+       result.uniq
+     end
+
+     def normalize_link(link)
+       link = link.attribute('href').to_s.dup
+
+       # Remove #hash
+       link.gsub!(/#.+$/, '')
+       return nil if link.empty?
+
+       # Remove links to specific extensions and protocols
+       return nil if link =~ /\.(#{EXTENSION_BLACKLIST})(\?.*)?$/
+       return nil if link =~ /^(#{PROTOCOL_BLACKLIST}):/
+
+       # Strip spaces
+       link.strip!
+
+       # Convert relative links to absolute
+       begin
+         link = Addressable::URI.join(url, link).to_s.dup
+       rescue => e
+         $logger.warn "%{red}#{e.class}%{reset}: #{e.message} on #{path} (link: #{link})"
+         return nil
+       end
+
+       # Keep only links in our base domain
+       return nil unless link.include? site
+       link
+     end
+
+     def cache
+       Lightly.new life: Config.cache_life
+     end
+   end
+ end
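
`Page` bundles fetching (with a Lightly-backed cache), validation, link normalization, and screenshot saving, so it is usable on its own. A short sketch with a placeholder URL:

```ruby
require 'snapcrawl'

page = Snapcrawl::Page.new 'example.com'   # protocolized to http://example.com
if page.valid?                             # false on HTTP errors (logged, not raised)
  puts page.links                          # absolute, same-site links only
  page.save_screenshot 'home.png'
end
```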
data/lib/snapcrawl/pretty_logger.rb ADDED
@@ -0,0 +1,11 @@
+ require 'logger'
+
+ module Snapcrawl
+   class PrettyLogger
+     extend LogHelpers
+
+     def self.new
+       Logger.new(STDOUT, formatter: log_formatter, level: Config.log_level)
+     end
+   end
+ end
data/lib/snapcrawl/refinements/pair_split.rb ADDED
@@ -0,0 +1,23 @@
+ module Snapcrawl
+   module PairSplit
+     refine Array do
+       def pair_split
+         map do |pair|
+           key, value = pair.split '='
+
+           value = if value =~ /^\d+$/
+             value.to_i
+           elsif ['no', 'false'].include? value
+             false
+           elsif ['yes', 'true'].include? value
+             true
+           else
+             value
+           end
+
+           [key, value]
+         end.to_h
+       end
+     end
+   end
+ end
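
This refinement is what turns the CLI's trailing `key=value` arguments into config tweaks, casting numeric strings to integers and yes/no/true/false strings to booleans:

```ruby
require 'snapcrawl'
using Snapcrawl::PairSplit

['depth=2', 'log_color=no', 'cache_dir=tmp'].pair_split
# => {"depth"=>2, "log_color"=>false, "cache_dir"=>"tmp"}
```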
data/lib/snapcrawl/refinements/string_refinements.rb ADDED
@@ -0,0 +1,13 @@
+ module Snapcrawl
+   module StringRefinements
+     refine String do
+       def to_slug
+         downcase.gsub(/[^a-z0-9]+/, '-')
+       end
+
+       def protocolize
+         self =~ /^http/ ? self : "http://#{self}"
+       end
+     end
+   end
+ end
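
These two helpers replace private methods the old crawler carried (`handelize` and `protocolize`):

```ruby
require 'snapcrawl'
using Snapcrawl::StringRefinements

'Example.com/About Us'.to_slug     # => "example-com-about-us"
'example.com'.protocolize          # => "http://example.com"
'https://example.com'.protocolize  # already has a protocol; unchanged
```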
data/lib/snapcrawl/screenshot.rb ADDED
@@ -0,0 +1,62 @@
+ require 'webshot'
+
+ module Snapcrawl
+   class Screenshot
+     using StringRefinements
+
+     attr_reader :url
+
+     def initialize(url)
+       @url = url
+     end
+
+     def save(outfile = nil)
+       outfile ||= "#{url.to_slug}.png"
+
+       fetch_opts = { allowed_status_codes: [404, 401, 403] }
+       if Config.selector
+         fetch_opts[:selector] = Config.selector
+         fetch_opts[:full] = false
+       end
+
+       webshot_capture url, outfile, fetch_opts
+     end
+
+     private
+
+     def webshot_capture(url, image_path, fetch_opts)
+       webshot_capture! url, image_path, fetch_opts
+     rescue => e
+       raise ScreenshotError, "#{e.class} #{e.message}"
+     end
+
+     def webshot_capture!(url, image_path, fetch_opts)
+       hide_output do
+         webshot.capture url, image_path, fetch_opts do |magick|
+           magick.combine_options do |c|
+             c.background "white"
+             c.gravity 'north'
+             c.quality 100
+             c.extent Config.height > 0 ? "#{Config.width}x#{Config.height}" : "#{Config.width}x"
+           end
+         end
+       end
+     end
+
+     def webshot
+       @webshot ||= Webshot::Screenshot.instance
+     end
+
+     # The webshot gem messes with stdout/stderr streams so we keep it in
+     # check by using this method. Also, in some sites (e.g. uown.co) it
+     # prints some output to stdout, this is why we override $stdout for
+     # the duration of the run.
+     def hide_output
+       keep_stdout, keep_stderr = $stdout, $stderr
+       $stdout, $stderr = StringIO.new, StringIO.new
+       yield
+     ensure
+       $stdout, $stderr = keep_stdout, keep_stderr
+     end
+   end
+ end
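
`Screenshot` is now the only class that touches the webshot gem, so a single URL can be captured without crawling; the paths below are illustrative:

```ruby
require 'snapcrawl'

shot = Snapcrawl::Screenshot.new 'http://example.com'
shot.save 'example.png'   # sized per Config.width/height; raises ScreenshotError on failure
```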
data/lib/snapcrawl/templates/config.yml ADDED
@@ -0,0 +1,41 @@
+ # All values below are the default values
+
+ # log level (0-4) 0=DEBUG 1=INFO 2=WARN 3=ERROR 4=FATAL
+ log_level: 1
+
+ # log_color (yes, no, auto)
+ # yes = always show log color
+ # no = never use colors
+ # auto = only use colors when running in an interactive terminal
+ log_color: auto
+
+ # number of levels to crawl, 0 means capture only the root URL
+ depth: 1
+
+ # screenshot width in pixels
+ width: 1280
+
+ # screenshot height in pixels, 0 means the entire height
+ height: 0
+
+ # number of seconds to consider the page cache and its screenshot fresh
+ cache_life: 86400
+
+ # where to store the HTML page cache
+ cache_dir: cache
+
+ # where to store screenshots
+ snaps_dir: snaps
+
+ # screenshot filename template, where '%{url}' will be replaced with a
+ # slug version of the URL (no need to include the .png extension)
+ name_template: '%{url}'
+
+ # urls not matching this regular expression will be ignored
+ url_whitelist:
+
+ # urls matching this regular expression will be ignored
+ url_blacklist:
+
+ # take a screenshot of this CSS selector only
+ css_selector:
data/lib/snapcrawl/templates/docopt.txt ADDED
@@ -0,0 +1,26 @@
+ Snapcrawl
+
+ Usage:
+   snapcrawl URL [--config FILE] [SETTINGS...]
+   snapcrawl -h | --help
+   snapcrawl -v | --version
+
+ Options:
+   -c, --config FILE
+     Path to config file, with or without the .yml extension
+     A sample file will be created if not found
+     [default: snapcrawl.yml]
+
+   -h, --help
+     Show this screen
+
+   -v, --version
+     Show version number
+
+ Settings:
+   You may provide any of the options available in the config as 'key=value'.
+
+ Examples:
+   snapcrawl example.com
+   snapcrawl example.com --config simple
+   snapcrawl example.com depth=1 log_level=2
data/lib/snapcrawl/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Snapcrawl
-   VERSION = "0.4.4"
+   VERSION = "0.5.0.rc1"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: snapcrawl
  version: !ruby/object:Gem::Version
-   version: 0.4.4
+   version: 0.5.0.rc1
  platform: ruby
  authors:
  - Danny Ben Shitrit
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-03-12 00:00:00.000000000 Z
+ date: 2020-03-14 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: colsole
@@ -16,48 +16,42 @@ dependencies:
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '0.5'
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: 0.5.4
+         version: '0.7'
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '0.5'
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: 0.5.4
+         version: '0.7'
  - !ruby/object:Gem::Dependency
    name: docopt
    requirement: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '0.5'
+         version: '0.6'
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '0.5'
+         version: '0.6'
  - !ruby/object:Gem::Dependency
    name: nokogiri
    requirement: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '1.6'
+         version: '1.10'
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '1.6'
+         version: '1.10'
  - !ruby/object:Gem::Dependency
    name: webshot
    requirement: !ruby/object:Gem::Requirement
@@ -78,14 +72,14 @@ dependencies:
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '0.17'
+         version: '0.18'
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - "~>"
        - !ruby/object:Gem::Version
-         version: '0.17'
+         version: '0.18'
  - !ruby/object:Gem::Dependency
    name: addressable
    requirement: !ruby/object:Gem::Requirement
@@ -100,6 +94,34 @@ dependencies:
      - - "~>"
        - !ruby/object:Gem::Version
          version: '2.7'
+ - !ruby/object:Gem::Dependency
+   name: lightly
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.3'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.3'
+ - !ruby/object:Gem::Dependency
+   name: sting
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.4'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.4'
  description: Snapcrawl is a command line utility for crawling a website and saving
    screenshots.
  email: db@dannyben.com
@@ -111,9 +133,19 @@ files:
  - README.md
  - bin/snapcrawl
  - lib/snapcrawl.rb
+ - lib/snapcrawl/cli.rb
+ - lib/snapcrawl/config.rb
  - lib/snapcrawl/crawler.rb
- - lib/snapcrawl/docopt.txt
+ - lib/snapcrawl/dependencies.rb
  - lib/snapcrawl/exceptions.rb
+ - lib/snapcrawl/log_helpers.rb
+ - lib/snapcrawl/page.rb
+ - lib/snapcrawl/pretty_logger.rb
+ - lib/snapcrawl/refinements/pair_split.rb
+ - lib/snapcrawl/refinements/string_refinements.rb
+ - lib/snapcrawl/screenshot.rb
+ - lib/snapcrawl/templates/config.yml
+ - lib/snapcrawl/templates/docopt.txt
  - lib/snapcrawl/version.rb
  homepage: https://github.com/DannyBen/snapcrawl
  licenses:
@@ -130,9 +162,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
        version: '2.3'
  required_rubygems_version: !ruby/object:Gem::Requirement
    requirements:
-   - - ">="
+   - - ">"
      - !ruby/object:Gem::Version
-       version: '0'
+       version: 1.3.1
  requirements: []
  rubygems_version: 3.0.3
  signing_key:
data/lib/snapcrawl/docopt.txt DELETED
@@ -1,48 +0,0 @@
- Snapcrawl
-
- Usage:
-   snapcrawl URL [options]
-   snapcrawl -h | --help
-   snapcrawl -v | --version
-
- Options:
-   -f, --folder PATH
-     Where to save screenshots [default: snaps]
-
-   -n, --name TEMPLATE
-     Filename template. Include the string '%{url}' anywhere in the name to
-     use the captured URL in the filename [default: %{url}]
-
-   -a, --age SECONDS
-     Number of seconds to consider screenshots fresh [default: 86400]
-
-   -d, --depth LEVELS
-     Number of levels to crawl [default: 1]
-
-   -W, --width PIXELS
-     Screen width in pixels [default: 1280]
-
-   -H, --height PIXELS
-     Screen height in pixels. Use 0 to capture the full page [default: 0]
-
-   -s, --selector SELECTOR
-     CSS selector to capture
-
-   -o, --only REGEX
-     Include only URLs that match REGEX
-
-   -h, --help
-     Show this screen
-
-   -v, --version
-     Show version number
-
- Examples:
-   snapcrawl example.com
-   snapcrawl example.com -d2 -fscreens
-   snapcrawl example.com -d2 > out.txt 2> err.txt &
-   snapcrawl example.com -W360 -H480
-   snapcrawl example.com --selector "#main-content"
-   snapcrawl example.com --only "products|collections"
-   snapcrawl example.com --name "screenshot-%{url}"
-   snapcrawl example.com --name "`date +%Y%m%d`_%{url}"