broken_link_finder 0.9.5 → 0.10.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 4faeb020322d2882bb43c777d0743f964c364261f33bf62712f7e366f1230762
-  data.tar.gz: 133e61eb8880de6464c2537372d9ac4e82fbdfa36e113e190a4afa141fd77b0d
+  metadata.gz: 7a53784c1bd2f75c18b3492ea782b4cc2e229a94f89afcf33b60ef633512554e
+  data.tar.gz: 393dca220b7f00d72314c93e7b877e0412afdf784fa2e563bbecb2dc6c6b29f7
 SHA512:
-  metadata.gz: 69dc743ce7965125e5c0f5edff817c83f558ff954660a405ef838f7b05437217ca5d287e8d0aa789265b74b0e05e035488ebfe7604a5a1cc92dba67caa331e25
-  data.tar.gz: 667a2341c12d7b39475391e258827b6b6bd425d141e9349c6c0111b7871432376f91247c1673d8980520afd8d7d4865e38c3cb3f6d3d51a4ae56af8ed617206d
+  metadata.gz: c0d304e5b0a9258265c5c084c0a6e5819c169ba8eb02b3c6317a37784a9ca12982b0fc520c3cca1060fde60126ee936708d7891c69133c5d72c9c0287a79b3f5
+  data.tar.gz: c21a4aec2c077e2617fb625debad28f746148ad98229a27a590a4412601e30759c709aa3a6e6d80e81c16160e16968fc0392181fc9c75e4da06578452f7c5ab6
data/CHANGELOG.md CHANGED
@@ -9,6 +9,17 @@
 - ...
 ---
 
+## v0.10.0
+### Added
+- A `--html` flag for the `crawl` executable command, which produces an HTML report (instead of text).
+- A 'retry' mechanism for any broken links found. This is essentially a verification step before generating a report.
+- `Finder#crawl_stats` for info such as crawl duration, total links crawled etc.
+### Changed/Removed
+- The API has changed somewhat. See the [docs](https://www.rubydoc.info/gems/broken_link_finder) for the up-to-date code signatures if you're using `broken_link_finder` outside of its executable.
+### Fixed
+- ...
+---
+
 ## v0.9.5
 ### Added
 - ...
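Taken together, a minimal sketch of the new v0.10.0 API (the URL and values are illustrative; `Finder#report` and `Finder#crawl_stats` appear in the `finder.rb` diff further down):

```ruby
require 'broken_link_finder'

finder = BrokenLinkFinder::Finder.new
finder.crawl_site 'http://txti.es'  # Or Finder#crawl_page for a single page.
finder.report(type: :html)          # The --html equivalent; defaults to :text.
finder.crawl_stats[:duration]       # => 7.88 (seconds), for example.
```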
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    broken_link_finder (0.9.5)
+    broken_link_finder (0.10.0)
       thor (~> 0.20)
       thread (~> 0.2)
       wgit (~> 0.5)
@@ -18,7 +18,7 @@ GEM
     safe_yaml (~> 1.0.0)
     ethon (0.12.0)
       ffi (>= 1.3.0)
-    ffi (1.11.2)
+    ffi (1.11.3)
     hashdiff (1.0.0)
     maxitest (3.4.0)
       minitest (>= 5.0.0, < 5.13.0)
@@ -65,4 +65,4 @@ RUBY VERSION
    ruby 2.5.3p105
 
 BUNDLED WITH
-   2.0.1
+   2.0.2
data/README.md CHANGED
@@ -57,7 +57,7 @@ Installing this gem installs the `broken_link_finder` executable into your `$PAT
 
     $ broken_link_finder crawl http://txti.es
 
-Adding the `-r` flag would crawl the entire `txti.es` site, not just its index page.
+Adding the `--recursive` flag would crawl the entire `txti.es` site, not just its index page.
 
 See the [output](#Output) section below for an example of a site with broken links.
 
@@ -76,7 +76,7 @@ require 'broken_link_finder'
 
 finder = BrokenLinkFinder.new
 finder.crawl_site 'http://txti.es' # Or use Finder#crawl_page for a single webpage.
-finder.pretty_print_link_report # Or use Finder#broken_links and Finder#ignored_links
+finder.report # Or use Finder#broken_links and Finder#ignored_links
               # for direct access to the link Hashes.
 ```
 
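Because `report` writes to any stream responding to `:puts` and `:print` (STDOUT by default, per the Reporter diff below), the HTML output can be captured in a file. A hypothetical sketch (`report.html` is an illustrative filename):

```ruby
require 'broken_link_finder'

finder = BrokenLinkFinder.new
finder.crawl_site 'http://txti.es'

# A File object responds to both :puts and :print, satisfying the
# stream contract checked in Reporter#initialize.
File.open('report.html', 'w') { |file| finder.report(file, type: :html) }
```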
@@ -91,13 +91,15 @@ See the full source code documentation [here](https://www.rubydoc.info/gems/brok
 If broken links are found then the output will look something like:
 
 ```text
+Crawled http://txti.es (7 page(s) in 7.88 seconds)
+
 Found 6 broken link(s) across 2 page(s):
 
 The following broken links were found on 'http://txti.es/about':
 http://twitter.com/thebarrytone
+/doesntexist
 http://twitter.com/nwbld
-http://twitter.com/txties
-https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=84L4BDS86FBUU
+twitter.com/txties
 
 The following broken links were found on 'http://txti.es/how':
 http://en.wikipedia.org/wiki/Markdown
@@ -105,14 +107,16 @@ http://imgur.com
 
 Ignored 3 unsupported link(s) across 2 page(s), which you should check manually:
 
-The following links were ignored on http://txti.es:
+The following links were ignored on 'http://txti.es':
 tel:+13174562564
 mailto:big.jim@jmail.com
 
-The following links were ignored on http://txti.es/contact:
+The following links were ignored on 'http://txti.es/contact':
 ftp://server.com
 ```
 
+You can provide the `--html` flag if you'd prefer an HTML-based report.
+
 ## Contributing
 
 Bug reports and feature requests are welcome on [GitHub](https://github.com/michaeltelford/broken-link-finder). Just raise an issue.
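Since the CLI's report goes to STDOUT, the new flag composes with shell redirection; a hypothetical invocation (the redirection and filename are illustrative, not part of the gem):

    $ broken_link_finder crawl --recursive --html http://txti.es > report.html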
@@ -128,11 +132,11 @@ After checking out the repo, run `bin/setup` to install dependencies. Then, run
 To install this gem onto your local machine, run `bundle exec rake install`.
 
 To release a new gem version:
-- Update the deps in the `*.gemspec` if necessary
-- Update the version number in `version.rb` and add the new version to the `CHANGELOG`
-- Run `bundle install`
-- Run `bundle exec rake test` ensuring all tests pass
-- Run `bundle exec rake compile` ensuring no warnings
-- Run `bundle exec rake install && rbenv rehash`
-- Manually test the executable
-- Run `bundle exec rake release[origin]`
+- Update the deps in the `*.gemspec`, if necessary.
+- Update the version number in `version.rb` and add the new version to the `CHANGELOG`.
+- Run `bundle install`.
+- Run `bundle exec rake test` ensuring all tests pass.
+- Run `bundle exec rake compile` ensuring no warnings.
+- Run `bundle exec rake install && rbenv rehash`.
+- Manually test the executable.
+- Run `bundle exec rake release[origin]`.
data/benchmark.rb CHANGED
@@ -10,15 +10,19 @@ finder = BrokenLinkFinder::Finder.new
 puts Benchmark.measure { finder.crawl_site url }
 puts "Links crawled: #{finder.total_links_crawled}"
 
-# http://txti.es page crawl
-# Pre threading: 17.5 seconds
-# Post threading: 7.5 seconds
+# http://txti.es page crawl with threading
+# Pre: 17.5 seconds
+# Post: 7.5 seconds
 
-# http://txti.es post threading - page vs site crawl
+# http://txti.es with threading - page vs site crawl
 # Page: 9.526981
 # Site: 9.732416
 # Multi-threading crawl_site now yields the same time as a single page
 
-# Large site crawl - post all link recording functionality
+# Large site crawl - all link recording functionality
 # Pre: 608 seconds with 7665 links crawled
 # Post: 355 seconds with 1099 links crawled
+
+# Large site crawl - retry mechanism
+# Pre: 140 seconds
+# Post: 170 seconds
data/bin/console CHANGED
@@ -5,20 +5,10 @@ require 'bundler/setup'
 require 'pry'
 require 'byebug'
 require 'broken_link_finder'
+require 'logger'
 
-# Monkey patch and log all HTTP requests made during the console.
-module Typhoeus
-  singleton_class.class_eval do
-    alias_method :orig_get, :get
-  end
-
-  def self.get(base_url, options = {})
-    puts "[typhoeus] Sending GET: #{base_url}"
-    resp = orig_get(base_url, options)
-    puts "[typhoeus] Status: #{resp.code} (#{resp.body.length} bytes in #{resp.total_time} seconds)"
-    resp
-  end
-end
+# Logs all HTTP requests.
+Wgit.logger.level = Logger::DEBUG
 
 # Call reload to load all recent code changes.
 def reload
@@ -39,6 +29,6 @@ by_link = Finder.new sort: :link
 finder = by_page
 
 # Start the console.
-puts "\nbroken_link_finder v#{BrokenLinkFinder::VERSION}"
+puts "\nbroken_link_finder v#{BrokenLinkFinder::VERSION} (#{Wgit.version_str})"
 
 binding.pry
data/exe/broken_link_finder CHANGED
@@ -9,12 +9,14 @@ class BrokenLinkFinderCLI < Thor
   desc 'crawl [URL]', 'Find broken links at the URL'
   option :recursive, type: :boolean, aliases: [:r], default: false, desc: 'Crawl the entire site.'
   option :threads, type: :numeric, aliases: [:t], default: BrokenLinkFinder::DEFAULT_MAX_THREADS, desc: 'Max number of threads to use when crawling recursively; 1 thread per web page.'
+  option :html, type: :boolean, aliases: [:h], default: false, desc: 'Produce a HTML report (instead of text)'
   option :sort_by_link, type: :boolean, aliases: [:l], default: false, desc: 'Makes report more concise if there are more pages crawled than broken links found. Use with -r on medium/large sites.'
   option :verbose, type: :boolean, aliases: [:v], default: false, desc: 'Display all ignored links.'
   option :concise, type: :boolean, aliases: [:c], default: false, desc: 'Display only a summary of broken links.'
   def crawl(url)
     url = "http://#{url}" unless url.start_with?('http')
 
+    report_type = options[:html] ? :html : :text
     sort_by = options[:sort_by_link] ? :link : :page
     max_threads = options[:threads]
     broken_verbose = !options[:concise]
@@ -22,8 +24,9 @@ class BrokenLinkFinderCLI < Thor
 
     finder = BrokenLinkFinder::Finder.new(sort: sort_by, max_threads: max_threads)
     options[:recursive] ? finder.crawl_site(url) : finder.crawl_page(url)
-    finder.pretty_print_link_report(
-      broken_verbose: broken_verbose,
+    finder.report(
+      type: report_type,
+      broken_verbose: broken_verbose,
       ignored_verbose: ignored_verbose
     )
   rescue Exception => e
data/lib/broken_link_finder.rb CHANGED
@@ -2,8 +2,12 @@
 
 require 'wgit'
 require 'wgit/core_ext'
+require 'thread/pool'
+require 'set'
 
 require_relative './broken_link_finder/wgit_extensions'
 require_relative './broken_link_finder/version'
-require_relative './broken_link_finder/reporter'
+require_relative './broken_link_finder/reporter/reporter'
+require_relative './broken_link_finder/reporter/text_reporter'
+require_relative './broken_link_finder/reporter/html_reporter'
 require_relative './broken_link_finder/finder'
data/lib/broken_link_finder/finder.rb CHANGED
@@ -1,9 +1,5 @@
 # frozen_string_literal: true
 
-require_relative 'reporter'
-require 'thread/pool'
-require 'set'
-
 module BrokenLinkFinder
   DEFAULT_MAX_THREADS = 100
 
@@ -13,7 +9,7 @@ module BrokenLinkFinder
   end
 
   class Finder
-    attr_reader :sort, :broken_links, :ignored_links, :total_links_crawled, :max_threads
+    attr_reader :sort, :max_threads, :broken_links, :ignored_links, :crawl_stats
 
     # Creates a new Finder instance.
     def initialize(sort: :page, max_threads: BrokenLinkFinder::DEFAULT_MAX_THREADS)
@@ -25,35 +21,38 @@ module BrokenLinkFinder
       @lock = Mutex.new
       @crawler = Wgit::Crawler.new
 
-      clear_links
+      reset_crawl
     end
 
     # Clear/empty the link collection Hashes.
-    def clear_links
+    def reset_crawl
       @broken_links = {}
       @ignored_links = {}
-      @total_links_crawled = 0
-      @all_broken_links = Set.new
-      @all_intact_links = Set.new
+      @all_broken_links = Set.new # Used to prevent crawling a link twice.
+      @all_intact_links = Set.new # "
+      @broken_link_map = {} # Maps a link to its absolute form.
+      @crawl_stats = {} # Records crawl stats e.g. duration etc.
     end
 
     # Finds broken links within a single page and appends them to the
     # @broken_links array. Returns true if at least one broken link was found.
     # Access the broken links afterwards with Finder#broken_links.
    def crawl_url(url)
-      clear_links
+      reset_crawl
 
-      url = url.to_url
-      doc = @crawler.crawl(url)
+      start = Time.now
+      url   = url.to_url
+      doc   = @crawler.crawl(url)
 
       # Ensure the given page url is valid.
       raise "Invalid or broken URL: #{url}" unless doc
 
       # Get all page links and determine which are broken.
       find_broken_links(doc)
+      retry_broken_links
 
       sort_links
-      set_total_links_crawled
+      set_crawl_stats(url: url, pages_crawled: [url], start: start)
 
       @broken_links.any?
     end
@@ -63,15 +62,16 @@ module BrokenLinkFinder
     # at least one broken link was found and an Array of all pages crawled.
     # Access the broken links afterwards with Finder#broken_links.
     def crawl_site(url)
-      clear_links
+      reset_crawl
 
-      url = url.to_url
-      pool = Thread.pool(@max_threads)
-      crawled_pages = []
+      start   = Time.now
+      url     = url.to_url
+      pool    = Thread.pool(@max_threads)
+      crawled = Set.new
 
       # Crawl the site's HTML web pages looking for links.
       externals = @crawler.crawl_site(url) do |doc|
-        crawled_pages << doc.url
+        crawled << doc.url
         next unless doc
 
         # Start a thread for each page, checking for broken links.
@@ -83,30 +83,31 @@ module BrokenLinkFinder
 
       # Wait for all threads to finish.
       pool.shutdown
+      retry_broken_links
 
       sort_links
-      set_total_links_crawled
+      set_crawl_stats(url: url, pages_crawled: crawled.to_a, start: start)
 
-      [@broken_links.any?, crawled_pages.uniq]
+      @broken_links.any?
     end
 
     # Pretty prints the link report into a stream e.g. STDOUT or a file,
     # anything that respond_to? :puts. Defaults to STDOUT.
-    # Returns true if there were broken links and vice versa.
-    def pretty_print_link_report(
-      stream = STDOUT,
-      broken_verbose: true,
-      ignored_verbose: false
-    )
-      reporter = BrokenLinkFinder::Reporter.new(
-        stream, @sort, @broken_links, @ignored_links
-      )
-      reporter.pretty_print_link_report(
-        broken_verbose: broken_verbose,
-        ignored_verbose: ignored_verbose
-      )
-
-      @broken_links.any?
+    def report(stream = STDOUT,
+               type: :text, broken_verbose: true, ignored_verbose: false)
+      klass = case type
+              when :text
+                BrokenLinkFinder::TextReporter
+              when :html
+                BrokenLinkFinder::HTMLReporter
+              else
+                raise "type: must be :text or :html, not: :#{type}"
+              end
+
+      reporter = klass.new(stream, @sort, @broken_links,
+                           @ignored_links, @broken_link_map, @crawl_stats)
+      reporter.call(broken_verbose: broken_verbose,
+                    ignored_verbose: ignored_verbose)
     end
 
     private
@@ -117,11 +118,11 @@
 
       # Iterate over the supported links checking if they're broken or not.
       links.each do |link|
-        # Check if the link has already been processed previously.
+        # Skip if the link has been processed previously.
        next if @all_intact_links.include?(link)
 
        if @all_broken_links.include?(link)
-          append_broken_link(page.url, link)
+          append_broken_link(page.url, link) # Record on which page.
          next
        end
 
@@ -129,10 +130,8 @@
        link_doc = crawl_link(page, link)
 
        # Determine if the crawled link is broken or not.
-        if link_doc.nil? ||
-           @crawler.last_response.not_found? ||
-           has_broken_anchor(link_doc)
-          append_broken_link(page.url, link)
+        if link_broken?(link_doc)
+          append_broken_link(page.url, link, doc: page)
        else
          @lock.synchronize { @all_intact_links << link }
        end
@@ -141,6 +140,17 @@
       nil
     end
 
+    # Implements a retry mechanism for each of the broken links found.
+    # Removes any broken links found to be working OK.
+    def retry_broken_links
+      sleep(0.5) # Give the servers a break, then retry the links.
+
+      @broken_link_map.each do |link, href|
+        doc = @crawler.crawl(href)
+        remove_broken_link(link) unless link_broken?(doc)
+      end
+    end
+
     # Report and reject any non supported links. Any link that is absolute and
     # doesn't start with 'http' is unsupported e.g. 'mailto:blah' etc.
     def get_supported_links(doc)
@@ -153,12 +163,17 @@
       end
     end
 
-    # Makes the link absolute and crawls it, returning its Wgit::Document.
+    # Make the link absolute and crawl it, returning its Wgit::Document.
     def crawl_link(doc, link)
       link = link.prefix_base(doc)
       @crawler.crawl(link)
     end
 
+    # Return if the crawled link is broken or not.
+    def link_broken?(doc)
+      doc.nil? || @crawler.last_response.not_found? || has_broken_anchor(doc)
+    end
+
     # Returns true if the link is/contains a broken anchor/fragment.
     def has_broken_anchor(doc)
       raise 'link document is nil' unless doc
@@ -170,7 +185,8 @@
     end
 
     # Append key => [value] to @broken_links.
-    def append_broken_link(url, link)
+    # If doc: is provided then the link will be recorded in absolute form.
+    def append_broken_link(url, link, doc: nil)
       key, value = get_key_value(url, link)
 
       @lock.synchronize do
@@ -178,6 +194,23 @@
        @broken_links[key] << value
 
        @all_broken_links << link
+
+        @broken_link_map[link] = link.prefix_base(doc) if doc
+      end
+    end
+
+    # Remove the broken_link from the necessary collections.
+    def remove_broken_link(link)
+      @lock.synchronize do
+        if @sort == :page
+          @broken_links.each { |_k, links| links.delete(link) }
+          @broken_links.delete_if { |_k, links| links.empty? }
+        else
+          @broken_links.delete(link)
+        end
+
+        @all_broken_links.delete(link)
+        @all_intact_links << link
       end
     end
 
@@ -217,12 +250,15 @@
     end
 
     # Sets and returns the total number of links crawled.
-    def set_total_links_crawled
-      @total_links_crawled = @all_broken_links.size + @all_intact_links.size
+    def set_crawl_stats(url:, pages_crawled:, start:)
+      @crawl_stats[:url]           = url
+      @crawl_stats[:pages_crawled] = pages_crawled
+      @crawl_stats[:num_pages]     = pages_crawled.size
+      @crawl_stats[:num_links]     = @all_broken_links.size + @all_intact_links.size
+      @crawl_stats[:duration]      = Time.now - start
     end
 
-    alias crawl_page crawl_url
-    alias crawl_r crawl_site
-    alias pretty_print_link_summary pretty_print_link_report
+    alias crawl_page crawl_url
+    alias crawl_r    crawl_site
   end
 end
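From `set_crawl_stats` above, `Finder#crawl_stats` returns a Hash of this shape after a crawl (the values shown here are illustrative):

```ruby
finder.crawl_stats
# => {
#      url:           "http://txti.es",
#      pages_crawled: ["http://txti.es", "http://txti.es/about"],
#      num_pages:     2,
#      num_links:     42,
#      duration:      7.88 # Seconds.
#    }
```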
data/lib/broken_link_finder/reporter/html_reporter.rb ADDED
@@ -0,0 +1,134 @@
+# frozen_string_literal: true
+
+module BrokenLinkFinder
+  class HTMLReporter < Reporter
+    # Creates a new HTMLReporter instance.
+    # stream is any Object that responds to :puts and :print.
+    def initialize(stream, sort,
+                   broken_links, ignored_links,
+                   broken_link_map, crawl_stats)
+      super
+    end
+
+    # Pretty print a report detailing the full link summary.
+    def call(broken_verbose: true, ignored_verbose: false)
+      puts '<div class="broken_link_finder_report">'
+
+      report_crawl_summary
+      report_broken_links(verbose: broken_verbose)
+      report_ignored_links(verbose: ignored_verbose)
+
+      puts '</div>'
+
+      nil
+    end
+
+    private
+
+    # Report a summary of the overall crawl.
+    def report_crawl_summary
+      puts format(
+        '<p class="crawl_summary">Crawled %s (%s page(s) in %s seconds)</p>',
+        @crawl_stats[:url],
+        @crawl_stats[:num_pages],
+        @crawl_stats[:duration]&.truncate(2)
+      )
+    end
+
+    # Report a summary of the broken links.
+    def report_broken_links(verbose: true)
+      puts '<div class="broken_links">'
+
+      if @broken_links.empty?
+        puts_summary 'Good news, there are no broken links!', type: :broken
+      else
+        num_pages, num_links = get_hash_stats(@broken_links)
+        puts_summary "Found #{num_links} broken link(s) across #{num_pages} page(s):", type: :broken
+
+        @broken_links.each do |key, values|
+          puts_group(key, type: :broken) # Puts the opening <p> element.
+
+          if verbose || (values.length <= NUM_VALUES)
+            values.each { |value| puts_group_item value, type: :broken }
+          else # Only print N values and summarise the rest.
+            NUM_VALUES.times { |i| puts_group_item values[i], type: :broken }
+
+            objects = sort_by_page? ? 'link(s)' : 'page(s)'
+            puts "+ #{values.length - NUM_VALUES} other #{objects}, remove --concise to see them all<br />"
+          end
+
+          puts '</p>'
+        end
+      end
+
+      puts '</div>'
+    end
+
+    # Report a summary of the ignored links.
+    def report_ignored_links(verbose: false)
+      puts '<div class="ignored_links">'
+
+      if @ignored_links.any?
+        num_pages, num_links = get_hash_stats(@ignored_links)
+        puts_summary "Ignored #{num_links} unsupported link(s) across #{num_pages} page(s), which you should check manually:", type: :ignored
+
+        @ignored_links.each do |key, values|
+          puts_group(key, type: :ignored) # Puts the opening <p> element.
+
+          if verbose || (values.length <= NUM_VALUES)
+            values.each { |value| puts_group_item value, type: :ignored }
+          else # Only print N values and summarise the rest.
+            NUM_VALUES.times { |i| puts_group_item values[i], type: :ignored }
+
+            objects = sort_by_page? ? 'link(s)' : 'page(s)'
+            puts "+ #{values.length - NUM_VALUES} other #{objects}, use --verbose to see them all<br />"
+          end
+
+          puts '</p>'
+        end
+      end
+
+      puts '</div>'
+    end
+
+    def puts_summary(text, type:)
+      klass = (type == :broken) ? 'broken_links_summary' : 'ignored_links_summary'
+      puts "<p class=\"#{klass}\">#{text}</p>"
+    end
+
+    def puts_group(link, type:)
+      href = build_url(link)
+      a_element = "<a href=\"#{href}\">#{link}</a>"
+
+      case type
+      when :broken
+        msg = sort_by_page? ?
+          "The following broken links were found on '#{a_element}':" :
+          "The broken link '#{a_element}' was found on the following pages:"
+        klass = 'broken_links_group'
+      when :ignored
+        msg = sort_by_page? ?
+          "The following links were ignored on '#{a_element}':" :
+          "The link '#{a_element}' was ignored on the following pages:"
+        klass = 'ignored_links_group'
+      else
+        raise "type: must be :broken or :ignored, not: #{type}"
+      end
+
+      puts "<p class=\"#{klass}\">"
+      puts msg + '<br />'
+    end
+
+    def puts_group_item(value, type:)
+      klass = (type == :broken) ? 'broken_links_group_item' : 'ignored_links_group_item'
+      puts "<a class=\"#{klass}\" href=\"#{build_url(value)}\">#{value}</a><br />"
+    end
+
+    def build_url(link)
+      return link if link.to_url.absolute?
+      @broken_link_map.fetch(link)
+    end
+
+    alias_method :report, :call
+  end
+end
data/lib/broken_link_finder/reporter/reporter.rb ADDED
@@ -0,0 +1,77 @@
+# frozen_string_literal: true
+
+module BrokenLinkFinder
+  # Generic reporter class to be inherited from by format specific reporters.
+  class Reporter
+    # The amount of pages/links to display when verbose is false.
+    NUM_VALUES = 3
+
+    # Creates a new Reporter instance.
+    # stream is any Object that responds to :puts and :print.
+    def initialize(stream, sort,
+                   broken_links, ignored_links,
+                   broken_link_map, crawl_stats)
+      unless stream.respond_to?(:puts) && stream.respond_to?(:print)
+        raise 'stream must respond_to? :puts and :print'
+      end
+      raise "sort by either :page or :link, not #{sort}" \
+      unless %i[page link].include?(sort)
+
+      @stream = stream
+      @sort = sort
+      @broken_links = broken_links
+      @ignored_links = ignored_links
+      @broken_link_map = broken_link_map
+      @crawl_stats = crawl_stats
+    end
+
+    # Pretty print a report detailing the full link summary.
+    def call(broken_verbose: true, ignored_verbose: false)
+      raise 'Not implemented by parent class'
+    end
+
+    protected
+
+    # Return true if the sort is by page.
+    def sort_by_page?
+      @sort == :page
+    end
+
+    # Returns the key/value statistics of hash e.g. the number of keys and
+    # combined values. The hash should be of the format: { 'str' => [...] }.
+    # Use like: `num_pages, num_links = get_hash_stats(links)`.
+    def get_hash_stats(hash)
+      num_keys = hash.keys.length
+      values = hash.values.flatten
+      num_values = sort_by_page? ? values.length : values.uniq.length
+
+      sort_by_page? ?
+        [num_keys, num_values] :
+        [num_values, num_keys]
+    end
+
+    # Prints the text. Defaults to a blank line.
+    def print(text = '')
+      @stream.print(text)
+    end
+
+    # Prints the text + \n. Defaults to a blank line.
+    def puts(text = '')
+      @stream.puts(text)
+    end
+
+    # Prints text + \n\n.
+    def putsn(text)
+      puts(text)
+      puts
+    end
+
+    # Prints \n + text + \n.
+    def nputs(text)
+      puts
+      puts(text)
+    end
+
+    alias_method :report, :call
+  end
+end
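The base class leaves `#call` to format-specific subclasses, as TextReporter (below) and HTMLReporter (above) do. As a hypothetical sketch, a third format could be added the same way (`CSVReporter` is not part of the gem):

```ruby
module BrokenLinkFinder
  # Hypothetical example subclass; emits each broken link as a CSV row.
  class CSVReporter < Reporter
    def call(broken_verbose: true, ignored_verbose: false)
      # @broken_links maps a page/link key to an Array of values,
      # and the protected #puts writes to the configured stream.
      @broken_links.each do |key, values|
        values.each { |value| puts "#{key},#{value}" }
      end

      nil
    end
  end
end
```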
data/lib/broken_link_finder/reporter/text_reporter.rb ADDED
@@ -0,0 +1,86 @@
+# frozen_string_literal: true
+
+module BrokenLinkFinder
+  class TextReporter < Reporter
+    # Creates a new TextReporter instance.
+    # stream is any Object that responds to :puts and :print.
+    def initialize(stream, sort,
+                   broken_links, ignored_links,
+                   broken_link_map, crawl_stats)
+      super
+    end
+
+    # Pretty print a report detailing the full link summary.
+    def call(broken_verbose: true, ignored_verbose: false)
+      report_crawl_summary
+      report_broken_links(verbose: broken_verbose)
+      report_ignored_links(verbose: ignored_verbose)
+
+      nil
+    end
+
+    private
+
+    # Report a summary of the overall crawl.
+    def report_crawl_summary
+      putsn format(
+        'Crawled %s (%s page(s) in %s seconds)',
+        @crawl_stats[:url],
+        @crawl_stats[:num_pages],
+        @crawl_stats[:duration]&.truncate(2)
+      )
+    end
+
+    # Report a summary of the broken links.
+    def report_broken_links(verbose: true)
+      if @broken_links.empty?
+        puts 'Good news, there are no broken links!'
+      else
+        num_pages, num_links = get_hash_stats(@broken_links)
+        puts "Found #{num_links} broken link(s) across #{num_pages} page(s):"
+
+        @broken_links.each do |key, values|
+          msg = sort_by_page? ?
+            "The following broken links were found on '#{key}':" :
+            "The broken link '#{key}' was found on the following pages:"
+          nputs msg
+
+          if verbose || (values.length <= NUM_VALUES)
+            values.each { |value| puts value }
+          else # Only print N values and summarise the rest.
+            NUM_VALUES.times { |i| puts values[i] }
+
+            objects = sort_by_page? ? 'link(s)' : 'page(s)'
+            puts "+ #{values.length - NUM_VALUES} other #{objects}, remove --concise to see them all"
+          end
+        end
+      end
+    end
+
+    # Report a summary of the ignored links.
+    def report_ignored_links(verbose: false)
+      if @ignored_links.any?
+        num_pages, num_links = get_hash_stats(@ignored_links)
+        nputs "Ignored #{num_links} unsupported link(s) across #{num_pages} page(s), which you should check manually:"
+
+        @ignored_links.each do |key, values|
+          msg = sort_by_page? ?
+            "The following links were ignored on '#{key}':" :
+            "The link '#{key}' was ignored on the following pages:"
+          nputs msg
+
+          if verbose || (values.length <= NUM_VALUES)
+            values.each { |value| puts value }
+          else # Only print N values and summarise the rest.
+            NUM_VALUES.times { |i| puts values[i] }
+
+            objects = sort_by_page? ? 'link(s)' : 'page(s)'
+            puts "+ #{values.length - NUM_VALUES} other #{objects}, use --verbose to see them all"
+          end
+        end
+      end
+    end
+
+    alias_method :report, :call
+  end
+end
data/lib/broken_link_finder/version.rb CHANGED
@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 
 module BrokenLinkFinder
-  VERSION = '0.9.5'
+  VERSION = '0.10.0'
 end
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: broken_link_finder
 version: !ruby/object:Gem::Version
-  version: 0.9.5
+  version: 0.10.0
 platform: ruby
 authors:
 - Michael Telford
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2019-11-22 00:00:00.000000000 Z
+date: 2019-11-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -159,7 +159,9 @@ files:
 - exe/broken_link_finder
 - lib/broken_link_finder.rb
 - lib/broken_link_finder/finder.rb
-- lib/broken_link_finder/reporter.rb
+- lib/broken_link_finder/reporter/html_reporter.rb
+- lib/broken_link_finder/reporter/reporter.rb
+- lib/broken_link_finder/reporter/text_reporter.rb
 - lib/broken_link_finder/version.rb
 - lib/broken_link_finder/wgit_extensions.rb
 - load.rb
@@ -187,8 +189,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
   - !ruby/object:Gem::Version
     version: '0'
 requirements: []
-rubyforge_project:
-rubygems_version: 2.7.6
+rubygems_version: 3.0.6
 signing_key:
 specification_version: 4
 summary: Finds a website's broken links and reports back to you with a summary.
data/lib/broken_link_finder/reporter.rb DELETED
@@ -1,116 +0,0 @@
-# frozen_string_literal: true
-
-module BrokenLinkFinder
-  class Reporter
-    # The amount of pages/links to display when verbose is false.
-    NUM_VALUES = 3
-
-    # Creates a new Reporter instance.
-    # stream is any Object that responds to :puts.
-    def initialize(stream, sort, broken_links, ignored_links)
-      raise 'stream must respond_to? :puts' unless stream.respond_to?(:puts)
-      raise "sort by either :page or :link, not #{sort}" \
-      unless %i[page link].include?(sort)
-
-      @stream = stream
-      @sort = sort
-      @broken_links = broken_links
-      @ignored_links = ignored_links
-    end
-
-    # Pretty print a report detailing the link summary.
-    def pretty_print_link_report(broken_verbose: true, ignored_verbose: false)
-      report_broken_links(verbose: broken_verbose)
-      report_ignored_links(verbose: ignored_verbose)
-
-      nil
-    end
-
-    private
-
-    # Report a summary of the broken links.
-    def report_broken_links(verbose: true)
-      if @broken_links.empty?
-        print 'Good news, there are no broken links!'
-      else
-        num_pages, num_links = get_hash_stats(@broken_links)
-        print "Found #{num_links} broken link(s) across #{num_pages} page(s):"
-
-        @broken_links.each do |key, values|
-          msg = sort_by_page? ?
-            "The following broken links were found on '#{key}':" :
-            "The broken link '#{key}' was found on the following pages:"
-          nprint msg
-
-          if verbose || (values.length <= NUM_VALUES)
-            values.each { |value| print value }
-          else # Only print N values and summarise the rest.
-            NUM_VALUES.times { |i| print values[i] }
-
-            objects = sort_by_page? ? 'link(s)' : 'page(s)'
-            print "+ #{values.length - NUM_VALUES} other #{objects}, remove --concise to see them all"
-          end
-        end
-      end
-    end
-
-    # Report a summary of the ignored links.
-    def report_ignored_links(verbose: false)
-      if @ignored_links.any?
-        num_pages, num_links = get_hash_stats(@ignored_links)
-        nprint "Ignored #{num_links} unsupported link(s) across #{num_pages} page(s), which you should check manually:"
-
-        @ignored_links.each do |key, values|
-          msg = sort_by_page? ?
-            "The following links were ignored on '#{key}':" :
-            "The link '#{key}' was ignored on the following pages:"
-          nprint msg
-
-          if verbose || (values.length <= NUM_VALUES)
-            values.each { |value| print value }
-          else # Only print N values and summarise the rest.
-            NUM_VALUES.times { |i| print values[i] }
-
-            objects = sort_by_page? ? 'link(s)' : 'page(s)'
-            print "+ #{values.length - NUM_VALUES} other #{objects}, use --verbose to see them all"
-          end
-        end
-      end
-    end
-
-    # Return true if the sort is by page.
-    def sort_by_page?
-      @sort == :page
-    end
-
-    # Returns the key/value statistics of hash e.g. the number of keys and
-    # combined values. The hash should be of the format: { 'str' => [...] }.
-    # Use like: `num_pages, num_links = get_hash_stats(links)`.
-    def get_hash_stats(hash)
-      num_keys = hash.keys.length
-      values = hash.values.flatten
-      num_values = sort_by_page? ? values.length : values.uniq.length
-
-      sort_by_page? ?
-        [num_keys, num_values] :
-        [num_values, num_keys]
-    end
-
-    # Prints the text + \n. Defaults to a blank line.
-    def print(text = '')
-      @stream.puts(text)
-    end
-
-    # Prints text + \n\n.
-    def printn(text)
-      print(text)
-      print
-    end
-
-    # Prints \n + text + \n.
-    def nprint(text)
-      print
-      print(text)
-    end
-  end
-end