broken_link_finder 0.9.5 → 0.10.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 4faeb020322d2882bb43c777d0743f964c364261f33bf62712f7e366f1230762
4
- data.tar.gz: 133e61eb8880de6464c2537372d9ac4e82fbdfa36e113e190a4afa141fd77b0d
3
+ metadata.gz: 7a53784c1bd2f75c18b3492ea782b4cc2e229a94f89afcf33b60ef633512554e
4
+ data.tar.gz: 393dca220b7f00d72314c93e7b877e0412afdf784fa2e563bbecb2dc6c6b29f7
5
5
  SHA512:
6
- metadata.gz: 69dc743ce7965125e5c0f5edff817c83f558ff954660a405ef838f7b05437217ca5d287e8d0aa789265b74b0e05e035488ebfe7604a5a1cc92dba67caa331e25
7
- data.tar.gz: 667a2341c12d7b39475391e258827b6b6bd425d141e9349c6c0111b7871432376f91247c1673d8980520afd8d7d4865e38c3cb3f6d3d51a4ae56af8ed617206d
6
+ metadata.gz: c0d304e5b0a9258265c5c084c0a6e5819c169ba8eb02b3c6317a37784a9ca12982b0fc520c3cca1060fde60126ee936708d7891c69133c5d72c9c0287a79b3f5
7
+ data.tar.gz: c21a4aec2c077e2617fb625debad28f746148ad98229a27a590a4412601e30759c709aa3a6e6d80e81c16160e16968fc0392181fc9c75e4da06578452f7c5ab6
data/CHANGELOG.md CHANGED
@@ -9,6 +9,17 @@
9
9
  - ...
10
10
  ---
11
11
 
12
+ ## v0.10.0
13
+ ### Added
14
+ - A `--html` flag for the `crawl` executable command, which produces an HTML report (instead of text).
15
+ - A retry mechanism for any broken links found, acting as a verification step before the report is generated.
16
+ - `Finder#crawl_stats` for information such as the crawl duration, total links crawled, etc.
17
+ ### Changed/Removed
18
+ - The API has changed somewhat. See the [docs](https://www.rubydoc.info/gems/broken_link_finder) for the up-to-date method signatures if you're using `broken_link_finder` outside of its executable.
19
+ ### Fixed
20
+ - ...
21
+ ---
22
+
12
23
  ## v0.9.5
13
24
  ### Added
14
25
  - ...
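For context, the additions above amount to roughly the following usage, sketched from the `finder.rb` changes further down this diff (the URL is just the README's example):

```ruby
require 'broken_link_finder'

finder = BrokenLinkFinder::Finder.new
finder.crawl_site 'http://txti.es' # Or #crawl_page for a single page.

# New in v0.10.0: an HTML report instead of the default text one.
finder.report(type: :html)

# New in v0.10.0: crawl statistics such as the duration and link count.
puts finder.crawl_stats[:duration]
puts finder.crawl_stats[:num_links]
```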
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- broken_link_finder (0.9.5)
4
+ broken_link_finder (0.10.0)
5
5
  thor (~> 0.20)
6
6
  thread (~> 0.2)
7
7
  wgit (~> 0.5)
@@ -18,7 +18,7 @@ GEM
18
18
  safe_yaml (~> 1.0.0)
19
19
  ethon (0.12.0)
20
20
  ffi (>= 1.3.0)
21
- ffi (1.11.2)
21
+ ffi (1.11.3)
22
22
  hashdiff (1.0.0)
23
23
  maxitest (3.4.0)
24
24
  minitest (>= 5.0.0, < 5.13.0)
@@ -65,4 +65,4 @@ RUBY VERSION
65
65
  ruby 2.5.3p105
66
66
 
67
67
  BUNDLED WITH
68
- 2.0.1
68
+ 2.0.2
data/README.md CHANGED
@@ -57,7 +57,7 @@ Installing this gem installs the `broken_link_finder` executable into your `$PAT
57
57
 
58
58
  $ broken_link_finder crawl http://txti.es
59
59
 
60
- Adding the `-r` flag would crawl the entire `txti.es` site, not just its index page.
60
+ Adding the `--recursive` flag would crawl the entire `txti.es` site, not just its index page.
61
61
 
62
62
  See the [output](#Output) section below for an example of a site with broken links.
63
63
 
@@ -76,7 +76,7 @@ require 'broken_link_finder'
76
76
 
77
77
  finder = BrokenLinkFinder.new
78
78
  finder.crawl_site 'http://txti.es' # Or use Finder#crawl_page for a single webpage.
79
- finder.pretty_print_link_report # Or use Finder#broken_links and Finder#ignored_links
79
+ finder.report # Or use Finder#broken_links and Finder#ignored_links
80
80
  # for direct access to the link Hashes.
81
81
  ```
82
82
 
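As a rough illustration of the direct Hash access mentioned in the snippet above: with `sort: :link` the broken links Hash maps each broken link to the pages it was found on (a sketch only, reusing the README's example URL):

```ruby
require 'broken_link_finder'

# sort: :link keys the Hash by broken link, with its pages as the values.
finder = BrokenLinkFinder::Finder.new(sort: :link)
finder.crawl_site 'http://txti.es'

finder.broken_links.each do |link, pages|
  puts "#{link} is broken on #{pages.length} page(s)"
end
```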
@@ -91,13 +91,15 @@ See the full source code documentation [here](https://www.rubydoc.info/gems/brok
91
91
  If broken links are found then the output will look something like:
92
92
 
93
93
  ```text
94
+ Crawled http://txti.es (7 page(s) in 7.88 seconds)
95
+
94
96
  Found 6 broken link(s) across 2 page(s):
95
97
 
96
98
  The following broken links were found on 'http://txti.es/about':
97
99
  http://twitter.com/thebarrytone
100
+ /doesntexist
98
101
  http://twitter.com/nwbld
99
- http://twitter.com/txties
100
- https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=84L4BDS86FBUU
102
+ twitter.com/txties
101
103
 
102
104
  The following broken links were found on 'http://txti.es/how':
103
105
  http://en.wikipedia.org/wiki/Markdown
@@ -105,14 +107,16 @@ http://imgur.com
105
107
 
106
108
  Ignored 3 unsupported link(s) across 2 page(s), which you should check manually:
107
109
 
108
- The following links were ignored on http://txti.es:
110
+ The following links were ignored on 'http://txti.es':
109
111
  tel:+13174562564
110
112
  mailto:big.jim@jmail.com
111
113
 
112
- The following links were ignored on http://txti.es/contact:
114
+ The following links were ignored on 'http://txti.es/contact':
113
115
  ftp://server.com
114
116
  ```
115
117
 
118
+ You can provide the `--html` flag if you'd prefer an HTML-based report.
119
+
116
120
  ## Contributing
117
121
 
118
122
  Bug reports and feature requests are welcome on [GitHub](https://github.com/michaeltelford/broken-link-finder). Just raise an issue.
@@ -128,11 +132,11 @@ After checking out the repo, run `bin/setup` to install dependencies. Then, run
128
132
  To install this gem onto your local machine, run `bundle exec rake install`.
129
133
 
130
134
  To release a new gem version:
131
- - Update the deps in the `*.gemspec` if necessary
132
- - Update the version number in `version.rb` and add the new version to the `CHANGELOG`
133
- - Run `bundle install`
134
- - Run `bundle exec rake test` ensuring all tests pass
135
- - Run `bundle exec rake compile` ensuring no warnings
136
- - Run `bundle exec rake install && rbenv rehash`
137
- - Manually test the executable
138
- - Run `bundle exec rake release[origin]`
135
+ - Update the deps in the `*.gemspec`, if necessary.
136
+ - Update the version number in `version.rb` and add the new version to the `CHANGELOG`.
137
+ - Run `bundle install`.
138
+ - Run `bundle exec rake test` ensuring all tests pass.
139
+ - Run `bundle exec rake compile` ensuring no warnings.
140
+ - Run `bundle exec rake install && rbenv rehash`.
141
+ - Manually test the executable.
142
+ - Run `bundle exec rake release[origin]`.
@@ -10,15 +10,19 @@ finder = BrokenLinkFinder::Finder.new
10
10
  puts Benchmark.measure { finder.crawl_site url }
11
11
  puts "Links crawled: #{finder.total_links_crawled}"
12
12
 
13
- # http://txti.es page crawl
14
- # Pre threading: 17.5 seconds
15
- # Post threading: 7.5 seconds
13
+ # http://txti.es page crawl with threading
14
+ # Pre: 17.5 seconds
15
+ # Post: 7.5 seconds
16
16
 
17
- # http://txti.es post threading - page vs site crawl
17
+ # http://txti.es with threading - page vs site crawl
18
18
  # Page: 9.526981
19
19
  # Site: 9.732416
20
20
  # Multi-threading crawl_site now yields the same time as a single page
21
21
 
22
- # Large site crawl - post all link recording functionality
22
+ # Large site crawl - all link recording functionality
23
23
  # Pre: 608 seconds with 7665 links crawled
24
24
  # Post: 355 seconds with 1099 links crawled
25
+
26
+ # Large site crawl - retry mechanism
27
+ # Pre: 140 seconds
28
+ # Post: 170 seconds
@@ -5,20 +5,10 @@ require 'bundler/setup'
5
5
  require 'pry'
6
6
  require 'byebug'
7
7
  require 'broken_link_finder'
8
+ require 'logger'
8
9
 
9
- # Monkey patch and log all HTTP requests made during the console.
10
- module Typhoeus
11
- singleton_class.class_eval do
12
- alias_method :orig_get, :get
13
- end
14
-
15
- def self.get(base_url, options = {})
16
- puts "[typhoeus] Sending GET: #{base_url}"
17
- resp = orig_get(base_url, options)
18
- puts "[typhoeus] Status: #{resp.code} (#{resp.body.length} bytes in #{resp.total_time} seconds)"
19
- resp
20
- end
21
- end
10
+ # Logs all HTTP requests.
11
+ Wgit.logger.level = Logger::DEBUG
22
12
 
23
13
  # Call reload to load all recent code changes.
24
14
  def reload
@@ -39,6 +29,6 @@ by_link = Finder.new sort: :link
39
29
  finder = by_page
40
30
 
41
31
  # Start the console.
42
- puts "\nbroken_link_finder v#{BrokenLinkFinder::VERSION}"
32
+ puts "\nbroken_link_finder v#{BrokenLinkFinder::VERSION} (#{Wgit.version_str})"
43
33
 
44
34
  binding.pry
data/exe/broken_link_finder CHANGED
@@ -9,12 +9,14 @@ class BrokenLinkFinderCLI < Thor
9
9
  desc 'crawl [URL]', 'Find broken links at the URL'
10
10
  option :recursive, type: :boolean, aliases: [:r], default: false, desc: 'Crawl the entire site.'
11
11
  option :threads, type: :numeric, aliases: [:t], default: BrokenLinkFinder::DEFAULT_MAX_THREADS, desc: 'Max number of threads to use when crawling recursively; 1 thread per web page.'
12
+ option :html, type: :boolean, aliases: [:h], default: false, desc: 'Produce a HTML report (instead of text)'
12
13
  option :sort_by_link, type: :boolean, aliases: [:l], default: false, desc: 'Makes report more concise if there are more pages crawled than broken links found. Use with -r on medium/large sites.'
13
14
  option :verbose, type: :boolean, aliases: [:v], default: false, desc: 'Display all ignored links.'
14
15
  option :concise, type: :boolean, aliases: [:c], default: false, desc: 'Display only a summary of broken links.'
15
16
  def crawl(url)
16
17
  url = "http://#{url}" unless url.start_with?('http')
17
18
 
19
+ report_type = options[:html] ? :html : :text
18
20
  sort_by = options[:sort_by_link] ? :link : :page
19
21
  max_threads = options[:threads]
20
22
  broken_verbose = !options[:concise]
@@ -22,8 +24,9 @@ class BrokenLinkFinderCLI < Thor
22
24
 
23
25
  finder = BrokenLinkFinder::Finder.new(sort: sort_by, max_threads: max_threads)
24
26
  options[:recursive] ? finder.crawl_site(url) : finder.crawl_page(url)
25
- finder.pretty_print_link_report(
26
- broken_verbose: broken_verbose,
27
+ finder.report(
28
+ type: report_type,
29
+ broken_verbose: broken_verbose,
27
30
  ignored_verbose: ignored_verbose
28
31
  )
29
32
  rescue Exception => e
data/lib/broken_link_finder.rb CHANGED
@@ -2,8 +2,12 @@
2
2
 
3
3
  require 'wgit'
4
4
  require 'wgit/core_ext'
5
+ require 'thread/pool'
6
+ require 'set'
5
7
 
6
8
  require_relative './broken_link_finder/wgit_extensions'
7
9
  require_relative './broken_link_finder/version'
8
- require_relative './broken_link_finder/reporter'
10
+ require_relative './broken_link_finder/reporter/reporter'
11
+ require_relative './broken_link_finder/reporter/text_reporter'
12
+ require_relative './broken_link_finder/reporter/html_reporter'
9
13
  require_relative './broken_link_finder/finder'
data/lib/broken_link_finder/finder.rb CHANGED
@@ -1,9 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
- require_relative 'reporter'
4
- require 'thread/pool'
5
- require 'set'
6
-
7
3
  module BrokenLinkFinder
8
4
  DEFAULT_MAX_THREADS = 100
9
5
 
@@ -13,7 +9,7 @@ module BrokenLinkFinder
13
9
  end
14
10
 
15
11
  class Finder
16
- attr_reader :sort, :broken_links, :ignored_links, :total_links_crawled, :max_threads
12
+ attr_reader :sort, :max_threads, :broken_links, :ignored_links, :crawl_stats
17
13
 
18
14
  # Creates a new Finder instance.
19
15
  def initialize(sort: :page, max_threads: BrokenLinkFinder::DEFAULT_MAX_THREADS)
@@ -25,35 +21,38 @@ module BrokenLinkFinder
25
21
  @lock = Mutex.new
26
22
  @crawler = Wgit::Crawler.new
27
23
 
28
- clear_links
24
+ reset_crawl
29
25
  end
30
26
 
31
27
  # Clear/empty the link collection Hashes.
32
- def clear_links
28
+ def reset_crawl
33
29
  @broken_links = {}
34
30
  @ignored_links = {}
35
- @total_links_crawled = 0
36
- @all_broken_links = Set.new
37
- @all_intact_links = Set.new
31
+ @all_broken_links = Set.new # Used to prevent crawling a link twice.
32
+ @all_intact_links = Set.new # "
33
+ @broken_link_map = {} # Maps a link to its absolute form.
34
+ @crawl_stats = {} # Records crawl stats e.g. duration etc.
38
35
  end
39
36
 
40
37
  # Finds broken links within a single page and appends them to the
41
38
  # @broken_links array. Returns true if at least one broken link was found.
42
39
  # Access the broken links afterwards with Finder#broken_links.
43
40
  def crawl_url(url)
44
- clear_links
41
+ reset_crawl
45
42
 
46
- url = url.to_url
47
- doc = @crawler.crawl(url)
43
+ start = Time.now
44
+ url = url.to_url
45
+ doc = @crawler.crawl(url)
48
46
 
49
47
  # Ensure the given page url is valid.
50
48
  raise "Invalid or broken URL: #{url}" unless doc
51
49
 
52
50
  # Get all page links and determine which are broken.
53
51
  find_broken_links(doc)
52
+ retry_broken_links
54
53
 
55
54
  sort_links
56
- set_total_links_crawled
55
+ set_crawl_stats(url: url, pages_crawled: [url], start: start)
57
56
 
58
57
  @broken_links.any?
59
58
  end
@@ -63,15 +62,16 @@ module BrokenLinkFinder
63
62
  # at least one broken link was found and an Array of all pages crawled.
64
63
  # Access the broken links afterwards with Finder#broken_links.
65
64
  def crawl_site(url)
66
- clear_links
65
+ reset_crawl
67
66
 
68
- url = url.to_url
69
- pool = Thread.pool(@max_threads)
70
- crawled_pages = []
67
+ start = Time.now
68
+ url = url.to_url
69
+ pool = Thread.pool(@max_threads)
70
+ crawled = Set.new
71
71
 
72
72
  # Crawl the site's HTML web pages looking for links.
73
73
  externals = @crawler.crawl_site(url) do |doc|
74
- crawled_pages << doc.url
74
+ crawled << doc.url
75
75
  next unless doc
76
76
 
77
77
  # Start a thread for each page, checking for broken links.
@@ -83,30 +83,31 @@ module BrokenLinkFinder
83
83
 
84
84
  # Wait for all threads to finish.
85
85
  pool.shutdown
86
+ retry_broken_links
86
87
 
87
88
  sort_links
88
- set_total_links_crawled
89
+ set_crawl_stats(url: url, pages_crawled: crawled.to_a, start: start)
89
90
 
90
- [@broken_links.any?, crawled_pages.uniq]
91
+ @broken_links.any?
91
92
  end
92
93
 
93
94
  # Pretty prints the link report into a stream e.g. STDOUT or a file,
94
95
  # anything that respond_to? :puts. Defaults to STDOUT.
95
- # Returns true if there were broken links and vice versa.
96
- def pretty_print_link_report(
97
- stream = STDOUT,
98
- broken_verbose: true,
99
- ignored_verbose: false
100
- )
101
- reporter = BrokenLinkFinder::Reporter.new(
102
- stream, @sort, @broken_links, @ignored_links
103
- )
104
- reporter.pretty_print_link_report(
105
- broken_verbose: broken_verbose,
106
- ignored_verbose: ignored_verbose
107
- )
108
-
109
- @broken_links.any?
96
+ def report(stream = STDOUT,
97
+ type: :text, broken_verbose: true, ignored_verbose: false)
98
+ klass = case type
99
+ when :text
100
+ BrokenLinkFinder::TextReporter
101
+ when :html
102
+ BrokenLinkFinder::HTMLReporter
103
+ else
104
+ raise "type: must be :text or :html, not: :#{type}"
105
+ end
106
+
107
+ reporter = klass.new(stream, @sort, @broken_links,
108
+ @ignored_links, @broken_link_map, @crawl_stats)
109
+ reporter.call(broken_verbose: broken_verbose,
110
+ ignored_verbose: ignored_verbose)
110
111
  end
111
112
 
112
113
  private
@@ -117,11 +118,11 @@ module BrokenLinkFinder
117
118
 
118
119
  # Iterate over the supported links checking if they're broken or not.
119
120
  links.each do |link|
120
- # Check if the link has already been processed previously.
121
+ # Skip if the link has been processed previously.
121
122
  next if @all_intact_links.include?(link)
122
123
 
123
124
  if @all_broken_links.include?(link)
124
- append_broken_link(page.url, link)
125
+ append_broken_link(page.url, link) # Record on which page.
125
126
  next
126
127
  end
127
128
 
@@ -129,10 +130,8 @@ module BrokenLinkFinder
129
130
  link_doc = crawl_link(page, link)
130
131
 
131
132
  # Determine if the crawled link is broken or not.
132
- if link_doc.nil? ||
133
- @crawler.last_response.not_found? ||
134
- has_broken_anchor(link_doc)
135
- append_broken_link(page.url, link)
133
+ if link_broken?(link_doc)
134
+ append_broken_link(page.url, link, doc: page)
136
135
  else
137
136
  @lock.synchronize { @all_intact_links << link }
138
137
  end
@@ -141,6 +140,17 @@ module BrokenLinkFinder
141
140
  nil
142
141
  end
143
142
 
143
+ # Implements a retry mechanism for each of the broken links found.
144
+ # Removes any broken links found to be working OK.
145
+ def retry_broken_links
146
+ sleep(0.5) # Give the servers a break, then retry the links.
147
+
148
+ @broken_link_map.each do |link, href|
149
+ doc = @crawler.crawl(href)
150
+ remove_broken_link(link) unless link_broken?(doc)
151
+ end
152
+ end
153
+
144
154
  # Report and reject any non supported links. Any link that is absolute and
145
155
  # doesn't start with 'http' is unsupported e.g. 'mailto:blah' etc.
146
156
  def get_supported_links(doc)
@@ -153,12 +163,17 @@ module BrokenLinkFinder
153
163
  end
154
164
  end
155
165
 
156
- # Makes the link absolute and crawls it, returning its Wgit::Document.
166
+ # Make the link absolute and crawl it, returning its Wgit::Document.
157
167
  def crawl_link(doc, link)
158
168
  link = link.prefix_base(doc)
159
169
  @crawler.crawl(link)
160
170
  end
161
171
 
172
+ # Return if the crawled link is broken or not.
173
+ def link_broken?(doc)
174
+ doc.nil? || @crawler.last_response.not_found? || has_broken_anchor(doc)
175
+ end
176
+
162
177
  # Returns true if the link is/contains a broken anchor/fragment.
163
178
  def has_broken_anchor(doc)
164
179
  raise 'link document is nil' unless doc
@@ -170,7 +185,8 @@ module BrokenLinkFinder
170
185
  end
171
186
 
172
187
  # Append key => [value] to @broken_links.
173
- def append_broken_link(url, link)
188
+ # If doc: is provided then the link will be recorded in absolute form.
189
+ def append_broken_link(url, link, doc: nil)
174
190
  key, value = get_key_value(url, link)
175
191
 
176
192
  @lock.synchronize do
@@ -178,6 +194,23 @@ module BrokenLinkFinder
178
194
  @broken_links[key] << value
179
195
 
180
196
  @all_broken_links << link
197
+
198
+ @broken_link_map[link] = link.prefix_base(doc) if doc
199
+ end
200
+ end
201
+
202
+ # Remove the broken_link from the necessary collections.
203
+ def remove_broken_link(link)
204
+ @lock.synchronize do
205
+ if @sort == :page
206
+ @broken_links.each { |_k, links| links.delete(link) }
207
+ @broken_links.delete_if { |_k, links| links.empty? }
208
+ else
209
+ @broken_links.delete(link)
210
+ end
211
+
212
+ @all_broken_links.delete(link)
213
+ @all_intact_links << link
181
214
  end
182
215
  end
183
216
 
@@ -217,12 +250,15 @@ module BrokenLinkFinder
217
250
  end
218
251
 
219
252
  # Sets and returns the total number of links crawled.
220
- def set_total_links_crawled
221
- @total_links_crawled = @all_broken_links.size + @all_intact_links.size
253
+ def set_crawl_stats(url:, pages_crawled:, start:)
254
+ @crawl_stats[:url] = url
255
+ @crawl_stats[:pages_crawled] = pages_crawled
256
+ @crawl_stats[:num_pages] = pages_crawled.size
257
+ @crawl_stats[:num_links] = @all_broken_links.size + @all_intact_links.size
258
+ @crawl_stats[:duration] = Time.now - start
222
259
  end
223
260
 
224
- alias crawl_page crawl_url
225
- alias crawl_r crawl_site
226
- alias pretty_print_link_summary pretty_print_link_report
261
+ alias crawl_page crawl_url
262
+ alias crawl_r crawl_site
227
263
  end
228
264
  end
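Since `set_crawl_stats` above fixes the Hash keys, reading the new `#crawl_stats` attribute looks roughly like this (a sketch; the URL is the README's example):

```ruby
require 'broken_link_finder'

finder = BrokenLinkFinder::Finder.new
finder.crawl_site 'http://txti.es'

stats = finder.crawl_stats
stats[:url]           # The crawled URL.
stats[:pages_crawled] # Array of page URLs crawled.
stats[:num_pages]     # Number of pages crawled.
stats[:num_links]     # Total links crawled (broken + intact).
stats[:duration]      # Crawl duration in seconds.
```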
data/lib/broken_link_finder/reporter/html_reporter.rb ADDED
@@ -0,0 +1,134 @@
1
+ # frozen_string_literal: true
2
+
3
+ module BrokenLinkFinder
4
+ class HTMLReporter < Reporter
5
+ # Creates a new HTMLReporter instance.
6
+ # stream is any Object that responds to :puts and :print.
7
+ def initialize(stream, sort,
8
+ broken_links, ignored_links,
9
+ broken_link_map, crawl_stats)
10
+ super
11
+ end
12
+
13
+ # Pretty print a report detailing the full link summary.
14
+ def call(broken_verbose: true, ignored_verbose: false)
15
+ puts '<div class="broken_link_finder_report">'
16
+
17
+ report_crawl_summary
18
+ report_broken_links(verbose: broken_verbose)
19
+ report_ignored_links(verbose: ignored_verbose)
20
+
21
+ puts '</div>'
22
+
23
+ nil
24
+ end
25
+
26
+ private
27
+
28
+ # Report a summary of the overall crawl.
29
+ def report_crawl_summary
30
+ puts format(
31
+ '<p class="crawl_summary">Crawled %s (%s page(s) in %s seconds)</p>',
32
+ @crawl_stats[:url],
33
+ @crawl_stats[:num_pages],
34
+ @crawl_stats[:duration]&.truncate(2)
35
+ )
36
+ end
37
+
38
+ # Report a summary of the broken links.
39
+ def report_broken_links(verbose: true)
40
+ puts '<div class="broken_links">'
41
+
42
+ if @broken_links.empty?
43
+ puts_summary 'Good news, there are no broken links!', type: :broken
44
+ else
45
+ num_pages, num_links = get_hash_stats(@broken_links)
46
+ puts_summary "Found #{num_links} broken link(s) across #{num_pages} page(s):", type: :broken
47
+
48
+ @broken_links.each do |key, values|
49
+ puts_group(key, type: :broken) # Puts the opening <p> element.
50
+
51
+ if verbose || (values.length <= NUM_VALUES)
52
+ values.each { |value| puts_group_item value, type: :broken }
53
+ else # Only print N values and summarise the rest.
54
+ NUM_VALUES.times { |i| puts_group_item values[i], type: :broken }
55
+
56
+ objects = sort_by_page? ? 'link(s)' : 'page(s)'
57
+ puts "+ #{values.length - NUM_VALUES} other #{objects}, remove --concise to see them all<br />"
58
+ end
59
+
60
+ puts '</p>'
61
+ end
62
+ end
63
+
64
+ puts '</div>'
65
+ end
66
+
67
+ # Report a summary of the ignored links.
68
+ def report_ignored_links(verbose: false)
69
+ puts '<div class="ignored_links">'
70
+
71
+ if @ignored_links.any?
72
+ num_pages, num_links = get_hash_stats(@ignored_links)
73
+ puts_summary "Ignored #{num_links} unsupported link(s) across #{num_pages} page(s), which you should check manually:", type: :ignored
74
+
75
+ @ignored_links.each do |key, values|
76
+ puts_group(key, type: :ignored) # Puts the opening <p> element.
77
+
78
+ if verbose || (values.length <= NUM_VALUES)
79
+ values.each { |value| puts_group_item value, type: :ignored }
80
+ else # Only print N values and summarise the rest.
81
+ NUM_VALUES.times { |i| puts_group_item values[i], type: :ignored }
82
+
83
+ objects = sort_by_page? ? 'link(s)' : 'page(s)'
84
+ puts "+ #{values.length - NUM_VALUES} other #{objects}, use --verbose to see them all<br />"
85
+ end
86
+
87
+ puts '</p>'
88
+ end
89
+ end
90
+
91
+ puts '</div>'
92
+ end
93
+
94
+ def puts_summary(text, type:)
95
+ klass = (type == :broken) ? 'broken_links_summary' : 'ignored_links_summary'
96
+ puts "<p class=\"#{klass}\">#{text}</p>"
97
+ end
98
+
99
+ def puts_group(link, type:)
100
+ href = build_url(link)
101
+ a_element = "<a href=\"#{href}\">#{link}</a>"
102
+
103
+ case type
104
+ when :broken
105
+ msg = sort_by_page? ?
106
+ "The following broken links were found on '#{a_element}':" :
107
+ "The broken link '#{a_element}' was found on the following pages:"
108
+ klass = 'broken_links_group'
109
+ when :ignored
110
+ msg = sort_by_page? ?
111
+ "The following links were ignored on '#{a_element}':" :
112
+ "The link '#{a_element}' was ignored on the following pages:"
113
+ klass = 'ignored_links_group'
114
+ else
115
+ raise "type: must be :broken or :ignored, not: #{type}"
116
+ end
117
+
118
+ puts "<p class=\"#{klass}\">"
119
+ puts msg + '<br />'
120
+ end
121
+
122
+ def puts_group_item(value, type:)
123
+ klass = (type == :broken) ? 'broken_links_group_item' : 'ignored_links_group_item'
124
+ puts "<a class=\"#{klass}\" href=\"#{build_url(value)}\">#{value}</a><br />"
125
+ end
126
+
127
+ def build_url(link)
128
+ return link if link.to_url.absolute?
129
+ @broken_link_map.fetch(link)
130
+ end
131
+
132
+ alias_method :report, :call
133
+ end
134
+ end
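Because a reporter only needs a stream that responds to `:puts` and `:print`, the HTML report can be captured in memory instead of printed; a rough sketch using Ruby's standard `StringIO`:

```ruby
require 'stringio'
require 'broken_link_finder'

finder = BrokenLinkFinder::Finder.new
finder.crawl_site 'http://txti.es'

# StringIO responds to :puts and :print, so it satisfies the stream contract.
buffer = StringIO.new
finder.report(buffer, type: :html)

html = buffer.string # e.g. embed the report fragment in a page or an email.
```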
data/lib/broken_link_finder/reporter/reporter.rb ADDED
@@ -0,0 +1,77 @@
1
+ # frozen_string_literal: true
2
+
3
+ module BrokenLinkFinder
4
+ # Generic reporter class to be inherited from by format specific reporters.
5
+ class Reporter
6
+ # The amount of pages/links to display when verbose is false.
7
+ NUM_VALUES = 3
8
+
9
+ # Creates a new Reporter instance.
10
+ # stream is any Object that responds to :puts and :print.
11
+ def initialize(stream, sort,
12
+ broken_links, ignored_links,
13
+ broken_link_map, crawl_stats)
14
+ unless stream.respond_to?(:puts) && stream.respond_to?(:print)
15
+ raise 'stream must respond_to? :puts and :print'
16
+ end
17
+ raise "sort by either :page or :link, not #{sort}" \
18
+ unless %i[page link].include?(sort)
19
+
20
+ @stream = stream
21
+ @sort = sort
22
+ @broken_links = broken_links
23
+ @ignored_links = ignored_links
24
+ @broken_link_map = broken_link_map
25
+ @crawl_stats = crawl_stats
26
+ end
27
+
28
+ # Pretty print a report detailing the full link summary.
29
+ def call(broken_verbose: true, ignored_verbose: false)
30
+ raise 'Not implemented by parent class'
31
+ end
32
+
33
+ protected
34
+
35
+ # Return true if the sort is by page.
36
+ def sort_by_page?
37
+ @sort == :page
38
+ end
39
+
40
+ # Returns the key/value statistics of hash e.g. the number of keys and
41
+ # combined values. The hash should be of the format: { 'str' => [...] }.
42
+ # Use like: `num_pages, num_links = get_hash_stats(links)`.
43
+ def get_hash_stats(hash)
44
+ num_keys = hash.keys.length
45
+ values = hash.values.flatten
46
+ num_values = sort_by_page? ? values.length : values.uniq.length
47
+
48
+ sort_by_page? ?
49
+ [num_keys, num_values] :
50
+ [num_values, num_keys]
51
+ end
52
+
53
+ # Prints the text. Defaults to a blank line.
54
+ def print(text = '')
55
+ @stream.print(text)
56
+ end
57
+
58
+ # Prints the text + \n. Defaults to a blank line.
59
+ def puts(text = '')
60
+ @stream.puts(text)
61
+ end
62
+
63
+ # Prints text + \n\n.
64
+ def putsn(text)
65
+ puts(text)
66
+ puts
67
+ end
68
+
69
+ # Prints \n + text + \n.
70
+ def nputs(text)
71
+ puts
72
+ puts(text)
73
+ end
74
+
75
+ alias_method :report, :call
76
+ end
77
+ end
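The base class leaves `#call` to subclasses, so other formats can be added by inheriting from it. A purely hypothetical CSV reporter, sketched here for illustration (it is not part of the gem):

```ruby
module BrokenLinkFinder
  class CSVReporter < Reporter
    # Emit one "page,broken_link" (or "link,page") row per entry,
    # using the protected #puts helper to write to the stream.
    def call(broken_verbose: true, ignored_verbose: false)
      @broken_links.each do |key, values|
        values.each { |value| puts "#{key},#{value}" }
      end

      nil
    end
  end
end

# Manual usage (Finder#report only dispatches to :text and :html):
# CSVReporter.new(STDOUT, :page, finder.broken_links, finder.ignored_links,
#                 {}, finder.crawl_stats).call
```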
data/lib/broken_link_finder/reporter/text_reporter.rb ADDED
@@ -0,0 +1,86 @@
1
+ # frozen_string_literal: true
2
+
3
+ module BrokenLinkFinder
4
+ class TextReporter < Reporter
5
+ # Creates a new TextReporter instance.
6
+ # stream is any Object that responds to :puts and :print.
7
+ def initialize(stream, sort,
8
+ broken_links, ignored_links,
9
+ broken_link_map, crawl_stats)
10
+ super
11
+ end
12
+
13
+ # Pretty print a report detailing the full link summary.
14
+ def call(broken_verbose: true, ignored_verbose: false)
15
+ report_crawl_summary
16
+ report_broken_links(verbose: broken_verbose)
17
+ report_ignored_links(verbose: ignored_verbose)
18
+
19
+ nil
20
+ end
21
+
22
+ private
23
+
24
+ # Report a summary of the overall crawl.
25
+ def report_crawl_summary
26
+ putsn format(
27
+ 'Crawled %s (%s page(s) in %s seconds)',
28
+ @crawl_stats[:url],
29
+ @crawl_stats[:num_pages],
30
+ @crawl_stats[:duration]&.truncate(2)
31
+ )
32
+ end
33
+
34
+ # Report a summary of the broken links.
35
+ def report_broken_links(verbose: true)
36
+ if @broken_links.empty?
37
+ puts 'Good news, there are no broken links!'
38
+ else
39
+ num_pages, num_links = get_hash_stats(@broken_links)
40
+ puts "Found #{num_links} broken link(s) across #{num_pages} page(s):"
41
+
42
+ @broken_links.each do |key, values|
43
+ msg = sort_by_page? ?
44
+ "The following broken links were found on '#{key}':" :
45
+ "The broken link '#{key}' was found on the following pages:"
46
+ nputs msg
47
+
48
+ if verbose || (values.length <= NUM_VALUES)
49
+ values.each { |value| puts value }
50
+ else # Only print N values and summarise the rest.
51
+ NUM_VALUES.times { |i| puts values[i] }
52
+
53
+ objects = sort_by_page? ? 'link(s)' : 'page(s)'
54
+ puts "+ #{values.length - NUM_VALUES} other #{objects}, remove --concise to see them all"
55
+ end
56
+ end
57
+ end
58
+ end
59
+
60
+ # Report a summary of the ignored links.
61
+ def report_ignored_links(verbose: false)
62
+ if @ignored_links.any?
63
+ num_pages, num_links = get_hash_stats(@ignored_links)
64
+ nputs "Ignored #{num_links} unsupported link(s) across #{num_pages} page(s), which you should check manually:"
65
+
66
+ @ignored_links.each do |key, values|
67
+ msg = sort_by_page? ?
68
+ "The following links were ignored on '#{key}':" :
69
+ "The link '#{key}' was ignored on the following pages:"
70
+ nputs msg
71
+
72
+ if verbose || (values.length <= NUM_VALUES)
73
+ values.each { |value| puts value }
74
+ else # Only print N values and summarise the rest.
75
+ NUM_VALUES.times { |i| puts values[i] }
76
+
77
+ objects = sort_by_page? ? 'link(s)' : 'page(s)'
78
+ puts "+ #{values.length - NUM_VALUES} other #{objects}, use --verbose to see them all"
79
+ end
80
+ end
81
+ end
82
+ end
83
+
84
+ alias_method :report, :call
85
+ end
86
+ end
data/lib/broken_link_finder/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module BrokenLinkFinder
4
- VERSION = '0.9.5'
4
+ VERSION = '0.10.0'
5
5
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: broken_link_finder
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.9.5
4
+ version: 0.10.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Michael Telford
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2019-11-22 00:00:00.000000000 Z
11
+ date: 2019-11-28 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: bundler
@@ -159,7 +159,9 @@ files:
159
159
  - exe/broken_link_finder
160
160
  - lib/broken_link_finder.rb
161
161
  - lib/broken_link_finder/finder.rb
162
- - lib/broken_link_finder/reporter.rb
162
+ - lib/broken_link_finder/reporter/html_reporter.rb
163
+ - lib/broken_link_finder/reporter/reporter.rb
164
+ - lib/broken_link_finder/reporter/text_reporter.rb
163
165
  - lib/broken_link_finder/version.rb
164
166
  - lib/broken_link_finder/wgit_extensions.rb
165
167
  - load.rb
@@ -187,8 +189,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
187
189
  - !ruby/object:Gem::Version
188
190
  version: '0'
189
191
  requirements: []
190
- rubyforge_project:
191
- rubygems_version: 2.7.6
192
+ rubygems_version: 3.0.6
192
193
  signing_key:
193
194
  specification_version: 4
194
195
  summary: Finds a website's broken links and reports back to you with a summary.
data/lib/broken_link_finder/reporter.rb DELETED
@@ -1,116 +0,0 @@
1
- # frozen_string_literal: true
2
-
3
- module BrokenLinkFinder
4
- class Reporter
5
- # The amount of pages/links to display when verbose is false.
6
- NUM_VALUES = 3
7
-
8
- # Creates a new Reporter instance.
9
- # stream is any Object that responds to :puts.
10
- def initialize(stream, sort, broken_links, ignored_links)
11
- raise 'stream must respond_to? :puts' unless stream.respond_to?(:puts)
12
- raise "sort by either :page or :link, not #{sort}" \
13
- unless %i[page link].include?(sort)
14
-
15
- @stream = stream
16
- @sort = sort
17
- @broken_links = broken_links
18
- @ignored_links = ignored_links
19
- end
20
-
21
- # Pretty print a report detailing the link summary.
22
- def pretty_print_link_report(broken_verbose: true, ignored_verbose: false)
23
- report_broken_links(verbose: broken_verbose)
24
- report_ignored_links(verbose: ignored_verbose)
25
-
26
- nil
27
- end
28
-
29
- private
30
-
31
- # Report a summary of the broken links.
32
- def report_broken_links(verbose: true)
33
- if @broken_links.empty?
34
- print 'Good news, there are no broken links!'
35
- else
36
- num_pages, num_links = get_hash_stats(@broken_links)
37
- print "Found #{num_links} broken link(s) across #{num_pages} page(s):"
38
-
39
- @broken_links.each do |key, values|
40
- msg = sort_by_page? ?
41
- "The following broken links were found on '#{key}':" :
42
- "The broken link '#{key}' was found on the following pages:"
43
- nprint msg
44
-
45
- if verbose || (values.length <= NUM_VALUES)
46
- values.each { |value| print value }
47
- else # Only print N values and summarise the rest.
48
- NUM_VALUES.times { |i| print values[i] }
49
-
50
- objects = sort_by_page? ? 'link(s)' : 'page(s)'
51
- print "+ #{values.length - NUM_VALUES} other #{objects}, remove --concise to see them all"
52
- end
53
- end
54
- end
55
- end
56
-
57
- # Report a summary of the ignored links.
58
- def report_ignored_links(verbose: false)
59
- if @ignored_links.any?
60
- num_pages, num_links = get_hash_stats(@ignored_links)
61
- nprint "Ignored #{num_links} unsupported link(s) across #{num_pages} page(s), which you should check manually:"
62
-
63
- @ignored_links.each do |key, values|
64
- msg = sort_by_page? ?
65
- "The following links were ignored on '#{key}':" :
66
- "The link '#{key}' was ignored on the following pages:"
67
- nprint msg
68
-
69
- if verbose || (values.length <= NUM_VALUES)
70
- values.each { |value| print value }
71
- else # Only print N values and summarise the rest.
72
- NUM_VALUES.times { |i| print values[i] }
73
-
74
- objects = sort_by_page? ? 'link(s)' : 'page(s)'
75
- print "+ #{values.length - NUM_VALUES} other #{objects}, use --verbose to see them all"
76
- end
77
- end
78
- end
79
- end
80
-
81
- # Return true if the sort is by page.
82
- def sort_by_page?
83
- @sort == :page
84
- end
85
-
86
- # Returns the key/value statistics of hash e.g. the number of keys and
87
- # combined values. The hash should be of the format: { 'str' => [...] }.
88
- # Use like: `num_pages, num_links = get_hash_stats(links)`.
89
- def get_hash_stats(hash)
90
- num_keys = hash.keys.length
91
- values = hash.values.flatten
92
- num_values = sort_by_page? ? values.length : values.uniq.length
93
-
94
- sort_by_page? ?
95
- [num_keys, num_values] :
96
- [num_values, num_keys]
97
- end
98
-
99
- # Prints the text + \n. Defaults to a blank line.
100
- def print(text = '')
101
- @stream.puts(text)
102
- end
103
-
104
- # Prints text + \n\n.
105
- def printn(text)
106
- print(text)
107
- print
108
- end
109
-
110
- # Prints \n + text + \n.
111
- def nprint(text)
112
- print
113
- print(text)
114
- end
115
- end
116
- end