broken_link_finder 0.9.5 → 0.10.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +11 -0
- data/Gemfile.lock +3 -3
- data/README.md +18 -14
- data/benchmark.rb +9 -5
- data/bin/console +4 -14
- data/exe/broken_link_finder +5 -2
- data/lib/broken_link_finder.rb +5 -1
- data/lib/broken_link_finder/finder.rb +85 -49
- data/lib/broken_link_finder/reporter/html_reporter.rb +134 -0
- data/lib/broken_link_finder/reporter/reporter.rb +77 -0
- data/lib/broken_link_finder/reporter/text_reporter.rb +86 -0
- data/lib/broken_link_finder/version.rb +1 -1
- metadata +6 -5
- data/lib/broken_link_finder/reporter.rb +0 -116
checksums.yaml
CHANGED
```diff
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7a53784c1bd2f75c18b3492ea782b4cc2e229a94f89afcf33b60ef633512554e
+  data.tar.gz: 393dca220b7f00d72314c93e7b877e0412afdf784fa2e563bbecb2dc6c6b29f7
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c0d304e5b0a9258265c5c084c0a6e5819c169ba8eb02b3c6317a37784a9ca12982b0fc520c3cca1060fde60126ee936708d7891c69133c5d72c9c0287a79b3f5
+  data.tar.gz: c21a4aec2c077e2617fb625debad28f746148ad98229a27a590a4412601e30759c709aa3a6e6d80e81c16160e16968fc0392181fc9c75e4da06578452f7c5ab6
```
data/CHANGELOG.md
CHANGED
```diff
@@ -9,6 +9,17 @@
 - ...
 ---
 
+## v0.10.0
+### Added
+- A `--html` flag to the `crawl` executable command which produces a HTML report (instead of text).
+- Added a 'retry' mechanism for any broken links found. This is essentially a verification step before generating a report.
+- `Finder#crawl_stats` for info such as crawl duration, total links crawled etc.
+### Changed/Removed
+- The API has changed somewhat. See the [docs](https://www.rubydoc.info/gems/broken_link_finder) for the up to date code signatures if you're using `broken_link_finder` outside of its executable.
+### Fixed
+- ...
+---
+
 ## v0.9.5
 ### Added
 - ...
```
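The new `--html` flag maps onto a `type:` keyword of `Finder#report`. Below is a minimal standalone sketch of that dispatch, assuming only what the diff in this release shows; reporter class names are returned as strings here purely so the example runs without the gem installed.

```ruby
# Sketch of the reporter dispatch behind the new --html flag in v0.10.0.
# Strings stand in for the real reporter classes (an assumption made so
# this runs without requiring the broken_link_finder gem).
def reporter_for(type)
  case type
  when :text then 'BrokenLinkFinder::TextReporter'
  when :html then 'BrokenLinkFinder::HTMLReporter'
  else raise "type: must be :text or :html, not: :#{type}"
  end
end

options     = { html: true } # As parsed by the Thor executable.
report_type = options[:html] ? :html : :text
```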
data/Gemfile.lock
CHANGED
```diff
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    broken_link_finder (0.9.5)
+    broken_link_finder (0.10.0)
       thor (~> 0.20)
       thread (~> 0.2)
       wgit (~> 0.5)
@@ -18,7 +18,7 @@ GEM
     safe_yaml (~> 1.0.0)
     ethon (0.12.0)
       ffi (>= 1.3.0)
-    ffi (1.11.
+    ffi (1.11.3)
     hashdiff (1.0.0)
     maxitest (3.4.0)
       minitest (>= 5.0.0, < 5.13.0)
@@ -65,4 +65,4 @@ RUBY VERSION
    ruby 2.5.3p105
 
 BUNDLED WITH
-   2.0.
+   2.0.2
```
data/README.md
CHANGED
````diff
@@ -57,7 +57,7 @@ Installing this gem installs the `broken_link_finder` executable into your `$PAT
 
     $ broken_link_finder crawl http://txti.es
 
-Adding the
+Adding the `--recursive` flag would crawl the entire `txti.es` site, not just its index page.
 
 See the [output](#Output) section below for an example of a site with broken links.
 
@@ -76,7 +76,7 @@ require 'broken_link_finder'
 
 finder = BrokenLinkFinder.new
 finder.crawl_site 'http://txti.es' # Or use Finder#crawl_page for a single webpage.
-finder.
+finder.report # Or use Finder#broken_links and Finder#ignored_links
               # for direct access to the link Hashes.
 ```
 
@@ -91,13 +91,15 @@ See the full source code documentation [here](https://www.rubydoc.info/gems/brok
 If broken links are found then the output will look something like:
 
 ```text
+Crawled http://txti.es (7 page(s) in 7.88 seconds)
+
 Found 6 broken link(s) across 2 page(s):
 
 The following broken links were found on 'http://txti.es/about':
 http://twitter.com/thebarrytone
+/doesntexist
 http://twitter.com/nwbld
-
-https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=84L4BDS86FBUU
+twitter.com/txties
 
 The following broken links were found on 'http://txti.es/how':
 http://en.wikipedia.org/wiki/Markdown
@@ -105,14 +107,16 @@ http://imgur.com
 
 Ignored 3 unsupported link(s) across 2 page(s), which you should check manually:
 
-The following links were ignored on http://txti.es:
+The following links were ignored on 'http://txti.es':
 tel:+13174562564
 mailto:big.jim@jmail.com
 
-The following links were ignored on http://txti.es/contact:
+The following links were ignored on 'http://txti.es/contact':
 ftp://server.com
 ```
 
+You can provide the `--html` flag if you'd prefer a HTML based report.
+
 ## Contributing
 
 Bug reports and feature requests are welcome on [GitHub](https://github.com/michaeltelford/broken-link-finder). Just raise an issue.
@@ -128,11 +132,11 @@ After checking out the repo, run `bin/setup` to install dependencies. Then, run 
 To install this gem onto your local machine, run `bundle exec rake install`.
 
 To release a new gem version:
-- Update the deps in the `*.gemspec
-- Update the version number in `version.rb` and add the new version to the `CHANGELOG
-- Run `bundle install
-- Run `bundle exec rake test` ensuring all tests pass
-- Run `bundle exec rake compile` ensuring no warnings
-- Run `bundle exec rake install && rbenv rehash
-- Manually test the executable
-- Run `bundle exec rake release[origin]
+- Update the deps in the `*.gemspec`, if necessary.
+- Update the version number in `version.rb` and add the new version to the `CHANGELOG`.
+- Run `bundle install`.
+- Run `bundle exec rake test` ensuring all tests pass.
+- Run `bundle exec rake compile` ensuring no warnings.
+- Run `bundle exec rake install && rbenv rehash`.
+- Manually test the executable.
+- Run `bundle exec rake release[origin]`.
````
data/benchmark.rb
CHANGED
```diff
@@ -10,15 +10,19 @@ finder = BrokenLinkFinder::Finder.new
 puts Benchmark.measure { finder.crawl_site url }
 puts "Links crawled: #{finder.total_links_crawled}"
 
-# http://txti.es page crawl
-# Pre
-# Post
+# http://txti.es page crawl with threading
+# Pre: 17.5 seconds
+# Post: 7.5 seconds
 
-# http://txti.es
+# http://txti.es with threading - page vs site crawl
 # Page: 9.526981
 # Site: 9.732416
 # Multi-threading crawl_site now yields the same time as a single page
 
-# Large site crawl -
+# Large site crawl - all link recording functionality
 # Pre: 608 seconds with 7665 links crawled
 # Post: 355 seconds with 1099 links crawled
+
+# Large site crawl - retry mechanism
+# Pre: 140 seconds
+# Post: 170 seconds
```
data/bin/console
CHANGED
```diff
@@ -5,20 +5,10 @@ require 'bundler/setup'
 require 'pry'
 require 'byebug'
 require 'broken_link_finder'
+require 'logger'
 
-#
-
-singleton_class.class_eval do
-  alias_method :orig_get, :get
-end
-
-def self.get(base_url, options = {})
-  puts "[typhoeus] Sending GET: #{base_url}"
-  resp = orig_get(base_url, options)
-  puts "[typhoeus] Status: #{resp.code} (#{resp.body.length} bytes in #{resp.total_time} seconds)"
-  resp
-end
-end
+# Logs all HTTP requests.
+Wgit.logger.level = Logger::DEBUG
 
 # Call reload to load all recent code changes.
 def reload
@@ -39,6 +29,6 @@ by_link = Finder.new sort: :link
 finder = by_page
 
 # Start the console.
-puts "\nbroken_link_finder v#{BrokenLinkFinder::VERSION}"
+puts "\nbroken_link_finder v#{BrokenLinkFinder::VERSION} (#{Wgit.version_str})"
 
 binding.pry
```
data/exe/broken_link_finder
CHANGED
```diff
@@ -9,12 +9,14 @@ class BrokenLinkFinderCLI < Thor
   desc 'crawl [URL]', 'Find broken links at the URL'
   option :recursive, type: :boolean, aliases: [:r], default: false, desc: 'Crawl the entire site.'
   option :threads, type: :numeric, aliases: [:t], default: BrokenLinkFinder::DEFAULT_MAX_THREADS, desc: 'Max number of threads to use when crawling recursively; 1 thread per web page.'
+  option :html, type: :boolean, aliases: [:h], default: false, desc: 'Produce a HTML report (instead of text)'
   option :sort_by_link, type: :boolean, aliases: [:l], default: false, desc: 'Makes report more concise if there are more pages crawled than broken links found. Use with -r on medium/large sites.'
   option :verbose, type: :boolean, aliases: [:v], default: false, desc: 'Display all ignored links.'
   option :concise, type: :boolean, aliases: [:c], default: false, desc: 'Display only a summary of broken links.'
   def crawl(url)
     url = "http://#{url}" unless url.start_with?('http')
 
+    report_type = options[:html] ? :html : :text
     sort_by = options[:sort_by_link] ? :link : :page
     max_threads = options[:threads]
     broken_verbose = !options[:concise]
@@ -22,8 +24,9 @@ class BrokenLinkFinderCLI < Thor
 
     finder = BrokenLinkFinder::Finder.new(sort: sort_by, max_threads: max_threads)
     options[:recursive] ? finder.crawl_site(url) : finder.crawl_page(url)
-    finder.
-
+    finder.report(
+      type: report_type,
+      broken_verbose: broken_verbose,
       ignored_verbose: ignored_verbose
     )
   rescue Exception => e
```
data/lib/broken_link_finder.rb
CHANGED
```diff
@@ -2,8 +2,12 @@
 
 require 'wgit'
 require 'wgit/core_ext'
+require 'thread/pool'
+require 'set'
 
 require_relative './broken_link_finder/wgit_extensions'
 require_relative './broken_link_finder/version'
-require_relative './broken_link_finder/reporter'
+require_relative './broken_link_finder/reporter/reporter'
+require_relative './broken_link_finder/reporter/text_reporter'
+require_relative './broken_link_finder/reporter/html_reporter'
 require_relative './broken_link_finder/finder'
```
data/lib/broken_link_finder/finder.rb
CHANGED
```diff
@@ -1,9 +1,5 @@
 # frozen_string_literal: true
 
-require_relative 'reporter'
-require 'thread/pool'
-require 'set'
-
 module BrokenLinkFinder
   DEFAULT_MAX_THREADS = 100
 
@@ -13,7 +9,7 @@ module BrokenLinkFinder
   end
 
   class Finder
-    attr_reader :sort, :
+    attr_reader :sort, :max_threads, :broken_links, :ignored_links, :crawl_stats
 
     # Creates a new Finder instance.
     def initialize(sort: :page, max_threads: BrokenLinkFinder::DEFAULT_MAX_THREADS)
@@ -25,35 +21,38 @@ module BrokenLinkFinder
       @lock = Mutex.new
       @crawler = Wgit::Crawler.new
 
-
+      reset_crawl
     end
 
     # Clear/empty the link collection Hashes.
-    def
+    def reset_crawl
       @broken_links = {}
       @ignored_links = {}
-      @
-      @
-      @
+      @all_broken_links = Set.new # Used to prevent crawling a link twice.
+      @all_intact_links = Set.new # "
+      @broken_link_map = {} # Maps a link to its absolute form.
+      @crawl_stats = {} # Records crawl stats e.g. duration etc.
     end
 
     # Finds broken links within a single page and appends them to the
     # @broken_links array. Returns true if at least one broken link was found.
     # Access the broken links afterwards with Finder#broken_links.
     def crawl_url(url)
-
+      reset_crawl
 
-
-
+      start = Time.now
+      url = url.to_url
+      doc = @crawler.crawl(url)
 
       # Ensure the given page url is valid.
       raise "Invalid or broken URL: #{url}" unless doc
 
       # Get all page links and determine which are broken.
       find_broken_links(doc)
+      retry_broken_links
 
       sort_links
-
+      set_crawl_stats(url: url, pages_crawled: [url], start: start)
 
       @broken_links.any?
     end
@@ -63,15 +62,16 @@ module BrokenLinkFinder
     # at least one broken link was found and an Array of all pages crawled.
     # Access the broken links afterwards with Finder#broken_links.
     def crawl_site(url)
-
+      reset_crawl
 
-
-
-
+      start = Time.now
+      url = url.to_url
+      pool = Thread.pool(@max_threads)
+      crawled = Set.new
 
       # Crawl the site's HTML web pages looking for links.
       externals = @crawler.crawl_site(url) do |doc|
-
+        crawled << doc.url
         next unless doc
 
         # Start a thread for each page, checking for broken links.
@@ -83,30 +83,31 @@ module BrokenLinkFinder
 
       # Wait for all threads to finish.
       pool.shutdown
+      retry_broken_links
 
       sort_links
-
+      set_crawl_stats(url: url, pages_crawled: crawled.to_a, start: start)
 
-
+      @broken_links.any?
     end
 
     # Pretty prints the link report into a stream e.g. STDOUT or a file,
     # anything that respond_to? :puts. Defaults to STDOUT.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+    def report(stream = STDOUT,
+               type: :text, broken_verbose: true, ignored_verbose: false)
+      klass = case type
+              when :text
+                BrokenLinkFinder::TextReporter
+              when :html
+                BrokenLinkFinder::HTMLReporter
+              else
+                raise "type: must be :text or :html, not: :#{type}"
+              end
+
+      reporter = klass.new(stream, @sort, @broken_links,
+                           @ignored_links, @broken_link_map, @crawl_stats)
+      reporter.call(broken_verbose: broken_verbose,
+                    ignored_verbose: ignored_verbose)
     end
 
     private
@@ -117,11 +118,11 @@ module BrokenLinkFinder
 
       # Iterate over the supported links checking if they're broken or not.
       links.each do |link|
-        #
+        # Skip if the link has been processed previously.
         next if @all_intact_links.include?(link)
 
         if @all_broken_links.include?(link)
-          append_broken_link(page.url, link)
+          append_broken_link(page.url, link) # Record on which page.
           next
         end
 
@@ -129,10 +130,8 @@ module BrokenLinkFinder
         link_doc = crawl_link(page, link)
 
         # Determine if the crawled link is broken or not.
-        if
-
-          has_broken_anchor(link_doc)
-          append_broken_link(page.url, link)
+        if link_broken?(link_doc)
+          append_broken_link(page.url, link, doc: page)
         else
           @lock.synchronize { @all_intact_links << link }
         end
@@ -141,6 +140,17 @@ module BrokenLinkFinder
       nil
     end
 
+    # Implements a retry mechanism for each of the broken links found.
+    # Removes any broken links found to be working OK.
+    def retry_broken_links
+      sleep(0.5) # Give the servers a break, then retry the links.
+
+      @broken_link_map.each do |link, href|
+        doc = @crawler.crawl(href)
+        remove_broken_link(link) unless link_broken?(doc)
+      end
+    end
+
     # Report and reject any non supported links. Any link that is absolute and
     # doesn't start with 'http' is unsupported e.g. 'mailto:blah' etc.
     def get_supported_links(doc)
@@ -153,12 +163,17 @@ module BrokenLinkFinder
       end
     end
 
-    #
+    # Make the link absolute and crawl it, returning its Wgit::Document.
     def crawl_link(doc, link)
       link = link.prefix_base(doc)
       @crawler.crawl(link)
     end
 
+    # Return if the crawled link is broken or not.
+    def link_broken?(doc)
+      doc.nil? || @crawler.last_response.not_found? || has_broken_anchor(doc)
+    end
+
     # Returns true if the link is/contains a broken anchor/fragment.
     def has_broken_anchor(doc)
       raise 'link document is nil' unless doc
@@ -170,7 +185,8 @@ module BrokenLinkFinder
     end
 
     # Append key => [value] to @broken_links.
-
+    # If doc: is provided then the link will be recorded in absolute form.
+    def append_broken_link(url, link, doc: nil)
       key, value = get_key_value(url, link)
 
       @lock.synchronize do
@@ -178,6 +194,23 @@ module BrokenLinkFinder
         @broken_links[key] << value
 
         @all_broken_links << link
+
+        @broken_link_map[link] = link.prefix_base(doc) if doc
+      end
+    end
+
+    # Remove the broken_link from the necessary collections.
+    def remove_broken_link(link)
+      @lock.synchronize do
+        if @sort == :page
+          @broken_links.each { |_k, links| links.delete(link) }
+          @broken_links.delete_if { |_k, links| links.empty? }
+        else
+          @broken_links.delete(link)
+        end
+
+        @all_broken_links.delete(link)
+        @all_intact_links << link
       end
     end
 
@@ -217,12 +250,15 @@ module BrokenLinkFinder
     end
 
     # Sets and returns the total number of links crawled.
-    def
-      @
+    def set_crawl_stats(url:, pages_crawled:, start:)
+      @crawl_stats[:url] = url
+      @crawl_stats[:pages_crawled] = pages_crawled
+      @crawl_stats[:num_pages] = pages_crawled.size
+      @crawl_stats[:num_links] = @all_broken_links.size + @all_intact_links.size
+      @crawl_stats[:duration] = Time.now - start
     end
 
-    alias crawl_page
-    alias crawl_r
-    alias pretty_print_link_summary pretty_print_link_report
+    alias crawl_page crawl_url
+    alias crawl_r crawl_site
   end
 end
```
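The retry step above prunes `@broken_links` once a link responds OK on the second attempt. This standalone sketch reproduces the pruning logic of `remove_broken_link` with the instance state passed in explicitly (a simplification for illustration; the real method also synchronises on a mutex and updates the broken/intact sets):

```ruby
# Mirrors Finder#remove_broken_link: under sort :page the hash maps a
# page to its [links], so the link is deleted from every page's list and
# now-empty pages are dropped; under sort :link the link itself is a key.
def remove_broken_link(link, broken_links, sort: :page)
  if sort == :page
    broken_links.each { |_k, links| links.delete(link) }
    broken_links.delete_if { |_k, links| links.empty? }
  else
    broken_links.delete(link)
  end

  broken_links
end

by_page = { 'http://txti.es/about' => ['/a', '/b'], 'http://txti.es/how' => ['/a'] }
remove_broken_link('/a', by_page, sort: :page)
# '/a' is gone from both pages and 'http://txti.es/how' is dropped entirely.
```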
data/lib/broken_link_finder/reporter/html_reporter.rb
ADDED
```diff
@@ -0,0 +1,134 @@
+# frozen_string_literal: true
+
+module BrokenLinkFinder
+  class HTMLReporter < Reporter
+    # Creates a new HTMLReporter instance.
+    # stream is any Object that responds to :puts and :print.
+    def initialize(stream, sort,
+                   broken_links, ignored_links,
+                   broken_link_map, crawl_stats)
+      super
+    end
+
+    # Pretty print a report detailing the full link summary.
+    def call(broken_verbose: true, ignored_verbose: false)
+      puts '<div class="broken_link_finder_report">'
+
+      report_crawl_summary
+      report_broken_links(verbose: broken_verbose)
+      report_ignored_links(verbose: ignored_verbose)
+
+      puts '</div>'
+
+      nil
+    end
+
+    private
+
+    # Report a summary of the overall crawl.
+    def report_crawl_summary
+      puts format(
+        '<p class="crawl_summary">Crawled %s (%s page(s) in %s seconds)</p>',
+        @crawl_stats[:url],
+        @crawl_stats[:num_pages],
+        @crawl_stats[:duration]&.truncate(2)
+      )
+    end
+
+    # Report a summary of the broken links.
+    def report_broken_links(verbose: true)
+      puts '<div class="broken_links">'
+
+      if @broken_links.empty?
+        puts_summary 'Good news, there are no broken links!', type: :broken
+      else
+        num_pages, num_links = get_hash_stats(@broken_links)
+        puts_summary "Found #{num_links} broken link(s) across #{num_pages} page(s):", type: :broken
+
+        @broken_links.each do |key, values|
+          puts_group(key, type: :broken) # Puts the opening <p> element.
+
+          if verbose || (values.length <= NUM_VALUES)
+            values.each { |value| puts_group_item value, type: :broken }
+          else # Only print N values and summarise the rest.
+            NUM_VALUES.times { |i| puts_group_item values[i], type: :broken }
+
+            objects = sort_by_page? ? 'link(s)' : 'page(s)'
+            puts "+ #{values.length - NUM_VALUES} other #{objects}, remove --concise to see them all<br />"
+          end
+
+          puts '</p>'
+        end
+      end
+
+      puts '</div>'
+    end
+
+    # Report a summary of the ignored links.
+    def report_ignored_links(verbose: false)
+      puts '<div class="ignored_links">'
+
+      if @ignored_links.any?
+        num_pages, num_links = get_hash_stats(@ignored_links)
+        puts_summary "Ignored #{num_links} unsupported link(s) across #{num_pages} page(s), which you should check manually:", type: :ignored
+
+        @ignored_links.each do |key, values|
+          puts_group(key, type: :ignored) # Puts the opening <p> element.
+
+          if verbose || (values.length <= NUM_VALUES)
+            values.each { |value| puts_group_item value, type: :ignored }
+          else # Only print N values and summarise the rest.
+            NUM_VALUES.times { |i| puts_group_item values[i], type: :ignored }
+
+            objects = sort_by_page? ? 'link(s)' : 'page(s)'
+            puts "+ #{values.length - NUM_VALUES} other #{objects}, use --verbose to see them all<br />"
+          end
+
+          puts '</p>'
+        end
+      end
+
+      puts '</div>'
+    end
+
+    def puts_summary(text, type:)
+      klass = (type == :broken) ? 'broken_links_summary' : 'ignored_links_summary'
+      puts "<p class=\"#{klass}\">#{text}</p>"
+    end
+
+    def puts_group(link, type:)
+      href = build_url(link)
+      a_element = "<a href=\"#{href}\">#{link}</a>"
+
+      case type
+      when :broken
+        msg = sort_by_page? ?
+          "The following broken links were found on '#{a_element}':" :
+          "The broken link '#{a_element}' was found on the following pages:"
+        klass = 'broken_links_group'
+      when :ignored
+        msg = sort_by_page? ?
+          "The following links were ignored on '#{a_element}':" :
+          "The link '#{a_element}' was ignored on the following pages:"
+        klass = 'ignored_links_group'
+      else
+        raise "type: must be :broken or :ignored, not: #{type}"
+      end
+
+      puts "<p class=\"#{klass}\">"
+      puts msg + '<br />'
+    end
+
+    def puts_group_item(value, type:)
+      klass = (type == :broken) ? 'broken_links_group_item' : 'ignored_links_group_item'
+      puts "<a class=\"#{klass}\" href=\"#{build_url(value)}\">#{value}</a><br />"
+    end
+
+    def build_url(link)
+      return link if link.to_url.absolute?
+      @broken_link_map.fetch(link)
+    end
+
+    alias_method :report, :call
+  end
+end
```
data/lib/broken_link_finder/reporter/reporter.rb
ADDED
```diff
@@ -0,0 +1,77 @@
+# frozen_string_literal: true
+
+module BrokenLinkFinder
+  # Generic reporter class to be inherited from by format specific reporters.
+  class Reporter
+    # The amount of pages/links to display when verbose is false.
+    NUM_VALUES = 3
+
+    # Creates a new Reporter instance.
+    # stream is any Object that responds to :puts and :print.
+    def initialize(stream, sort,
+                   broken_links, ignored_links,
+                   broken_link_map, crawl_stats)
+      unless stream.respond_to?(:puts) && stream.respond_to?(:print)
+        raise 'stream must respond_to? :puts and :print'
+      end
+      raise "sort by either :page or :link, not #{sort}" \
+      unless %i[page link].include?(sort)
+
+      @stream = stream
+      @sort = sort
+      @broken_links = broken_links
+      @ignored_links = ignored_links
+      @broken_link_map = broken_link_map
+      @crawl_stats = crawl_stats
+    end
+
+    # Pretty print a report detailing the full link summary.
+    def call(broken_verbose: true, ignored_verbose: false)
+      raise 'Not implemented by parent class'
+    end
+
+    protected
+
+    # Return true if the sort is by page.
+    def sort_by_page?
+      @sort == :page
+    end
+
+    # Returns the key/value statistics of hash e.g. the number of keys and
+    # combined values. The hash should be of the format: { 'str' => [...] }.
+    # Use like: `num_pages, num_links = get_hash_stats(links)`.
+    def get_hash_stats(hash)
+      num_keys = hash.keys.length
+      values = hash.values.flatten
+      num_values = sort_by_page? ? values.length : values.uniq.length
+
+      sort_by_page? ?
+        [num_keys, num_values] :
+        [num_values, num_keys]
+    end
+
+    # Prints the text. Defaults to a blank line.
+    def print(text = '')
+      @stream.print(text)
+    end
+
+    # Prints the text + \n. Defaults to a blank line.
+    def puts(text = '')
+      @stream.puts(text)
+    end
+
+    # Prints text + \n\n.
+    def putsn(text)
+      puts(text)
+      puts
+    end
+
+    # Prints \n + text + \n.
+    def nputs(text)
+      puts
+      puts(text)
+    end
+
+    alias_method :report, :call
+  end
+end
```
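`get_hash_stats` above is the one piece of shared arithmetic in the reporter hierarchy: the same hash shape serves both sort modes, and the result tuple flips so callers always destructure `[num_pages, num_links]`. A minimal standalone sketch, with the sort passed as a parameter instead of instance state:

```ruby
# Mirrors Reporter#get_hash_stats: keys are pages (sort :page) or links
# (sort :link); values are the grouped counterparts. Links repeated across
# pages count once per page under :page, but pages are uniq'd under :link.
def get_hash_stats(hash, sort: :page)
  num_keys   = hash.keys.length
  values     = hash.values.flatten
  num_values = sort == :page ? values.length : values.uniq.length

  sort == :page ? [num_keys, num_values] : [num_values, num_keys]
end

broken = { 'http://txti.es/about' => ['/a', '/b'], 'http://txti.es/how' => ['/a'] }
num_pages, num_links = get_hash_stats(broken, sort: :page)
# num_pages == 2, num_links == 3
```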
data/lib/broken_link_finder/reporter/text_reporter.rb
ADDED
```diff
@@ -0,0 +1,86 @@
+# frozen_string_literal: true
+
+module BrokenLinkFinder
+  class TextReporter < Reporter
+    # Creates a new TextReporter instance.
+    # stream is any Object that responds to :puts and :print.
+    def initialize(stream, sort,
+                   broken_links, ignored_links,
+                   broken_link_map, crawl_stats)
+      super
+    end
+
+    # Pretty print a report detailing the full link summary.
+    def call(broken_verbose: true, ignored_verbose: false)
+      report_crawl_summary
+      report_broken_links(verbose: broken_verbose)
+      report_ignored_links(verbose: ignored_verbose)
+
+      nil
+    end
+
+    private
+
+    # Report a summary of the overall crawl.
+    def report_crawl_summary
+      putsn format(
+        'Crawled %s (%s page(s) in %s seconds)',
+        @crawl_stats[:url],
+        @crawl_stats[:num_pages],
+        @crawl_stats[:duration]&.truncate(2)
+      )
+    end
+
+    # Report a summary of the broken links.
+    def report_broken_links(verbose: true)
+      if @broken_links.empty?
+        puts 'Good news, there are no broken links!'
+      else
+        num_pages, num_links = get_hash_stats(@broken_links)
+        puts "Found #{num_links} broken link(s) across #{num_pages} page(s):"
+
+        @broken_links.each do |key, values|
+          msg = sort_by_page? ?
+            "The following broken links were found on '#{key}':" :
+            "The broken link '#{key}' was found on the following pages:"
+          nputs msg
+
+          if verbose || (values.length <= NUM_VALUES)
+            values.each { |value| puts value }
+          else # Only print N values and summarise the rest.
+            NUM_VALUES.times { |i| puts values[i] }
+
+            objects = sort_by_page? ? 'link(s)' : 'page(s)'
+            puts "+ #{values.length - NUM_VALUES} other #{objects}, remove --concise to see them all"
+          end
+        end
+      end
+    end
+
+    # Report a summary of the ignored links.
+    def report_ignored_links(verbose: false)
+      if @ignored_links.any?
+        num_pages, num_links = get_hash_stats(@ignored_links)
+        nputs "Ignored #{num_links} unsupported link(s) across #{num_pages} page(s), which you should check manually:"
+
+        @ignored_links.each do |key, values|
+          msg = sort_by_page? ?
+            "The following links were ignored on '#{key}':" :
+            "The link '#{key}' was ignored on the following pages:"
+          nputs msg
+
+          if verbose || (values.length <= NUM_VALUES)
+            values.each { |value| puts value }
+          else # Only print N values and summarise the rest.
+            NUM_VALUES.times { |i| puts values[i] }
+
+            objects = sort_by_page? ? 'link(s)' : 'page(s)'
+            puts "+ #{values.length - NUM_VALUES} other #{objects}, use --verbose to see them all"
+          end
+        end
+      end
+    end
+
+    alias_method :report, :call
+  end
+end
```
metadata
CHANGED
```diff
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: broken_link_finder
 version: !ruby/object:Gem::Version
-  version: 0.9.5
+  version: 0.10.0
 platform: ruby
 authors:
 - Michael Telford
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2019-11-
+date: 2019-11-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -159,7 +159,9 @@ files:
 - exe/broken_link_finder
 - lib/broken_link_finder.rb
 - lib/broken_link_finder/finder.rb
-- lib/broken_link_finder/reporter.rb
+- lib/broken_link_finder/reporter/html_reporter.rb
+- lib/broken_link_finder/reporter/reporter.rb
+- lib/broken_link_finder/reporter/text_reporter.rb
 - lib/broken_link_finder/version.rb
 - lib/broken_link_finder/wgit_extensions.rb
 - load.rb
@@ -187,8 +189,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 - !ruby/object:Gem::Version
   version: '0'
 requirements: []
-
-rubygems_version: 2.7.6
+rubygems_version: 3.0.6
 signing_key:
 specification_version: 4
 summary: Finds a website's broken links and reports back to you with a summary.
```
data/lib/broken_link_finder/reporter.rb
DELETED
```diff
@@ -1,116 +0,0 @@
-# frozen_string_literal: true
-
-module BrokenLinkFinder
-  class Reporter
-    # The amount of pages/links to display when verbose is false.
-    NUM_VALUES = 3
-
-    # Creates a new Reporter instance.
-    # stream is any Object that responds to :puts.
-    def initialize(stream, sort, broken_links, ignored_links)
-      raise 'stream must respond_to? :puts' unless stream.respond_to?(:puts)
-      raise "sort by either :page or :link, not #{sort}" \
-      unless %i[page link].include?(sort)
-
-      @stream = stream
-      @sort = sort
-      @broken_links = broken_links
-      @ignored_links = ignored_links
-    end
-
-    # Pretty print a report detailing the link summary.
-    def pretty_print_link_report(broken_verbose: true, ignored_verbose: false)
-      report_broken_links(verbose: broken_verbose)
-      report_ignored_links(verbose: ignored_verbose)
-
-      nil
-    end
-
-    private
-
-    # Report a summary of the broken links.
-    def report_broken_links(verbose: true)
-      if @broken_links.empty?
-        print 'Good news, there are no broken links!'
-      else
-        num_pages, num_links = get_hash_stats(@broken_links)
-        print "Found #{num_links} broken link(s) across #{num_pages} page(s):"
-
-        @broken_links.each do |key, values|
-          msg = sort_by_page? ?
-            "The following broken links were found on '#{key}':" :
-            "The broken link '#{key}' was found on the following pages:"
-          nprint msg
-
-          if verbose || (values.length <= NUM_VALUES)
-            values.each { |value| print value }
-          else # Only print N values and summarise the rest.
-            NUM_VALUES.times { |i| print values[i] }
-
-            objects = sort_by_page? ? 'link(s)' : 'page(s)'
-            print "+ #{values.length - NUM_VALUES} other #{objects}, remove --concise to see them all"
-          end
-        end
-      end
-    end
-
-    # Report a summary of the ignored links.
-    def report_ignored_links(verbose: false)
-      if @ignored_links.any?
-        num_pages, num_links = get_hash_stats(@ignored_links)
-        nprint "Ignored #{num_links} unsupported link(s) across #{num_pages} page(s), which you should check manually:"
-
-        @ignored_links.each do |key, values|
-          msg = sort_by_page? ?
-            "The following links were ignored on '#{key}':" :
-            "The link '#{key}' was ignored on the following pages:"
-          nprint msg
-
-          if verbose || (values.length <= NUM_VALUES)
-            values.each { |value| print value }
-          else # Only print N values and summarise the rest.
-            NUM_VALUES.times { |i| print values[i] }
-
-            objects = sort_by_page? ? 'link(s)' : 'page(s)'
-            print "+ #{values.length - NUM_VALUES} other #{objects}, use --verbose to see them all"
-          end
-        end
-      end
-    end
-
-    # Return true if the sort is by page.
-    def sort_by_page?
-      @sort == :page
-    end
-
-    # Returns the key/value statistics of hash e.g. the number of keys and
-    # combined values. The hash should be of the format: { 'str' => [...] }.
-    # Use like: `num_pages, num_links = get_hash_stats(links)`.
-    def get_hash_stats(hash)
-      num_keys = hash.keys.length
-      values = hash.values.flatten
-      num_values = sort_by_page? ? values.length : values.uniq.length
-
-      sort_by_page? ?
-        [num_keys, num_values] :
-        [num_values, num_keys]
-    end
-
-    # Prints the text + \n. Defaults to a blank line.
-    def print(text = '')
-      @stream.puts(text)
-    end
-
-    # Prints text + \n\n.
-    def printn(text)
-      print(text)
-      print
-    end
-
-    # Prints \n + text + \n.
-    def nprint(text)
-      print
-      print(text)
-    end
-  end
-end
```