broken_link_finder 0.9.5 → 0.10.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +11 -0
- data/Gemfile.lock +3 -3
- data/README.md +18 -14
- data/benchmark.rb +9 -5
- data/bin/console +4 -14
- data/exe/broken_link_finder +5 -2
- data/lib/broken_link_finder.rb +5 -1
- data/lib/broken_link_finder/finder.rb +85 -49
- data/lib/broken_link_finder/reporter/html_reporter.rb +134 -0
- data/lib/broken_link_finder/reporter/reporter.rb +77 -0
- data/lib/broken_link_finder/reporter/text_reporter.rb +86 -0
- data/lib/broken_link_finder/version.rb +1 -1
- metadata +6 -5
- data/lib/broken_link_finder/reporter.rb +0 -116
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7a53784c1bd2f75c18b3492ea782b4cc2e229a94f89afcf33b60ef633512554e
+  data.tar.gz: 393dca220b7f00d72314c93e7b877e0412afdf784fa2e563bbecb2dc6c6b29f7
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c0d304e5b0a9258265c5c084c0a6e5819c169ba8eb02b3c6317a37784a9ca12982b0fc520c3cca1060fde60126ee936708d7891c69133c5d72c9c0287a79b3f5
+  data.tar.gz: c21a4aec2c077e2617fb625debad28f746148ad98229a27a590a4412601e30759c709aa3a6e6d80e81c16160e16968fc0392181fc9c75e4da06578452f7c5ab6
data/CHANGELOG.md
CHANGED
@@ -9,6 +9,17 @@
 - ...
 ---
 
+## v0.10.0
+### Added
+- A `--html` flag to the `crawl` executable command which produces a HTML report (instead of text).
+- Added a 'retry' mechanism for any broken links found. This is essentially a verification step before generating a report.
+- `Finder#crawl_stats` for info such as crawl duration, total links crawled etc.
+### Changed/Removed
+- The API has changed somewhat. See the [docs](https://www.rubydoc.info/gems/broken_link_finder) for the up to date code signatures if you're using `broken_link_finder` outside of its executable.
+### Fixed
+- ...
+---
+
 ## v0.9.5
 ### Added
 - ...
data/Gemfile.lock
CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    broken_link_finder (0.9.5)
+    broken_link_finder (0.10.0)
       thor (~> 0.20)
       thread (~> 0.2)
       wgit (~> 0.5)
@@ -18,7 +18,7 @@ GEM
       safe_yaml (~> 1.0.0)
     ethon (0.12.0)
       ffi (>= 1.3.0)
-    ffi (1.11.
+    ffi (1.11.3)
     hashdiff (1.0.0)
     maxitest (3.4.0)
       minitest (>= 5.0.0, < 5.13.0)
@@ -65,4 +65,4 @@ RUBY VERSION
    ruby 2.5.3p105
 
 BUNDLED WITH
-   2.0.
+   2.0.2
data/README.md
CHANGED
@@ -57,7 +57,7 @@ Installing this gem installs the `broken_link_finder` executable into your `$PAT
 
     $ broken_link_finder crawl http://txti.es
 
-Adding the
+Adding the `--recursive` flag would crawl the entire `txti.es` site, not just its index page.
 
 See the [output](#Output) section below for an example of a site with broken links.
 
@@ -76,7 +76,7 @@ require 'broken_link_finder'
 
 finder = BrokenLinkFinder.new
 finder.crawl_site 'http://txti.es' # Or use Finder#crawl_page for a single webpage.
-finder.
+finder.report # Or use Finder#broken_links and Finder#ignored_links
               # for direct access to the link Hashes.
 ```
 
@@ -91,13 +91,15 @@ See the full source code documentation [here](https://www.rubydoc.info/gems/brok
 If broken links are found then the output will look something like:
 
 ```text
+Crawled http://txti.es (7 page(s) in 7.88 seconds)
+
 Found 6 broken link(s) across 2 page(s):
 
 The following broken links were found on 'http://txti.es/about':
 http://twitter.com/thebarrytone
+/doesntexist
 http://twitter.com/nwbld
-
-https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=84L4BDS86FBUU
+twitter.com/txties
 
 The following broken links were found on 'http://txti.es/how':
 http://en.wikipedia.org/wiki/Markdown
@@ -105,14 +107,16 @@ http://imgur.com
 
 Ignored 3 unsupported link(s) across 2 page(s), which you should check manually:
 
-The following links were ignored on http://txti.es:
+The following links were ignored on 'http://txti.es':
 tel:+13174562564
 mailto:big.jim@jmail.com
 
-The following links were ignored on http://txti.es/contact:
+The following links were ignored on 'http://txti.es/contact':
 ftp://server.com
 ```
 
+You can provide the `--html` flag if you'd prefer a HTML based report.
+
 ## Contributing
 
 Bug reports and feature requests are welcome on [GitHub](https://github.com/michaeltelford/broken-link-finder). Just raise an issue.
@@ -128,11 +132,11 @@ After checking out the repo, run `bin/setup` to install dependencies. Then, run
 To install this gem onto your local machine, run `bundle exec rake install`.
 
 To release a new gem version:
-- Update the deps in the `*.gemspec
-- Update the version number in `version.rb` and add the new version to the `CHANGELOG
-- Run `bundle install
-- Run `bundle exec rake test` ensuring all tests pass
-- Run `bundle exec rake compile` ensuring no warnings
-- Run `bundle exec rake install && rbenv rehash
-- Manually test the executable
-- Run `bundle exec rake release[origin]
+- Update the deps in the `*.gemspec`, if necessary.
+- Update the version number in `version.rb` and add the new version to the `CHANGELOG`.
+- Run `bundle install`.
+- Run `bundle exec rake test` ensuring all tests pass.
+- Run `bundle exec rake compile` ensuring no warnings.
+- Run `bundle exec rake install && rbenv rehash`.
+- Manually test the executable.
+- Run `bundle exec rake release[origin]`.
data/benchmark.rb
CHANGED
@@ -10,15 +10,19 @@ finder = BrokenLinkFinder::Finder.new
 puts Benchmark.measure { finder.crawl_site url }
 puts "Links crawled: #{finder.total_links_crawled}"
 
-# http://txti.es page crawl
-# Pre
-# Post
+# http://txti.es page crawl with threading
+# Pre: 17.5 seconds
+# Post: 7.5 seconds
 
-# http://txti.es
+# http://txti.es with threading - page vs site crawl
 # Page: 9.526981
 # Site: 9.732416
 # Multi-threading crawl_site now yields the same time as a single page
 
-# Large site crawl -
+# Large site crawl - all link recording functionality
 # Pre: 608 seconds with 7665 links crawled
 # Post: 355 seconds with 1099 links crawled
+
+# Large site crawl - retry mechanism
+# Pre: 140 seconds
+# Post: 170 seconds
data/bin/console
CHANGED
@@ -5,20 +5,10 @@ require 'bundler/setup'
 require 'pry'
 require 'byebug'
 require 'broken_link_finder'
+require 'logger'
 
-#
-
-singleton_class.class_eval do
-  alias_method :orig_get, :get
-end
-
-def self.get(base_url, options = {})
-  puts "[typhoeus] Sending GET: #{base_url}"
-  resp = orig_get(base_url, options)
-  puts "[typhoeus] Status: #{resp.code} (#{resp.body.length} bytes in #{resp.total_time} seconds)"
-  resp
-end
-end
+# Logs all HTTP requests.
+Wgit.logger.level = Logger::DEBUG
 
 # Call reload to load all recent code changes.
 def reload
@@ -39,6 +29,6 @@ by_link = Finder.new sort: :link
 finder = by_page
 
 # Start the console.
-puts "\nbroken_link_finder v#{BrokenLinkFinder::VERSION}"
+puts "\nbroken_link_finder v#{BrokenLinkFinder::VERSION} (#{Wgit.version_str})"
 
 binding.pry
data/exe/broken_link_finder
CHANGED
@@ -9,12 +9,14 @@ class BrokenLinkFinderCLI < Thor
   desc 'crawl [URL]', 'Find broken links at the URL'
   option :recursive, type: :boolean, aliases: [:r], default: false, desc: 'Crawl the entire site.'
   option :threads, type: :numeric, aliases: [:t], default: BrokenLinkFinder::DEFAULT_MAX_THREADS, desc: 'Max number of threads to use when crawling recursively; 1 thread per web page.'
+  option :html, type: :boolean, aliases: [:h], default: false, desc: 'Produce a HTML report (instead of text)'
   option :sort_by_link, type: :boolean, aliases: [:l], default: false, desc: 'Makes report more concise if there are more pages crawled than broken links found. Use with -r on medium/large sites.'
   option :verbose, type: :boolean, aliases: [:v], default: false, desc: 'Display all ignored links.'
   option :concise, type: :boolean, aliases: [:c], default: false, desc: 'Display only a summary of broken links.'
   def crawl(url)
     url = "http://#{url}" unless url.start_with?('http')
 
+    report_type = options[:html] ? :html : :text
     sort_by = options[:sort_by_link] ? :link : :page
     max_threads = options[:threads]
     broken_verbose = !options[:concise]
@@ -22,8 +24,9 @@ class BrokenLinkFinderCLI < Thor
 
     finder = BrokenLinkFinder::Finder.new(sort: sort_by, max_threads: max_threads)
     options[:recursive] ? finder.crawl_site(url) : finder.crawl_page(url)
-    finder.
-
+    finder.report(
+      type: report_type,
+      broken_verbose: broken_verbose,
       ignored_verbose: ignored_verbose
     )
   rescue Exception => e
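The `--html` flag only feeds a ternary inside `crawl`, alongside the existing option lookups. A minimal standalone sketch of that mapping (the `options` Hash stands in for Thor's parsed options; the values here are hypothetical):

```ruby
# Stand-in for Thor's parsed options Hash (values are made up for illustration).
options = { html: true, sort_by_link: false, concise: false, verbose: false }

# The same option-to-setting mapping the crawl command performs.
report_type     = options[:html] ? :html : :text
sort_by         = options[:sort_by_link] ? :link : :page
broken_verbose  = !options[:concise]
ignored_verbose = options[:verbose]

puts [report_type, sort_by, broken_verbose, ignored_verbose].inspect
# => [:html, :page, true, false]
```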
data/lib/broken_link_finder.rb
CHANGED
@@ -2,8 +2,12 @@
 
 require 'wgit'
 require 'wgit/core_ext'
+require 'thread/pool'
+require 'set'
 
 require_relative './broken_link_finder/wgit_extensions'
 require_relative './broken_link_finder/version'
-require_relative './broken_link_finder/reporter'
+require_relative './broken_link_finder/reporter/reporter'
+require_relative './broken_link_finder/reporter/text_reporter'
+require_relative './broken_link_finder/reporter/html_reporter'
 require_relative './broken_link_finder/finder'
data/lib/broken_link_finder/finder.rb
CHANGED
@@ -1,9 +1,5 @@
 # frozen_string_literal: true
 
-require_relative 'reporter'
-require 'thread/pool'
-require 'set'
-
 module BrokenLinkFinder
   DEFAULT_MAX_THREADS = 100
 
@@ -13,7 +9,7 @@ module BrokenLinkFinder
   end
 
   class Finder
-    attr_reader :sort, :
+    attr_reader :sort, :max_threads, :broken_links, :ignored_links, :crawl_stats
 
     # Creates a new Finder instance.
     def initialize(sort: :page, max_threads: BrokenLinkFinder::DEFAULT_MAX_THREADS)
@@ -25,35 +21,38 @@ module BrokenLinkFinder
       @lock = Mutex.new
       @crawler = Wgit::Crawler.new
 
-
+      reset_crawl
     end
 
     # Clear/empty the link collection Hashes.
-    def
+    def reset_crawl
       @broken_links = {}
       @ignored_links = {}
-      @
-      @
-      @
+      @all_broken_links = Set.new # Used to prevent crawling a link twice.
+      @all_intact_links = Set.new # "
+      @broken_link_map = {} # Maps a link to its absolute form.
+      @crawl_stats = {} # Records crawl stats e.g. duration etc.
     end
 
     # Finds broken links within a single page and appends them to the
     # @broken_links array. Returns true if at least one broken link was found.
     # Access the broken links afterwards with Finder#broken_links.
     def crawl_url(url)
-
+      reset_crawl
 
-
-
+      start = Time.now
+      url = url.to_url
+      doc = @crawler.crawl(url)
 
       # Ensure the given page url is valid.
       raise "Invalid or broken URL: #{url}" unless doc
 
       # Get all page links and determine which are broken.
       find_broken_links(doc)
+      retry_broken_links
 
       sort_links
-
+      set_crawl_stats(url: url, pages_crawled: [url], start: start)
 
       @broken_links.any?
     end
@@ -63,15 +62,16 @@ module BrokenLinkFinder
     # at least one broken link was found and an Array of all pages crawled.
     # Access the broken links afterwards with Finder#broken_links.
     def crawl_site(url)
-
+      reset_crawl
 
-
-
-
+      start = Time.now
+      url = url.to_url
+      pool = Thread.pool(@max_threads)
+      crawled = Set.new
 
       # Crawl the site's HTML web pages looking for links.
       externals = @crawler.crawl_site(url) do |doc|
-
+        crawled << doc.url
         next unless doc
 
         # Start a thread for each page, checking for broken links.
@@ -83,30 +83,31 @@ module BrokenLinkFinder
 
       # Wait for all threads to finish.
       pool.shutdown
+      retry_broken_links
 
       sort_links
-
+      set_crawl_stats(url: url, pages_crawled: crawled.to_a, start: start)
 
-
+      @broken_links.any?
     end
 
     # Pretty prints the link report into a stream e.g. STDOUT or a file,
     # anything that respond_to? :puts. Defaults to STDOUT.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+    def report(stream = STDOUT,
+               type: :text, broken_verbose: true, ignored_verbose: false)
+      klass = case type
+              when :text
+                BrokenLinkFinder::TextReporter
+              when :html
+                BrokenLinkFinder::HTMLReporter
+              else
+                raise "type: must be :text or :html, not: :#{type}"
+              end
+
+      reporter = klass.new(stream, @sort, @broken_links,
+                           @ignored_links, @broken_link_map, @crawl_stats)
+      reporter.call(broken_verbose: broken_verbose,
+                    ignored_verbose: ignored_verbose)
     end
 
     private
@@ -117,11 +118,11 @@ module BrokenLinkFinder
 
       # Iterate over the supported links checking if they're broken or not.
       links.each do |link|
-        #
+        # Skip if the link has been processed previously.
         next if @all_intact_links.include?(link)
 
         if @all_broken_links.include?(link)
-          append_broken_link(page.url, link)
+          append_broken_link(page.url, link) # Record on which page.
          next
         end
 
@@ -129,10 +130,8 @@ module BrokenLinkFinder
         link_doc = crawl_link(page, link)
 
         # Determine if the crawled link is broken or not.
-        if
-
-           has_broken_anchor(link_doc)
-          append_broken_link(page.url, link)
+        if link_broken?(link_doc)
+          append_broken_link(page.url, link, doc: page)
         else
           @lock.synchronize { @all_intact_links << link }
         end
@@ -141,6 +140,17 @@ module BrokenLinkFinder
       nil
     end
 
+    # Implements a retry mechanism for each of the broken links found.
+    # Removes any broken links found to be working OK.
+    def retry_broken_links
+      sleep(0.5) # Give the servers a break, then retry the links.
+
+      @broken_link_map.each do |link, href|
+        doc = @crawler.crawl(href)
+        remove_broken_link(link) unless link_broken?(doc)
+      end
+    end
+
     # Report and reject any non supported links. Any link that is absolute and
     # doesn't start with 'http' is unsupported e.g. 'mailto:blah' etc.
     def get_supported_links(doc)
@@ -153,12 +163,17 @@ module BrokenLinkFinder
       end
     end
 
-    #
+    # Make the link absolute and crawl it, returning its Wgit::Document.
     def crawl_link(doc, link)
       link = link.prefix_base(doc)
       @crawler.crawl(link)
     end
 
+    # Return if the crawled link is broken or not.
+    def link_broken?(doc)
+      doc.nil? || @crawler.last_response.not_found? || has_broken_anchor(doc)
+    end
+
     # Returns true if the link is/contains a broken anchor/fragment.
     def has_broken_anchor(doc)
       raise 'link document is nil' unless doc
@@ -170,7 +185,8 @@ module BrokenLinkFinder
     end
 
     # Append key => [value] to @broken_links.
-
+    # If doc: is provided then the link will be recorded in absolute form.
+    def append_broken_link(url, link, doc: nil)
       key, value = get_key_value(url, link)
 
       @lock.synchronize do
@@ -178,6 +194,23 @@ module BrokenLinkFinder
         @broken_links[key] << value
 
         @all_broken_links << link
+
+        @broken_link_map[link] = link.prefix_base(doc) if doc
+      end
+    end
+
+    # Remove the broken_link from the necessary collections.
+    def remove_broken_link(link)
+      @lock.synchronize do
+        if @sort == :page
+          @broken_links.each { |_k, links| links.delete(link) }
+          @broken_links.delete_if { |_k, links| links.empty? }
+        else
+          @broken_links.delete(link)
+        end
+
+        @all_broken_links.delete(link)
+        @all_intact_links << link
       end
     end
 
@@ -217,12 +250,15 @@ module BrokenLinkFinder
     end
 
     # Sets and returns the total number of links crawled.
-    def
-      @
+    def set_crawl_stats(url:, pages_crawled:, start:)
+      @crawl_stats[:url] = url
+      @crawl_stats[:pages_crawled] = pages_crawled
+      @crawl_stats[:num_pages] = pages_crawled.size
+      @crawl_stats[:num_links] = @all_broken_links.size + @all_intact_links.size
+      @crawl_stats[:duration] = Time.now - start
     end
 
-    alias crawl_page
-    alias crawl_r
-    alias pretty_print_link_summary pretty_print_link_report
+    alias crawl_page crawl_url
+    alias crawl_r crawl_site
   end
 end
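The reworked `Finder#report` selects a reporter class from the `type:` keyword and delegates to it. A runnable sketch of that dispatch shape, using trivial stand-in classes rather than the gem's real `TextReporter`/`HTMLReporter`:

```ruby
require 'stringio'

# Stand-in reporters for illustration only (the real ones take more arguments).
class TextReporter
  def initialize(stream); @stream = stream; end
  def call; @stream.puts('text report'); end
end

class HTMLReporter
  def initialize(stream); @stream = stream; end
  def call; @stream.puts('<div>html report</div>'); end
end

# Mirrors the shape of the new Finder#report: type: picks the reporter class,
# an unknown type raises, and the chosen reporter writes to the given stream.
def report(stream, type: :text)
  klass = case type
          when :text then TextReporter
          when :html then HTMLReporter
          else raise "type: must be :text or :html, not: :#{type}"
          end
  klass.new(stream).call
end

out = StringIO.new
report(out, type: :html)
print out.string # prints "<div>html report</div>"
```

Passing a `StringIO` (or an open `File`) as the stream is what lets the same report land on STDOUT or in a file, as the method comment in the diff describes.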
data/lib/broken_link_finder/reporter/html_reporter.rb
ADDED
@@ -0,0 +1,134 @@
+# frozen_string_literal: true
+
+module BrokenLinkFinder
+  class HTMLReporter < Reporter
+    # Creates a new HTMLReporter instance.
+    # stream is any Object that responds to :puts and :print.
+    def initialize(stream, sort,
+                   broken_links, ignored_links,
+                   broken_link_map, crawl_stats)
+      super
+    end
+
+    # Pretty print a report detailing the full link summary.
+    def call(broken_verbose: true, ignored_verbose: false)
+      puts '<div class="broken_link_finder_report">'
+
+      report_crawl_summary
+      report_broken_links(verbose: broken_verbose)
+      report_ignored_links(verbose: ignored_verbose)
+
+      puts '</div>'
+
+      nil
+    end
+
+    private
+
+    # Report a summary of the overall crawl.
+    def report_crawl_summary
+      puts format(
+        '<p class="crawl_summary">Crawled %s (%s page(s) in %s seconds)</p>',
+        @crawl_stats[:url],
+        @crawl_stats[:num_pages],
+        @crawl_stats[:duration]&.truncate(2)
+      )
+    end
+
+    # Report a summary of the broken links.
+    def report_broken_links(verbose: true)
+      puts '<div class="broken_links">'
+
+      if @broken_links.empty?
+        puts_summary 'Good news, there are no broken links!', type: :broken
+      else
+        num_pages, num_links = get_hash_stats(@broken_links)
+        puts_summary "Found #{num_links} broken link(s) across #{num_pages} page(s):", type: :broken
+
+        @broken_links.each do |key, values|
+          puts_group(key, type: :broken) # Puts the opening <p> element.
+
+          if verbose || (values.length <= NUM_VALUES)
+            values.each { |value| puts_group_item value, type: :broken }
+          else # Only print N values and summarise the rest.
+            NUM_VALUES.times { |i| puts_group_item values[i], type: :broken }
+
+            objects = sort_by_page? ? 'link(s)' : 'page(s)'
+            puts "+ #{values.length - NUM_VALUES} other #{objects}, remove --concise to see them all<br />"
+          end
+
+          puts '</p>'
+        end
+      end
+
+      puts '</div>'
+    end
+
+    # Report a summary of the ignored links.
+    def report_ignored_links(verbose: false)
+      puts '<div class="ignored_links">'
+
+      if @ignored_links.any?
+        num_pages, num_links = get_hash_stats(@ignored_links)
+        puts_summary "Ignored #{num_links} unsupported link(s) across #{num_pages} page(s), which you should check manually:", type: :ignored
+
+        @ignored_links.each do |key, values|
+          puts_group(key, type: :ignored) # Puts the opening <p> element.
+
+          if verbose || (values.length <= NUM_VALUES)
+            values.each { |value| puts_group_item value, type: :ignored }
+          else # Only print N values and summarise the rest.
+            NUM_VALUES.times { |i| puts_group_item values[i], type: :ignored }
+
+            objects = sort_by_page? ? 'link(s)' : 'page(s)'
+            puts "+ #{values.length - NUM_VALUES} other #{objects}, use --verbose to see them all<br />"
+          end
+
+          puts '</p>'
+        end
+      end
+
+      puts '</div>'
+    end
+
+    def puts_summary(text, type:)
+      klass = (type == :broken) ? 'broken_links_summary' : 'ignored_links_summary'
+      puts "<p class=\"#{klass}\">#{text}</p>"
+    end
+
+    def puts_group(link, type:)
+      href = build_url(link)
+      a_element = "<a href=\"#{href}\">#{link}</a>"
+
+      case type
+      when :broken
+        msg = sort_by_page? ?
+          "The following broken links were found on '#{a_element}':" :
+          "The broken link '#{a_element}' was found on the following pages:"
+        klass = 'broken_links_group'
+      when :ignored
+        msg = sort_by_page? ?
+          "The following links were ignored on '#{a_element}':" :
+          "The link '#{a_element}' was ignored on the following pages:"
+        klass = 'ignored_links_group'
+      else
+        raise "type: must be :broken or :ignored, not: #{type}"
+      end
+
+      puts "<p class=\"#{klass}\">"
+      puts msg + '<br />'
+    end
+
+    def puts_group_item(value, type:)
+      klass = (type == :broken) ? 'broken_links_group_item' : 'ignored_links_group_item'
+      puts "<a class=\"#{klass}\" href=\"#{build_url(value)}\">#{value}</a><br />"
+    end
+
+    def build_url(link)
+      return link if link.to_url.absolute?
+      @broken_link_map.fetch(link)
+    end
+
+    alias_method :report, :call
+  end
+end
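The reporter's crawl summary is a single `format` call over the new `crawl_stats` Hash. Rebuilt standalone with sample stats (the values here are invented to match the README example, not produced by a real crawl):

```ruby
# Sample stats mirroring the Finder#crawl_stats keys used by the reporters.
crawl_stats = { url: 'http://txti.es', num_pages: 7, duration: 7.8812 }

# The same format string HTMLReporter#report_crawl_summary uses; &.truncate(2)
# trims the duration to two decimal places and tolerates a nil duration.
summary = format(
  '<p class="crawl_summary">Crawled %s (%s page(s) in %s seconds)</p>',
  crawl_stats[:url],
  crawl_stats[:num_pages],
  crawl_stats[:duration]&.truncate(2) # 7.8812 -> 7.88
)

puts summary
# => <p class="crawl_summary">Crawled http://txti.es (7 page(s) in 7.88 seconds)</p>
```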
data/lib/broken_link_finder/reporter/reporter.rb
ADDED
@@ -0,0 +1,77 @@
+# frozen_string_literal: true
+
+module BrokenLinkFinder
+  # Generic reporter class to be inherited from by format specific reporters.
+  class Reporter
+    # The amount of pages/links to display when verbose is false.
+    NUM_VALUES = 3
+
+    # Creates a new Reporter instance.
+    # stream is any Object that responds to :puts and :print.
+    def initialize(stream, sort,
+                   broken_links, ignored_links,
+                   broken_link_map, crawl_stats)
+      unless stream.respond_to?(:puts) && stream.respond_to?(:print)
+        raise 'stream must respond_to? :puts and :print'
+      end
+      raise "sort by either :page or :link, not #{sort}" \
+      unless %i[page link].include?(sort)
+
+      @stream = stream
+      @sort = sort
+      @broken_links = broken_links
+      @ignored_links = ignored_links
+      @broken_link_map = broken_link_map
+      @crawl_stats = crawl_stats
+    end
+
+    # Pretty print a report detailing the full link summary.
+    def call(broken_verbose: true, ignored_verbose: false)
+      raise 'Not implemented by parent class'
+    end
+
+    protected
+
+    # Return true if the sort is by page.
+    def sort_by_page?
+      @sort == :page
+    end
+
+    # Returns the key/value statistics of hash e.g. the number of keys and
+    # combined values. The hash should be of the format: { 'str' => [...] }.
+    # Use like: `num_pages, num_links = get_hash_stats(links)`.
+    def get_hash_stats(hash)
+      num_keys = hash.keys.length
+      values = hash.values.flatten
+      num_values = sort_by_page? ? values.length : values.uniq.length
+
+      sort_by_page? ?
+        [num_keys, num_values] :
+        [num_values, num_keys]
+    end
+
+    # Prints the text. Defaults to a blank line.
+    def print(text = '')
+      @stream.print(text)
+    end
+
+    # Prints the text + \n. Defaults to a blank line.
+    def puts(text = '')
+      @stream.puts(text)
+    end
+
+    # Prints text + \n\n.
+    def putsn(text)
+      puts(text)
+      puts
+    end
+
+    # Prints \n + text + \n.
+    def nputs(text)
+      puts
+      puts(text)
+    end
+
+    alias_method :report, :call
+  end
+end
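`get_hash_stats` is where the "N link(s) across M page(s)" numbers come from: whichever way the Hash is sorted, the keys give one count and the flattened values the other. A runnable copy of that logic, extracted from the class so it works standalone (the `sort_by_page:` parameter replaces the instance's `sort_by_page?`, and the sample data is invented):

```ruby
# get_hash_stats from the new Reporter base class, as a standalone method.
# Returns [num_pages, num_links] regardless of which way the Hash is keyed.
def get_hash_stats(hash, sort_by_page: true)
  num_keys = hash.keys.length
  values = hash.values.flatten
  num_values = sort_by_page ? values.length : values.uniq.length

  sort_by_page ? [num_keys, num_values] : [num_values, num_keys]
end

# Sorted by page: keys are pages, values are the broken links found on each.
broken_by_page = {
  'http://txti.es/about' => ['/doesntexist', 'http://twitter.com/nwbld'],
  'http://txti.es/how'   => ['/doesntexist']
}
num_pages, num_links = get_hash_stats(broken_by_page)
puts "#{num_pages} page(s), #{num_links} link(s)" # => 2 page(s), 3 link(s)

# Sorted by link: keys are links, values are the pages they appear on.
# Duplicate pages are uniq'd so the page count isn't inflated.
broken_by_link = {
  '/doesntexist'             => ['http://txti.es/about', 'http://txti.es/how'],
  'http://twitter.com/nwbld' => ['http://txti.es/about']
}
num_pages, num_links = get_hash_stats(broken_by_link, sort_by_page: false)
puts "#{num_pages} page(s), #{num_links} link(s)" # => 2 page(s), 2 link(s)
```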
data/lib/broken_link_finder/reporter/text_reporter.rb
ADDED
@@ -0,0 +1,86 @@
+# frozen_string_literal: true
+
+module BrokenLinkFinder
+  class TextReporter < Reporter
+    # Creates a new TextReporter instance.
+    # stream is any Object that responds to :puts and :print.
+    def initialize(stream, sort,
+                   broken_links, ignored_links,
+                   broken_link_map, crawl_stats)
+      super
+    end
+
+    # Pretty print a report detailing the full link summary.
+    def call(broken_verbose: true, ignored_verbose: false)
+      report_crawl_summary
+      report_broken_links(verbose: broken_verbose)
+      report_ignored_links(verbose: ignored_verbose)
+
+      nil
+    end
+
+    private
+
+    # Report a summary of the overall crawl.
+    def report_crawl_summary
+      putsn format(
+        'Crawled %s (%s page(s) in %s seconds)',
+        @crawl_stats[:url],
+        @crawl_stats[:num_pages],
+        @crawl_stats[:duration]&.truncate(2)
+      )
+    end
+
+    # Report a summary of the broken links.
+    def report_broken_links(verbose: true)
+      if @broken_links.empty?
+        puts 'Good news, there are no broken links!'
+      else
+        num_pages, num_links = get_hash_stats(@broken_links)
+        puts "Found #{num_links} broken link(s) across #{num_pages} page(s):"
+
+        @broken_links.each do |key, values|
+          msg = sort_by_page? ?
+            "The following broken links were found on '#{key}':" :
+            "The broken link '#{key}' was found on the following pages:"
+          nputs msg
+
+          if verbose || (values.length <= NUM_VALUES)
+            values.each { |value| puts value }
+          else # Only print N values and summarise the rest.
+            NUM_VALUES.times { |i| puts values[i] }
+
+            objects = sort_by_page? ? 'link(s)' : 'page(s)'
+            puts "+ #{values.length - NUM_VALUES} other #{objects}, remove --concise to see them all"
+          end
+        end
+      end
+    end
+
+    # Report a summary of the ignored links.
+    def report_ignored_links(verbose: false)
+      if @ignored_links.any?
+        num_pages, num_links = get_hash_stats(@ignored_links)
+        nputs "Ignored #{num_links} unsupported link(s) across #{num_pages} page(s), which you should check manually:"
+
+        @ignored_links.each do |key, values|
+          msg = sort_by_page? ?
+            "The following links were ignored on '#{key}':" :
+            "The link '#{key}' was ignored on the following pages:"
+          nputs msg
+
+          if verbose || (values.length <= NUM_VALUES)
+            values.each { |value| puts value }
+          else # Only print N values and summarise the rest.
+            NUM_VALUES.times { |i| puts values[i] }
+
+            objects = sort_by_page? ? 'link(s)' : 'page(s)'
+            puts "+ #{values.length - NUM_VALUES} other #{objects}, use --verbose to see them all"
+          end
+        end
+      end
+    end
+
+    alias_method :report, :call
+  end
+end
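Both new reporters share the same truncation rule around `NUM_VALUES`: print everything when verbose, otherwise the first three values plus a one-line count of the rest. A standalone sketch of that rule (the `stream:` parameter is my addition so the sketch is testable; the reporters write via their `@stream` instead):

```ruby
NUM_VALUES = 3 # Display limit when verbose is false, as in Reporter::NUM_VALUES.

# Print every value when verbose, otherwise the first NUM_VALUES values
# followed by a summary line counting what was left out.
def print_values(values, verbose: false, stream: $stdout)
  if verbose || (values.length <= NUM_VALUES)
    values.each { |value| stream.puts value }
  else # Only print N values and summarise the rest.
    NUM_VALUES.times { |i| stream.puts values[i] }
    stream.puts "+ #{values.length - NUM_VALUES} other link(s), use --verbose to see them all"
  end
end

print_values(%w[tel:+13174562564 mailto:big.jim@jmail.com ftp://server.com smb://share])
# Prints the first three links, then: + 1 other link(s), use --verbose to see them all
```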
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: broken_link_finder
 version: !ruby/object:Gem::Version
-  version: 0.9.5
+  version: 0.10.0
 platform: ruby
 authors:
 - Michael Telford
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2019-11-
+date: 2019-11-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -159,7 +159,9 @@ files:
 - exe/broken_link_finder
 - lib/broken_link_finder.rb
 - lib/broken_link_finder/finder.rb
-- lib/broken_link_finder/reporter.rb
+- lib/broken_link_finder/reporter/html_reporter.rb
+- lib/broken_link_finder/reporter/reporter.rb
+- lib/broken_link_finder/reporter/text_reporter.rb
 - lib/broken_link_finder/version.rb
 - lib/broken_link_finder/wgit_extensions.rb
 - load.rb
@@ -187,8 +189,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 - !ruby/object:Gem::Version
   version: '0'
 requirements: []
-
-rubygems_version: 2.7.6
+rubygems_version: 3.0.6
 signing_key:
 specification_version: 4
 summary: Finds a website's broken links and reports back to you with a summary.
data/lib/broken_link_finder/reporter.rb
DELETED
@@ -1,116 +0,0 @@
-# frozen_string_literal: true
-
-module BrokenLinkFinder
-  class Reporter
-    # The amount of pages/links to display when verbose is false.
-    NUM_VALUES = 3
-
-    # Creates a new Reporter instance.
-    # stream is any Object that responds to :puts.
-    def initialize(stream, sort, broken_links, ignored_links)
-      raise 'stream must respond_to? :puts' unless stream.respond_to?(:puts)
-      raise "sort by either :page or :link, not #{sort}" \
-      unless %i[page link].include?(sort)
-
-      @stream = stream
-      @sort = sort
-      @broken_links = broken_links
-      @ignored_links = ignored_links
-    end
-
-    # Pretty print a report detailing the link summary.
-    def pretty_print_link_report(broken_verbose: true, ignored_verbose: false)
-      report_broken_links(verbose: broken_verbose)
-      report_ignored_links(verbose: ignored_verbose)
-
-      nil
-    end
-
-    private
-
-    # Report a summary of the broken links.
-    def report_broken_links(verbose: true)
-      if @broken_links.empty?
-        print 'Good news, there are no broken links!'
-      else
-        num_pages, num_links = get_hash_stats(@broken_links)
-        print "Found #{num_links} broken link(s) across #{num_pages} page(s):"
-
-        @broken_links.each do |key, values|
-          msg = sort_by_page? ?
-            "The following broken links were found on '#{key}':" :
-            "The broken link '#{key}' was found on the following pages:"
-          nprint msg
-
-          if verbose || (values.length <= NUM_VALUES)
-            values.each { |value| print value }
-          else # Only print N values and summarise the rest.
-            NUM_VALUES.times { |i| print values[i] }
-
-            objects = sort_by_page? ? 'link(s)' : 'page(s)'
-            print "+ #{values.length - NUM_VALUES} other #{objects}, remove --concise to see them all"
-          end
-        end
-      end
-    end
-
-    # Report a summary of the ignored links.
-    def report_ignored_links(verbose: false)
-      if @ignored_links.any?
-        num_pages, num_links = get_hash_stats(@ignored_links)
-        nprint "Ignored #{num_links} unsupported link(s) across #{num_pages} page(s), which you should check manually:"
-
-        @ignored_links.each do |key, values|
-          msg = sort_by_page? ?
-            "The following links were ignored on '#{key}':" :
-            "The link '#{key}' was ignored on the following pages:"
-          nprint msg
-
-          if verbose || (values.length <= NUM_VALUES)
-            values.each { |value| print value }
-          else # Only print N values and summarise the rest.
-            NUM_VALUES.times { |i| print values[i] }
-
-            objects = sort_by_page? ? 'link(s)' : 'page(s)'
-            print "+ #{values.length - NUM_VALUES} other #{objects}, use --verbose to see them all"
-          end
-        end
-      end
-    end
-
-    # Return true if the sort is by page.
-    def sort_by_page?
-      @sort == :page
-    end
-
-    # Returns the key/value statistics of hash e.g. the number of keys and
-    # combined values. The hash should be of the format: { 'str' => [...] }.
-    # Use like: `num_pages, num_links = get_hash_stats(links)`.
-    def get_hash_stats(hash)
-      num_keys = hash.keys.length
-      values = hash.values.flatten
-      num_values = sort_by_page? ? values.length : values.uniq.length
-
-      sort_by_page? ?
-        [num_keys, num_values] :
-        [num_values, num_keys]
-    end
-
-    # Prints the text + \n. Defaults to a blank line.
-    def print(text = '')
-      @stream.puts(text)
-    end
-
-    # Prints text + \n\n.
-    def printn(text)
-      print(text)
-      print
-    end
-
-    # Prints \n + text + \n.
-    def nprint(text)
-      print
-      print(text)
-    end
-  end
-end