site_checker 0.1.1 → 0.2.0.pre
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/.rbenv-version +1 -0
- data/.rspec +1 -0
- data/History.md +9 -0
- data/LICENSE +29 -0
- data/README.md +102 -0
- data/gem_tasks/rspec.rake +6 -0
- data/gem_tasks/yard.rake +6 -0
- data/lib/site_checker/dsl.rb +17 -0
- data/lib/site_checker/io/content_from_file_system.rb +43 -0
- data/lib/site_checker/io/content_from_web.rb +36 -0
- data/lib/site_checker/link.rb +60 -0
- data/lib/site_checker/link_collector.rb +153 -0
- data/lib/site_checker/parse/page.rb +82 -0
- data/lib/site_checker.rb +90 -206
- data/site_checker.gemspec +24 -0
- data/spec/dsl_spec.rb +37 -0
- data/spec/integration_spec.rb +191 -0
- data/spec/site_checker/io/content_from_file_system_spec.rb +61 -0
- data/spec/site_checker/io/content_from_web_spec.rb +46 -0
- data/spec/site_checker/io/io_spec_helper.rb +22 -0
- data/spec/site_checker/link_collector_spec.rb +41 -0
- data/spec/site_checker/link_spec.rb +94 -0
- data/spec/site_checker/parse/page_spec.rb +71 -0
- data/spec/site_checker/parse/parse_spec_helper.rb +8 -0
- data/spec/spec_helper.rb +10 -0
- metadata +134 -66
data/.rbenv-version
ADDED
@@ -0,0 +1 @@
+1.9.3-p125
data/.rspec
ADDED
@@ -0,0 +1 @@
+--color
data/History.md
ADDED
data/LICENSE
ADDED
@@ -0,0 +1,29 @@
+New BSD License (3-clause license)
+
+Copyright (c) 2012, Zsolt Fabok
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+  * Redistributions of source code must retain the above copyright
+    notice, this list of conditions and the following disclaimer.
+
+  * Redistributions in binary form must reproduce the above copyright
+    notice, this list of conditions and the following disclaimer in the
+    documentation and/or other materials provided with the distribution.
+
+  * Neither the name of Zsolt Fabok nor the
+    names of its contributors may be used to endorse or promote products
+    derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER BE LIABLE FOR ANY
+DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
data/README.md
ADDED
@@ -0,0 +1,102 @@
+###Site Checker [](https://codeclimate.com/github/ZsoltFabok/site_checker) [](https://travis-ci.org/ZsoltFabok/site_checker) [](https://gemnasium.com/ZsoltFabok/site_checker)
+
+
+Site Checker is a simple ruby gem, which helps you check the integrity of your website by recursively visiting the referenced pages and images. I use it in my test environments to make sure that my websites don't have any dead links.
+
+### Install
+
+    gem install site_checker
+
+### Usage
+
+#### Default
+
+First, you have to load the `site_checker` by adding this line to the file where you would like to use it:
+
+    require 'site_checker'
+
+If you want to use it for testing, the line should goto the `test_helper.rb`.
+
+The usage is quite simple:
+
+    check_site("http://localhost:3000/app", "http://localhost:3000")
+    puts collected_remote_pages.inspect
+    puts collected_local_pages.inspect
+    puts collected_remote_images.inspect
+    puts collected_local_images.inspect
+    puts collected_problems.inspect
+
+The snippet above will open the `http://localhost:3000/app` link and will look for links and images. If it finds a link to a local page, it will recursively checkout out that page, too. The second argument - `http://localhost:3000` - defines the starting reference of your website.
+
+In case you don't want to use a DSL like API you can still do the following:
+
+    SiteChecker.check("http://localhost:3000/app", "http://localhost:3000")
+    puts SiteChecker.remote_pages.inspect
+    puts SiteChecker.local_pages.inspect
+    puts SiteChecker.remote_images.inspect
+    puts SiteChecker.local_images.inspect
+    puts SiteChecker.problems.inspect
+
+### Using on Generated Content
+If you have a static website (e.g. generated by [octopress](https://github.com/imathis/octopress)) you can tell `site_checker` to use folders from the file system. With this approach, you don't need a webserver for verifying your website:
+
+    check_site("./public", "./public")
+    puts collected_problems.inspect
+
+### Configuration
+You can instruct `site_checker` to ignore certain links:
+
+    SiteChecker.configure do |config|
+      config.ignore_list = ["/", "/atom.xml"]
+    end
+
+By default it won't check the conditions of the remote links and images - e.g. 404 or 500 -, but you can change it like this:
+
+    SiteChecker.configure do |config|
+      config.visit_references = true
+    end
+
+Too deep recursive calls may be expensive, so you can configure the maximum depth of the recursion with the following attribute:
+
+    SiteChecker.configure do |config|
+      config.max_recursion_depth = 3
+    end
+
+### Examples
+Make sure that there are no local dead links on the website (I'm using [rspec](https://github.com/rspec/rspec) syntax):
+
+    before(:each) do
+      SiteChecker.configure do |config|
+        config.ignore_list = ["/atom.xml", "/rss"]
+      end
+    end
+
+    it "should not have dead local links" do
+      check_site("http://localhost:3000", "http://localhost:3000")
+      # this will print out the difference and I don't have to re-run with print
+      collected_problems.should be_empty
+    end
+
+Check that all the local pages can be reached with maximum two steps:
+
+    before(:each) do
+      SiteChecker.configure do |config|
+        config.ignore_list = ["/atom.xml", "/rss"]
+        config.max_recursion_depth = 2
+      end
+
+      @number_of_local_pages = 100
+    end
+
+    it "all the local pages have to be visited" do
+      check_site("http://localhost:3000", "http://localhost:3000")
+      collected_local_pages.size.should eq @number_of_local_pages
+    end
+
+### Troubleshooting
+#### undefined method 'new' for SiteChecker:Module
+This error occurs when the test code calls v0.1.1 methods, but a newer version of the gem has already been installed. Update your test code following the examples above.
+
+### Copyright
+
+Copyright (c) 2012 Zsolt Fabok and Contributors. See LICENSE for details.
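The README above configures the gem through `SiteChecker.configure do |config| ... end`. The method itself lives in `lib/site_checker.rb`, whose hunk is not part of this listing, so what follows is only a minimal sketch of how a block-based configuration of this kind can work. `DemoChecker` and `DemoConfig` are hypothetical names; the three attributes and their defaults are taken from `LinkCollector#initialize` in the `link_collector.rb` hunk.

```ruby
# Hypothetical stand-in for the gem's configuration object (not the gem's code).
class DemoConfig
  attr_accessor :ignore_list, :visit_references, :max_recursion_depth

  def initialize
    # Let the caller set attributes first ...
    yield self if block_given?
    # ... then fall back to defaults for anything left unset.
    @ignore_list ||= []
    @visit_references ||= false
    @max_recursion_depth ||= -1
  end
end

# Hypothetical module exposing a configure block, in the style of the README.
module DemoChecker
  def self.configure(&block)
    @config = DemoConfig.new(&block)
  end

  def self.config
    @config ||= DemoConfig.new
  end
end

DemoChecker.configure do |config|
  config.ignore_list = ["/", "/atom.xml"]
  config.max_recursion_depth = 3
end

p DemoChecker.config.ignore_list
p DemoChecker.config.visit_references
p DemoChecker.config.max_recursion_depth
```

Each call to `configure` in this sketch builds a fresh object, so settings from an earlier call do not carry over. Whether the gem behaves the same way cannot be told from the hunks shown here.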
data/gem_tasks/yard.rake
ADDED
data/lib/site_checker/dsl.rb
ADDED
@@ -0,0 +1,17 @@
+module SiteChecker
+  module DSL
+    { :check_site => :check,
+      :collected_local_pages => :local_pages,
+      :collected_remote_pages => :remote_pages,
+      :collected_local_images => :local_images,
+      :collected_remote_images => :remote_images,
+      :collected_problems => :problems
+    }.each do |dsl_method, method|
+      define_method dsl_method do |*args, &block|
+        SiteChecker.send method, *args, &block
+      end
+    end
+  end
+end
+
+include SiteChecker::DSL
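The DSL hunk above generates one forwarding method per entry of a name-to-name hash. The same pattern can be tried in isolation with the sketch below; `DemoBackend`, `DemoDSL`, `DemoSpec` and the method names are hypothetical stand-ins, not part of the gem.

```ruby
# Hypothetical backend standing in for the SiteChecker module.
module DemoBackend
  def self.check(url, root)
    @last_check = [url, root]
  end

  def self.last_check
    @last_check
  end

  def self.problems
    { "/index.html" => ["/missing.html (404 Not Found)"] }
  end
end

module DemoDSL
  { :check_demo_site => :check,
    :demo_problems   => :problems
  }.each do |dsl_method, method|
    # Each generated method forwards its arguments and block to the backend.
    define_method(dsl_method) do |*args, &block|
      DemoBackend.send(method, *args, &block)
    end
  end
end

# The sketch mixes the DSL into one class instead of the top-level object.
class DemoSpec
  include DemoDSL
end

spec = DemoSpec.new
spec.check_demo_site("./public", "./public")
p DemoBackend.last_check
p spec.demo_problems
```

The diffed file ends with a top-level `include`, which makes the DSL methods available on every object. Including the module only where it is needed, as in the sketch, is a narrower alternative.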
data/lib/site_checker/io/content_from_file_system.rb
ADDED
@@ -0,0 +1,43 @@
+module SiteChecker
+  module IO
+    class ContentFromFileSystem
+
+      def initialize(visit_references, root)
+        @visit_references = visit_references
+        @root = root
+      end
+
+      def get(link)
+        begin
+          location = create_absolute_reference(link.url)
+          if link.local_page?
+            content = File.open(add_index_html(location)).read
+          elsif link.local_image?
+            File.open(location)
+          elsif @visit_references
+            open(link.url)
+          end
+        rescue Errno::ENOENT => e
+          raise "(404 Not Found)"
+        rescue => e
+          raise "(#{e.message.strip})"
+        end
+        content
+      end
+
+      private
+      def add_index_html(path)
+        path = $1 if path.match(/(.+)#/)
+        path.end_with?(".html") ? path : File.join(path, "index.html")
+      end
+
+      def create_absolute_reference(link)
+        if !link.eql?(@root)
+          File.join(@root, link)
+        else
+          @root
+        end
+      end
+    end
+  end
+end
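The two private helpers in the hunk above turn a link into a file path: the link is joined onto the root, any `#fragment` is dropped, and `index.html` is appended unless the path already ends in `.html`. The sketch below folds those steps into one hypothetical function, `demo_page_path`, and strips the fragment with `sub` rather than the `$1` capture used in the diff.

```ruby
# Hypothetical helper combining create_absolute_reference and add_index_html.
def demo_page_path(root, link)
  # The root itself is used as-is; everything else is joined onto it.
  location = link == root ? root : File.join(root, link)
  # Drop a "#fragment" suffix, if present.
  location = location.sub(/#.*\z/, "")
  # Directories are assumed to be served through their index.html.
  location.end_with?(".html") ? location : File.join(location, "index.html")
end

p demo_page_path("./public", "./public")
p demo_page_path("./public", "/blog")
p demo_page_path("./public", "/about.html#team")
```

`File.join` collapses the doubled separator, so a link that starts with `/` still resolves below the root directory.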
data/lib/site_checker/io/content_from_web.rb
ADDED
@@ -0,0 +1,36 @@
+module SiteChecker
+  module IO
+    class ContentFromWeb
+
+      def initialize(visit_references, root)
+        @visit_references = visit_references
+        @root = root
+      end
+
+      def get(link)
+        begin
+          uri = create_absolute_reference(link.url)
+          if link.local_page?
+            content = open(uri)
+          elsif link.local_image?
+            open(uri)
+          elsif @visit_references
+            open(uri)
+          end
+        rescue => e
+          raise "(#{e.message.strip})"
+        end
+        content
+      end
+
+      private
+      def create_absolute_reference(link)
+        if link.start_with?(@root)
+          URI(link)
+        else
+          URI(@root).merge(link)
+        end
+      end
+    end
+  end
+end
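`create_absolute_reference` in the hunk above relies on `URI#merge` to resolve a relative link against the root. The sketch below exercises only that resolution step with the standard library; `demo_absolute_uri` is a hypothetical name and no network access is involved.

```ruby
require "uri"

# Hypothetical helper mirroring create_absolute_reference from the hunk above.
def demo_absolute_uri(root, link)
  # Links that already start with the root are parsed as-is;
  # everything else is resolved relative to the root.
  link.start_with?(root) ? URI(link) : URI(root).merge(link)
end

p demo_absolute_uri("http://localhost:3000", "http://localhost:3000/app").to_s
p demo_absolute_uri("http://localhost:3000", "/app/page.html").to_s
p demo_absolute_uri("http://localhost:3000/app/", "page.html").to_s
```

The diffed code passes the resulting URI to `Kernel#open`, which depends on `open-uri` and matches the Ruby 1.9.3 target named in `.rbenv-version`. On Ruby 3.0 and later, `Kernel#open` no longer opens URLs, so a port would presumably call `URI.open` instead.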
data/lib/site_checker/link.rb
ADDED
@@ -0,0 +1,60 @@
+module SiteChecker
+  class Link
+    attr_accessor :url
+    attr_accessor :parent_url
+    attr_accessor :kind
+    attr_accessor :location
+    attr_accessor :problem
+
+    def eql?(other)
+      ignore_trailing_slash(@url).eql? ignore_trailing_slash(other.url)
+    end
+
+    def ==(other)
+      eql?(other)
+    end
+
+    def hash
+      ignore_trailing_slash(@url).hash
+    end
+
+    def self.create(attrs)
+      link = Link.new
+      attrs.each do |key, value|
+        if self.instance_methods.include?("#{key}=".to_sym)
+          eval("link.#{key}=value")
+        end
+      end
+      link
+    end
+
+    def has_problem?
+      @problem != nil
+    end
+
+    def local_page?
+      @location == :local && @kind == :page
+    end
+
+    def local_image?
+      @location == :local && @kind == :image
+    end
+
+    def anchor?
+      @kind == :anchor
+    end
+
+    def anchor_ref?
+      @kind == :anchor_ref
+    end
+
+    def anchor_related?
+      anchor? || anchor_ref?
+    end
+
+    private
+    def ignore_trailing_slash(url)
+      url.gsub(/^\//,"")
+    end
+  end
+end
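`Link` overrides `eql?`, `==` and `hash` together, which is what lets `Array#include?` and `Array#uniq` treat two links with equivalent URLs as the same entry. A minimal sketch of that contract follows. `DemoLink` is a hypothetical class, and its normalization (dropping a trailing slash) is an assumption about the helper's intent, not a copy of the diffed code.

```ruby
# Hypothetical value object; only the equality contract is modelled.
class DemoLink
  attr_reader :url

  def initialize(url)
    @url = url
  end

  # eql? and hash must agree so that uniq and Hash lookups work.
  def eql?(other)
    normalized == other.normalized
  end
  alias_method :==, :eql?

  def hash
    normalized.hash
  end

  protected

  # Assumption: "/about" and "/about/" name the same page.
  def normalized
    url.chomp("/")
  end
end

links = [DemoLink.new("/about"), DemoLink.new("/about/"), DemoLink.new("/blog")]
p links.uniq.map(&:url)
p links.include?(DemoLink.new("/blog/"))
```

The pattern in the hunk, `/^\//`, matches a slash at the start of a line, so `ignore_trailing_slash` as diffed removes a leading slash despite its name. The sketch removes a trailing one instead.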
data/lib/site_checker/link_collector.rb
ADDED
@@ -0,0 +1,153 @@
+module SiteChecker
+  class LinkCollector
+    attr_accessor :ignore_list, :visit_references, :max_recursion_depth
+
+    def initialize
+      yield self if block_given?
+      @ignore_list ||= []
+      @visit_references ||= false
+      @max_recursion_depth ||= -1
+    end
+
+    def check(url, root)
+      @links = []
+      @recursion_depth = 0
+      @root = root
+
+      @content_reader = get_content_reader
+
+      link = Link.create({:url => url, :kind => :page, :location => :local})
+      register_visit(link)
+      process_local_page(link)
+      evaluate_anchors
+    end
+
+    def local_pages
+      get_urls(:local, :page)
+    end
+
+    def remote_pages
+      get_urls(:remote, :page)
+    end
+
+    def local_images
+      get_urls(:local, :image)
+    end
+
+    def remote_images
+      get_urls(:remote, :image)
+    end
+
+    def problems
+      problems = {}
+      @links.each do |link|
+        if link.has_problem?
+          problems[link.parent_url] ||= []
+          problems[link.parent_url] << "#{link.url} #{link.problem}"
+        end
+      end
+      problems
+    end
+
+    private
+    def get_content_reader
+      if URI(@root).absolute?
+        SiteChecker::IO::ContentFromWeb.new(@visit_references, @root)
+      else
+        SiteChecker::IO::ContentFromFileSystem.new(@visit_references, @root)
+      end
+    end
+
+    def get_urls(location, kind)
+      @links.find_all do |link|
+        if link.location == location && link.kind == kind
+          link
+        end
+      end.map do |link|
+        link.url
+      end
+    end
+
+    def process_local_page(parent)
+      links = collect_links(parent)
+
+      links.each do |link|
+        link.parent_url = parent.url
+        unless link.anchor_related?
+          visit(link) unless visited?(link)
+        else
+          @links << link
+        end
+      end
+    end
+
+    def register_visit(link)
+      @links << link unless visited?(link)
+    end
+
+    def visited?(link)
+      @links.include?(link)
+    end
+
+    def visit(link)
+      register_visit(link)
+      unless link.local_page?
+        open_reference(link)
+      else
+        unless stop_recursion?
+          @recursion_depth += 1
+          process_local_page(link)
+          @recursion_depth -= 1
+        end
+      end
+    end
+
+    def open_reference(link)
+      content = nil
+      begin
+        content = @content_reader.get(link)
+      rescue => e
+        link.problem = "#{e.message.strip}"
+      end
+      content
+    end
+
+    def collect_links(link)
+      content = open_reference(link)
+      return SiteChecker::Parse::Page.parse(content, @ignore_list, @root)
+    end
+
+    def stop_recursion?
+      if @max_recursion_depth == -1
+        false
+      elsif @max_recursion_depth > @recursion_depth
+        false
+      else
+        true
+      end
+    end
+
+    def evaluate_anchors
+      anchors = @links.find_all {|link| link.anchor?}
+      anchor_references = @links.find_all {|link| link.anchor_ref?}
+      anchor_references.each do |anchor_ref|
+        if find_matching_anchor(anchors, anchor_ref).empty?
+          anchor_ref.problem = "(404 Not Found)"
+        end
+      end
+    end
+
+    def find_matching_anchor(anchors, anchor_ref)
+      result = []
+      anchors.each do |anchor|
+        if (anchor.parent_url == anchor_ref.parent_url &&
+          anchor_ref.url == "##{anchor.url}") ||
+          (anchor.parent_url != anchor_ref.parent_url &&
+          anchor_ref.url == "#{anchor.parent_url}##{anchor.url}")
+          result << anchor
+        end
+      end
+      result
+    end
+  end
+end
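The traversal in `visit` and `stop_recursion?` can be hard to follow because the depth counter lives in an instance variable. The sketch below replays the same control flow on an in-memory link graph, so the effect of `max_recursion_depth` can be observed without a web server. `DEMO_SITE` and `demo_crawl` are hypothetical, and the sketch leaves out images, anchors and problem reporting.

```ruby
# Hypothetical in-memory site: each page maps to the pages it links to.
DEMO_SITE = {
  "/"      => ["/a", "/b"],
  "/a"     => ["/a/1"],
  "/a/1"   => ["/a/1/x"],
  "/a/1/x" => [],
  "/b"     => []
}

# Returns the pages registered while crawling from start.
def demo_crawl(site, start, max_depth)
  visited = [start]
  depth = 0
  walk = lambda do |page|
    site.fetch(page, []).each do |child|
      next if visited.include?(child)
      # A page is registered even when it is not descended into.
      visited << child
      # -1 means "no limit"; otherwise descend only while below the limit.
      next unless max_depth == -1 || max_depth > depth
      depth += 1
      walk.call(child)
      depth -= 1
    end
  end
  walk.call(start)
  visited
end

p demo_crawl(DEMO_SITE, "/", 0)
p demo_crawl(DEMO_SITE, "/", 1)
p demo_crawl(DEMO_SITE, "/", -1)
```

In this sketch a linked page is recorded before the depth check, so pages one link beyond the limit still appear in the result even though their own links are not followed.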
data/lib/site_checker/parse/page.rb
ADDED
@@ -0,0 +1,82 @@
+module SiteChecker
+  module Parse
+    class Page
+      def self.parse(content, ignore_list, root)
+        links = []
+        page = Nokogiri(content)
+
+        links.concat(get_links(page, ignore_list, root))
+        links.concat(get_images(page, ignore_list, root))
+        links.concat(get_anchors(page))
+        links.concat(local_pages_which_has_anchor_references(links, root))
+
+        links.uniq
+      end
+
+      private
+      def self.get_links(page, ignore_list, root)
+        links = []
+        page.xpath("//a").reject {|a| ignored?(ignore_list, a['href'])}.each do |a|
+          if a['href'].match(/(.*)#.+/) && !URI($1).absolute?
+            kind = :anchor_ref
+          else
+            kind = :page
+          end
+          links << Link.create({:url => a['href'], :kind => kind})
+        end
+        set_location(links, root)
+      end
+
+      def self.get_images(page, ignore_list, root)
+        links = []
+        page.xpath("//img").reject {|img| ignored?(ignore_list, img['src'])}.each do |img|
+          links << Link.create({:url => img['src'], :kind => :image})
+        end
+        set_location(links, root)
+      end
+
+      def self.set_location(links, root)
+        links.each do |link|
+          uri = URI(link.url)
+          if uri.to_s.start_with?(root)
+            link.problem = "(absolute path)"
+            link.location = :local
+          else
+            if uri.absolute?
+              link.location = :remote
+            else
+              link.location = :local
+            end
+          end
+        end
+      end
+
+      def self.ignored?(ignore_list, link)
+        if link
+          ignore_list.include? link
+        else
+          true
+        end
+      end
+
+      def self.get_anchors(page)
+        anchors = []
+        page.xpath("//a").reject {|a| !a['id']}.each do |a|
+          anchors << Link.create({:url => a['id'], :kind => :anchor})
+        end
+        anchors
+      end
+
+      def self.local_pages_which_has_anchor_references(links, root)
+        new_links = []
+        links.find_all {|link| link.anchor_ref?}.each do |link|
+          uri = URI(link.url)
+          if link.url.match(/(.+)#/)
+            new_links << Link.create({:url => $1, :kind => :page})
+          end
+        end
+        set_location(new_links, root)
+      end
+    end
+  end
+end
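`set_location` in the hunk above sorts every link into one of three cases: a URL that repeats the site root is local but flagged with `(absolute path)`, any other absolute URL is remote, and everything else is local. The sketch below reproduces only that decision with the standard library. `demo_location` is a hypothetical helper that returns a `[location, problem]` pair instead of mutating a `Link`.

```ruby
require "uri"

# Hypothetical helper mirroring the branches of set_location.
def demo_location(url, root)
  uri = URI(url)
  if uri.to_s.start_with?(root)
    # Points into the site itself, but spelled with the full root URL.
    [:local, "(absolute path)"]
  elsif uri.absolute?
    # Has a scheme and does not start with the root: another site.
    [:remote, nil]
  else
    [:local, nil]
  end
end

root = "http://localhost:3000"
p demo_location("/about.html", root)
p demo_location("http://example.org/", root)
p demo_location("http://localhost:3000/about.html", root)
```

Because the root check is a plain string prefix comparison, a root of `http://localhost:3000` would also match `http://localhost:30000/`. This holds for the sketch and, as far as the hunk shows, for the diffed method as well.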