site_checker 0.1.1 → 0.2.0.pre
- data/.rbenv-version +1 -0
- data/.rspec +1 -0
- data/History.md +9 -0
- data/LICENSE +29 -0
- data/README.md +102 -0
- data/gem_tasks/rspec.rake +6 -0
- data/gem_tasks/yard.rake +6 -0
- data/lib/site_checker/dsl.rb +17 -0
- data/lib/site_checker/io/content_from_file_system.rb +43 -0
- data/lib/site_checker/io/content_from_web.rb +36 -0
- data/lib/site_checker/link.rb +60 -0
- data/lib/site_checker/link_collector.rb +153 -0
- data/lib/site_checker/parse/page.rb +82 -0
- data/lib/site_checker.rb +90 -206
- data/site_checker.gemspec +24 -0
- data/spec/dsl_spec.rb +37 -0
- data/spec/integration_spec.rb +191 -0
- data/spec/site_checker/io/content_from_file_system_spec.rb +61 -0
- data/spec/site_checker/io/content_from_web_spec.rb +46 -0
- data/spec/site_checker/io/io_spec_helper.rb +22 -0
- data/spec/site_checker/link_collector_spec.rb +41 -0
- data/spec/site_checker/link_spec.rb +94 -0
- data/spec/site_checker/parse/page_spec.rb +71 -0
- data/spec/site_checker/parse/parse_spec_helper.rb +8 -0
- data/spec/spec_helper.rb +10 -0
- metadata +134 -66
data/.rbenv-version
ADDED
@@ -0,0 +1 @@
+1.9.3-p125
data/.rspec
ADDED
@@ -0,0 +1 @@
+--color
data/History.md
ADDED
data/LICENSE
ADDED
@@ -0,0 +1,29 @@
+New BSD License (3-clause license)
+
+Copyright (c) 2012, Zsolt Fabok
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright
+  notice, this list of conditions and the following disclaimer.
+
+* Redistributions in binary form must reproduce the above copyright
+  notice, this list of conditions and the following disclaimer in the
+  documentation and/or other materials provided with the distribution.
+
+* Neither the name of Zsolt Fabok nor the
+  names of its contributors may be used to endorse or promote products
+  derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER BE LIABLE FOR ANY
+DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
data/README.md
ADDED
@@ -0,0 +1,102 @@
+### Site Checker [![Code Climate](https://codeclimate.com/badge.png)](https://codeclimate.com/github/ZsoltFabok/site_checker) [![Build Status](https://travis-ci.org/ZsoltFabok/site_checker.png)](https://travis-ci.org/ZsoltFabok/site_checker) [![Dependency Status](https://gemnasium.com/ZsoltFabok/site_checker.png)](https://gemnasium.com/ZsoltFabok/site_checker)
+
+
+Site Checker is a simple Ruby gem that helps you check the integrity of your website by recursively visiting the referenced pages and images. I use it in my test environments to make sure that my websites don't have any dead links.
+
+### Install
+
+    gem install site_checker
+
+### Usage
+
+#### Default
+
+First, load `site_checker` by adding this line to the file where you would like to use it:
+
+    require 'site_checker'
+
+If you want to use it for testing, the line should go in `test_helper.rb`.
+
+The usage is quite simple:
+
+    check_site("http://localhost:3000/app", "http://localhost:3000")
+    puts collected_remote_pages.inspect
+    puts collected_local_pages.inspect
+    puts collected_remote_images.inspect
+    puts collected_local_images.inspect
+    puts collected_problems.inspect
+
+The snippet above will open the `http://localhost:3000/app` link and look for links and images. If it finds a link to a local page, it will recursively check that page, too. The second argument - `http://localhost:3000` - defines the starting reference of your website.
+
+If you don't want to use the DSL-like API, you can still do the following:
+
+    SiteChecker.check("http://localhost:3000/app", "http://localhost:3000")
+    puts SiteChecker.remote_pages.inspect
+    puts SiteChecker.local_pages.inspect
+    puts SiteChecker.remote_images.inspect
+    puts SiteChecker.local_images.inspect
+    puts SiteChecker.problems.inspect
+
+### Using on Generated Content
+If you have a static website (e.g. generated by [octopress](https://github.com/imathis/octopress)) you can tell `site_checker` to use folders from the file system. With this approach, you don't need a webserver for verifying your website:
+
+    check_site("./public", "./public")
+    puts collected_problems.inspect
+
+### Configuration
+You can instruct `site_checker` to ignore certain links:
+
+    SiteChecker.configure do |config|
+      config.ignore_list = ["/", "/atom.xml"]
+    end
+
+By default it won't check the status of remote links and images (e.g. 404 or 500), but you can change that like this:
+
+    SiteChecker.configure do |config|
+      config.visit_references = true
+    end
+
+Deep recursion may be expensive, so you can configure the maximum recursion depth with the following attribute:
+
+    SiteChecker.configure do |config|
+      config.max_recursion_depth = 3
+    end
+
+### Examples
+Make sure that there are no local dead links on the website (I'm using [rspec](https://github.com/rspec/rspec) syntax):
+
+    before(:each) do
+      SiteChecker.configure do |config|
+        config.ignore_list = ["/atom.xml", "/rss"]
+      end
+    end
+
+    it "should not have dead local links" do
+      check_site("http://localhost:3000", "http://localhost:3000")
+      # this will print out the difference and I don't have to re-run with print
+      collected_problems.should be_empty
+    end
+
+Check that all the local pages can be reached within at most two steps:
+
+    before(:each) do
+      SiteChecker.configure do |config|
+        config.ignore_list = ["/atom.xml", "/rss"]
+        config.max_recursion_depth = 2
+      end
+
+      @number_of_local_pages = 100
+    end
+
+    it "all the local pages have to be visited" do
+      check_site("http://localhost:3000", "http://localhost:3000")
+      collected_local_pages.size.should eq @number_of_local_pages
+    end
+
+### Troubleshooting
+#### undefined method 'new' for SiteChecker:Module
+This error occurs when the test code calls v0.1.1 methods, but a newer version of the gem has already been installed. Update your test code following the examples above.
+
+### Copyright
+
+Copyright (c) 2012 Zsolt Fabok and Contributors. See LICENSE for details.
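Reviewer note: the `SiteChecker.configure do |config| ... end` blocks in the README follow Ruby's yield-self configuration idiom. The sketch below is a minimal stand-in, assuming a hypothetical `Settings` class (not part of the gem) that carries the three documented attributes and applies the same defaults that `LinkCollector#initialize` in this diff uses.

```ruby
# Hypothetical stand-in for the object yielded to a configure block.
class Settings
  attr_accessor :ignore_list, :visit_references, :max_recursion_depth

  def initialize
    # Let the caller set whatever it wants first...
    yield self if block_given?
    # ...then fill in defaults for anything left unset.
    @ignore_list ||= []
    @visit_references ||= false
    @max_recursion_depth ||= -1 # -1 means "no depth limit"
  end
end

settings = Settings.new do |config|
  config.ignore_list = ["/", "/atom.xml"]
  config.max_recursion_depth = 3
end

p settings.ignore_list          # ["/", "/atom.xml"]
p settings.visit_references     # false
p settings.max_recursion_depth  # 3
```

Because the defaults are applied with `||=` after the block runs, a caller only has to mention the attributes it wants to change.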
data/gem_tasks/yard.rake
ADDED
data/lib/site_checker/dsl.rb
ADDED
@@ -0,0 +1,17 @@
+module SiteChecker
+  module DSL
+    { :check_site => :check,
+      :collected_local_pages => :local_pages,
+      :collected_remote_pages => :remote_pages,
+      :collected_local_images => :local_images,
+      :collected_remote_images => :remote_images,
+      :collected_problems => :problems
+    }.each do |dsl_method, method|
+      define_method dsl_method do |*args, &block|
+        SiteChecker.send method, *args, &block
+      end
+    end
+  end
+end
+
+include SiteChecker::DSL
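Reviewer note: the DSL module generates one forwarding method per entry of a name map. The sketch below shows the same `define_method` delegation in isolation; `Backend`, `ShortNames` and `Spec` are hypothetical names standing in for `SiteChecker`, `SiteChecker::DSL` and a test class.

```ruby
# Hypothetical backend whose methods we want to expose under other names.
module Backend
  def self.check(url, root)
    "checked #{url} against #{root}"
  end

  def self.problems
    {}
  end
end

# One forwarding method is generated for each { alias => target } pair.
module ShortNames
  { :check_site => :check,
    :collected_problems => :problems
  }.each do |dsl_method, method|
    define_method(dsl_method) do |*args, &block|
      Backend.send(method, *args, &block)
    end
  end
end

class Spec
  include ShortNames
end

spec = Spec.new
puts spec.check_site("http://localhost:3000/app", "http://localhost:3000")
p spec.collected_problems
```

The sketch mixes the module into a single class. The diff instead calls `include` at the top level, which makes the short names available on every object.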
data/lib/site_checker/io/content_from_file_system.rb
ADDED
@@ -0,0 +1,43 @@
+module SiteChecker
+  module IO
+    class ContentFromFileSystem
+
+      def initialize(visit_references, root)
+        @visit_references = visit_references
+        @root = root
+      end
+
+      def get(link)
+        begin
+          location = create_absolute_reference(link.url)
+          if link.local_page?
+            content = File.open(add_index_html(location)).read
+          elsif link.local_image?
+            File.open(location)
+          elsif @visit_references
+            open(link.url)
+          end
+        rescue Errno::ENOENT => e
+          raise "(404 Not Found)"
+        rescue => e
+          raise "(#{e.message.strip})"
+        end
+        content
+      end
+
+      private
+      def add_index_html(path)
+        path = $1 if path.match(/(.+)#/)
+        path.end_with?(".html") ? path : File.join(path, "index.html")
+      end
+
+      def create_absolute_reference(link)
+        if !link.eql?(@root)
+          File.join(@root, link)
+        else
+          @root
+        end
+      end
+    end
+  end
+end
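Reviewer note: `add_index_html` decides which file on disk stands for a page link. The sketch below restates that mapping as a standalone function so the cases are easy to try; `index_path_for` is a hypothetical name, not part of the gem.

```ruby
# Hypothetical standalone version of the link-to-file mapping.
def index_path_for(path)
  # Drop a "#fragment" suffix: a fragment never names a file on disk.
  path = path[/(.+)#/, 1] || path
  # A link that does not end in ".html" is treated as a directory.
  path.end_with?(".html") ? path : File.join(path, "index.html")
end

puts index_path_for("public/about")           # public/about/index.html
puts index_path_for("public/blog/")           # public/blog/index.html
puts index_path_for("public/team.html#staff") # public/team.html
```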
data/lib/site_checker/io/content_from_web.rb
ADDED
@@ -0,0 +1,36 @@
+module SiteChecker
+  module IO
+    class ContentFromWeb
+
+      def initialize(visit_references, root)
+        @visit_references = visit_references
+        @root = root
+      end
+
+      def get(link)
+        begin
+          uri = create_absolute_reference(link.url)
+          if link.local_page?
+            content = open(uri)
+          elsif link.local_image?
+            open(uri)
+          elsif @visit_references
+            open(uri)
+          end
+        rescue => e
+          raise "(#{e.message.strip})"
+        end
+        content
+      end
+
+      private
+      def create_absolute_reference(link)
+        if link.start_with?(@root)
+          URI(link)
+        else
+          URI(@root).merge(link)
+        end
+      end
+    end
+  end
+end
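Reviewer note: `create_absolute_reference` relies on `URI#merge` from Ruby's standard library to resolve a relative link against the root. The sketch below exercises that resolution on its own; `absolute_reference` is a hypothetical helper name, and the URLs are sample values only.

```ruby
require 'uri'

# Hypothetical standalone version of the resolution step.
def absolute_reference(root, link)
  if link.start_with?(root)
    URI(link)             # already absolute and under the root
  else
    URI(root).merge(link) # resolve relative to the root
  end
end

puts absolute_reference("http://localhost:3000", "/app")
# http://localhost:3000/app
puts absolute_reference("http://localhost:3000/app/", "images/logo.png")
# http://localhost:3000/app/images/logo.png
puts absolute_reference("http://localhost:3000", "http://localhost:3000/about")
# http://localhost:3000/about
```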
data/lib/site_checker/link.rb
ADDED
@@ -0,0 +1,60 @@
+module SiteChecker
+  class Link
+    attr_accessor :url
+    attr_accessor :parent_url
+    attr_accessor :kind
+    attr_accessor :location
+    attr_accessor :problem
+
+    def eql?(other)
+      ignore_trailing_slash(@url).eql? ignore_trailing_slash(other.url)
+    end
+
+    def ==(other)
+      eql?(other)
+    end
+
+    def hash
+      ignore_trailing_slash(@url).hash
+    end
+
+    def self.create(attrs)
+      link = Link.new
+      attrs.each do |key, value|
+        if self.instance_methods.include?("#{key}=".to_sym)
+          eval("link.#{key}=value")
+        end
+      end
+      link
+    end
+
+    def has_problem?
+      @problem != nil
+    end
+
+    def local_page?
+      @location == :local && @kind == :page
+    end
+
+    def local_image?
+      @location == :local && @kind == :image
+    end
+
+    def anchor?
+      @kind == :anchor
+    end
+
+    def anchor_ref?
+      @kind == :anchor_ref
+    end
+
+    def anchor_related?
+      anchor? || anchor_ref?
+    end
+
+    private
+    def ignore_trailing_slash(url)
+      url.gsub(/^\//,"")
+    end
+  end
+end
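Reviewer note: `Link` overrides `eql?`, `==` and `hash` together, which is what lets `Array#include?` and `Array#uniq` treat two links with equivalent URLs as one. The sketch below shows that contract with a hypothetical `Ref` class. It mirrors the pattern in the diff, which strips a leading slash even though the helper is named `ignore_trailing_slash`.

```ruby
# Hypothetical value object compared by a normalized URL.
class Ref
  attr_reader :url

  def initialize(url)
    @url = url
  end

  # eql? and hash must agree, otherwise uniq and Hash keys misbehave.
  def eql?(other)
    normalized == other.normalized
  end
  alias_method :==, :eql?

  def hash
    normalized.hash
  end

  protected

  # Same normalization as the pattern in the diff: drop one leading slash.
  def normalized
    url.sub(/\A\//, "")
  end
end

refs = [Ref.new("/about"), Ref.new("about"), Ref.new("/contact")]
p refs.uniq.map(&:url)              # ["/about", "/contact"]
p refs.include?(Ref.new("contact")) # true
```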
data/lib/site_checker/link_collector.rb
ADDED
@@ -0,0 +1,153 @@
+module SiteChecker
+  class LinkCollector
+    attr_accessor :ignore_list, :visit_references, :max_recursion_depth
+
+    def initialize
+      yield self if block_given?
+      @ignore_list ||= []
+      @visit_references ||= false
+      @max_recursion_depth ||= -1
+    end
+
+    def check(url, root)
+      @links = []
+      @recursion_depth = 0
+      @root = root
+
+      @content_reader = get_content_reader
+
+      link = Link.create({:url => url, :kind => :page, :location => :local})
+      register_visit(link)
+      process_local_page(link)
+      evaluate_anchors
+    end
+
+    def local_pages
+      get_urls(:local, :page)
+    end
+
+    def remote_pages
+      get_urls(:remote, :page)
+    end
+
+    def local_images
+      get_urls(:local, :image)
+    end
+
+    def remote_images
+      get_urls(:remote, :image)
+    end
+
+    def problems
+      problems = {}
+      @links.each do |link|
+        if link.has_problem?
+          problems[link.parent_url] ||= []
+          problems[link.parent_url] << "#{link.url} #{link.problem}"
+        end
+      end
+      problems
+    end
+
+    private
+    def get_content_reader
+      if URI(@root).absolute?
+        SiteChecker::IO::ContentFromWeb.new(@visit_references, @root)
+      else
+        SiteChecker::IO::ContentFromFileSystem.new(@visit_references, @root)
+      end
+    end
+
+    def get_urls(location, kind)
+      @links.find_all do |link|
+        if link.location == location && link.kind == kind
+          link
+        end
+      end.map do |link|
+        link.url
+      end
+    end
+
+    def process_local_page(parent)
+      links = collect_links(parent)
+
+      links.each do |link|
+        link.parent_url = parent.url
+        unless link.anchor_related?
+          visit(link) unless visited?(link)
+        else
+          @links << link
+        end
+      end
+    end
+
+    def register_visit(link)
+      @links << link unless visited?(link)
+    end
+
+    def visited?(link)
+      @links.include?(link)
+    end
+
+    def visit(link)
+      register_visit(link)
+      unless link.local_page?
+        open_reference(link)
+      else
+        unless stop_recursion?
+          @recursion_depth += 1
+          process_local_page(link)
+          @recursion_depth -= 1
+        end
+      end
+    end
+
+    def open_reference(link)
+      content = nil
+      begin
+        content = @content_reader.get(link)
+      rescue => e
+        link.problem = "#{e.message.strip}"
+      end
+      content
+    end
+
+    def collect_links(link)
+      content = open_reference(link)
+      return SiteChecker::Parse::Page.parse(content, @ignore_list, @root)
+    end
+
+    def stop_recursion?
+      if @max_recursion_depth == -1
+        false
+      elsif @max_recursion_depth > @recursion_depth
+        false
+      else
+        true
+      end
+    end
+
+    def evaluate_anchors
+      anchors = @links.find_all {|link| link.anchor?}
+      anchor_references = @links.find_all {|link| link.anchor_ref?}
+      anchor_references.each do |anchor_ref|
+        if find_matching_anchor(anchors, anchor_ref).empty?
+          anchor_ref.problem = "(404 Not Found)"
+        end
+      end
+    end
+
+    def find_matching_anchor(anchors, anchor_ref)
+      result = []
+      anchors.each do |anchor|
+        if (anchor.parent_url == anchor_ref.parent_url &&
+          anchor_ref.url == "##{anchor.url}") ||
+          (anchor.parent_url != anchor_ref.parent_url &&
+          anchor_ref.url == "#{anchor.parent_url}##{anchor.url}")
+          result << anchor
+        end
+      end
+      result
+    end
+  end
+end
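Reviewer note: the interplay of `visit`, `process_local_page` and `stop_recursion?` is easiest to see on a toy graph. The sketch below reproduces the depth rule (a page is always registered, but its own links are followed only while the current depth is below the limit, with `-1` meaning no limit). It runs on a hypothetical in-memory site map instead of real pages; `crawl`, `follow` and `stop_recursion?` as free functions are assumptions of the sketch, not the gem's API.

```ruby
# Stop when a limit is set and the current depth has reached it.
def stop_recursion?(max_depth, depth)
  max_depth != -1 && depth >= max_depth
end

# Register every linked page; descend into it only if the limit allows.
def follow(site, page, max_depth, depth, visited)
  site.fetch(page, []).each do |child|
    next if visited.include?(child)
    visited << child
    next if stop_recursion?(max_depth, depth)
    follow(site, child, max_depth, depth + 1, visited)
  end
end

def crawl(site, start, max_depth)
  visited = [start]
  follow(site, start, max_depth, 0, visited)
  visited
end

# Hypothetical site: each page maps to the pages it links to.
site = {
  "/"         => ["/a", "/b"],
  "/a"        => ["/a/deep"],
  "/a/deep"   => ["/a/deeper"],
  "/a/deeper" => [],
  "/b"        => []
}

p crawl(site, "/", 0)  # ["/", "/a", "/b"]
p crawl(site, "/", 1)  # ["/", "/a", "/a/deep", "/b"]
p crawl(site, "/", -1) # ["/", "/a", "/a/deep", "/a/deeper", "/b"]
```

With a limit of `n`, pages up to `n + 1` links away from the start are still recorded, because registration happens before the depth check.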
data/lib/site_checker/parse/page.rb
ADDED
@@ -0,0 +1,82 @@
+module SiteChecker
+  module Parse
+    class Page
+      def self.parse(content, ignore_list, root)
+        links = []
+        page = Nokogiri(content)
+
+        links.concat(get_links(page, ignore_list, root))
+        links.concat(get_images(page, ignore_list, root))
+        links.concat(get_anchors(page))
+        links.concat(local_pages_which_has_anchor_references(links, root))
+
+        links.uniq
+      end
+
+      private
+      def self.get_links(page, ignore_list, root)
+        links = []
+        page.xpath("//a").reject {|a| ignored?(ignore_list, a['href'])}.each do |a|
+          if a['href'].match(/(.*)#.+/) && !URI($1).absolute?
+            kind = :anchor_ref
+          else
+            kind = :page
+          end
+          links << Link.create({:url => a['href'], :kind => kind})
+        end
+        set_location(links, root)
+      end
+
+      def self.get_images(page, ignore_list, root)
+        links = []
+        page.xpath("//img").reject {|img| ignored?(ignore_list, img['src'])}.each do |img|
+          links << Link.create({:url => img['src'], :kind => :image})
+        end
+        set_location(links, root)
+      end
+
+      def self.set_location(links, root)
+        links.each do |link|
+          uri = URI(link.url)
+          if uri.to_s.start_with?(root)
+            link.problem = "(absolute path)"
+            link.location = :local
+          else
+            if uri.absolute?
+              link.location = :remote
+            else
+              link.location = :local
+            end
+          end
+        end
+      end
+
+      def self.ignored?(ignore_list, link)
+        if link
+          ignore_list.include? link
+        else
+          true
+        end
+      end
+
+      def self.get_anchors(page)
+        anchors = []
+        page.xpath("//a").reject {|a| !a['id']}.each do |a|
+          anchors << Link.create({:url => a['id'], :kind => :anchor})
+        end
+        anchors
+      end
+
+      def self.local_pages_which_has_anchor_references(links, root)
+        new_links = []
+        links.find_all {|link| link.anchor_ref?}.each do |link|
+          uri = URI(link.url)
+          if link.url.match(/(.+)#/)
+            new_links << Link.create({:url => $1, :kind => :page})
+          end
+        end
+        set_location(new_links, root)
+      end
+    end
+  end
+end
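Reviewer note: `get_links` and `set_location` together assign each `href` a kind (`:page` or `:anchor_ref`) and a location (`:local` or `:remote`). The sketch below condenses the two decisions into one function that needs only the standard `uri` library; `classify` is a hypothetical name, and the sketch leaves out the `"(absolute path)"` problem that the diff also records for links starting with the root.

```ruby
require 'uri'

# Hypothetical condensed version of the kind/location rules.
def classify(href, root)
  # Text before a non-empty "#fragment", or nil when there is no fragment.
  prefix = href[/(.*)#.+/, 1]
  kind = (prefix && !URI(prefix).absolute?) ? :anchor_ref : :page

  location =
    if href.start_with?(root)
      :local  # written as an absolute URL, but it points into the site
    elsif URI(href).absolute?
      :remote
    else
      :local
    end

  [kind, location]
end

root = "http://localhost:3000"
p classify("#top", root)                      # [:anchor_ref, :local]
p classify("about.html#team", root)           # [:anchor_ref, :local]
p classify("/contact", root)                  # [:page, :local]
p classify("http://example.com/page#x", root) # [:page, :remote]
p classify("http://localhost:3000/app", root) # [:page, :local]
```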