site_checker 0.1.1 → 0.2.0.pre

data/.rbenv-version ADDED
@@ -0,0 +1 @@
+ 1.9.3-p125
data/.rspec ADDED
@@ -0,0 +1 @@
+ --color
data/History.md ADDED
@@ -0,0 +1,9 @@
+ ## [v0.1.1](https://github.com/ZsoltFabok/site_checker/compare/v0.1.0...v0.1.1)
+
+ ### Fixes
+ * better dependency description in the gemspec (Zsolt Fabok)
+
+ ## [v0.1.0](https://github.com/ZsoltFabok/site_checker/tree/v0.1.0)
+
+ ### Notes
+ First version
data/LICENSE ADDED
@@ -0,0 +1,29 @@
+ New BSD License (3-clause license)
+
+ Copyright (c) 2012, Zsolt Fabok
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+
+ * Redistributions in binary form must reproduce the above copyright
+   notice, this list of conditions and the following disclaimer in the
+   documentation and/or other materials provided with the distribution.
+
+ * Neither the name of Zsolt Fabok nor the
+   names of its contributors may be used to endorse or promote products
+   derived from this software without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER BE LIABLE FOR ANY
+ DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
data/README.md ADDED
@@ -0,0 +1,102 @@
+ ### Site Checker [![Code Climate](https://codeclimate.com/badge.png)](https://codeclimate.com/github/ZsoltFabok/site_checker) [![Build Status](https://travis-ci.org/ZsoltFabok/site_checker.png)](https://travis-ci.org/ZsoltFabok/site_checker) [![Dependency Status](https://gemnasium.com/ZsoltFabok/site_checker.png)](https://gemnasium.com/ZsoltFabok/site_checker)
+
+
+ Site Checker is a simple Ruby gem that helps you check the integrity of your website by recursively visiting the referenced pages and images. I use it in my test environments to make sure that my websites don't have any dead links.
+
+ ### Install
+
+     gem install site_checker
+
+ ### Usage
+
+ #### Default
+
+ First, load `site_checker` by adding this line to the file where you would like to use it:
+
+     require 'site_checker'
+
+ If you want to use it for testing, the line should go in `test_helper.rb`.
+
+ The usage is quite simple:
+
+     check_site("http://localhost:3000/app", "http://localhost:3000")
+     puts collected_remote_pages.inspect
+     puts collected_local_pages.inspect
+     puts collected_remote_images.inspect
+     puts collected_local_images.inspect
+     puts collected_problems.inspect
+
+ The snippet above opens the `http://localhost:3000/app` link and looks for links and images. If it finds a link to a local page, it recursively checks that page, too. The second argument, `http://localhost:3000`, defines the starting reference (root) of your website.
+
+ If you don't want to use the DSL-like API, you can still do the following:
+
+     SiteChecker.check("http://localhost:3000/app", "http://localhost:3000")
+     puts SiteChecker.remote_pages.inspect
+     puts SiteChecker.local_pages.inspect
+     puts SiteChecker.remote_images.inspect
+     puts SiteChecker.local_images.inspect
+     puts SiteChecker.problems.inspect
+
+ ### Using on Generated Content
+ If you have a static website (e.g. generated by [octopress](https://github.com/imathis/octopress)), you can tell `site_checker` to use folders from the file system. With this approach, you don't need a web server to verify your website:
+
+     check_site("./public", "./public")
+     puts collected_problems.inspect
+
+ ### Configuration
+ You can instruct `site_checker` to ignore certain links:
+
+     SiteChecker.configure do |config|
+       config.ignore_list = ["/", "/atom.xml"]
+     end
+
+ By default, it won't check the status of remote links and images (e.g. 404 or 500), but you can change that like this:
+
+     SiteChecker.configure do |config|
+       config.visit_references = true
+     end
+
+ Deep recursion can be expensive, so you can configure the maximum recursion depth with the following attribute:
+
+     SiteChecker.configure do |config|
+       config.max_recursion_depth = 3
+     end
+
+ ### Examples
+ Make sure that there are no local dead links on the website (I'm using [rspec](https://github.com/rspec/rspec) syntax):
+
+     before(:each) do
+       SiteChecker.configure do |config|
+         config.ignore_list = ["/atom.xml", "/rss"]
+       end
+     end
+
+     it "should not have dead local links" do
+       check_site("http://localhost:3000", "http://localhost:3000")
+       # on failure, rspec prints the collected problems, so there is no need to re-run with a print statement
+       collected_problems.should be_empty
+     end
+
+ Check that all the local pages can be reached in at most two steps:
+
+     before(:each) do
+       SiteChecker.configure do |config|
+         config.ignore_list = ["/atom.xml", "/rss"]
+         config.max_recursion_depth = 2
+       end
+
+       @number_of_local_pages = 100
+     end
+
+     it "all the local pages have to be visited" do
+       check_site("http://localhost:3000", "http://localhost:3000")
+       collected_local_pages.size.should eq @number_of_local_pages
+     end
+
+ ### Troubleshooting
+ #### undefined method 'new' for SiteChecker:Module
+ This error occurs when the test code calls v0.1.1 methods, but a newer version of the gem has already been installed. Update your test code following the examples above.
+
+ ### Copyright
+
+ Copyright (c) 2012 Zsolt Fabok and Contributors. See LICENSE for details.
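Review note: the README prints `collected_problems` with `inspect`. Judging from `LinkCollector#problems` further down in this diff, the value is a hash that maps a parent page URL to a list of `"<link> <problem>"` strings. A minimal sketch of a more readable report, assuming that shape; `format_problems` and the sample data are hypothetical and not part of the gem:

```ruby
# Hypothetical sample, shaped like the hash built by LinkCollector#problems:
# parent page URL => list of "<link> <problem>" strings.
sample_problems = {
  "http://localhost:3000/app" => ["/missing (404 Not Found)", "/img/logo.png (404 Not Found)"],
  "/about" => ["#team (404 Not Found)"]
}

# format_problems is a made-up helper, not a site_checker method.
# It lists every parent page followed by its indented problem entries.
def format_problems(problems)
  problems.map do |parent, entries|
    ([parent] + entries.map { |entry| "  - #{entry}" }).join("\n")
  end.join("\n")
end

puts format_problems(sample_problems)
```

In a spec, the same string could be passed as a custom failure message, so a failing run shows one problem per line.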
@@ -0,0 +1,6 @@
+ require 'rspec/core/rake_task'
+
+ RSpec::Core::RakeTask.new(:spec)
+
+ desc "By default run the test cases"
+ task :default => :spec
@@ -0,0 +1,6 @@
+ require 'yard'
+ require 'yard/rake/yardoc_task'
+
+ YARD::Rake::YardocTask.new do |t|
+   t.files = ['lib/**/*.rb']
+ end
@@ -0,0 +1,17 @@
+ module SiteChecker
+   module DSL
+     { :check_site => :check,
+       :collected_local_pages => :local_pages,
+       :collected_remote_pages => :remote_pages,
+       :collected_local_images => :local_images,
+       :collected_remote_images => :remote_images,
+       :collected_problems => :problems
+     }.each do |dsl_method, method|
+       define_method dsl_method do |*args, &block|
+         SiteChecker.send method, *args, &block
+       end
+     end
+   end
+ end
+
+ include SiteChecker::DSL
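Review note: the DSL module above generates one delegating method per hash entry with `define_method`. The sketch below replays that pattern against a stand-in so it can be run without the gem. `FakeChecker`, `FakeDSL` and `Example` are hypothetical names. Unlike the diff, which includes the DSL at the top level, the sketch mixes it into a dedicated class; that is a variation chosen for the example, not what the gem does.

```ruby
# FakeChecker stands in for SiteChecker; it only records what it was asked to do.
module FakeChecker
  def self.check(url, root)
    @last_check = [url, root]
  end

  def self.last_check
    @last_check
  end

  def self.problems
    {}
  end
end

# Same generation pattern as in the diff: each DSL name forwards its
# arguments (and an optional block) to the method it is mapped to.
module FakeDSL
  { :check_site => :check,
    :collected_problems => :problems
  }.each do |dsl_method, method|
    define_method dsl_method do |*args, &block|
      FakeChecker.send method, *args, &block
    end
  end
end

class Example
  include FakeDSL
end

example = Example.new
example.check_site("./public", "./public")
p FakeChecker.last_check
p example.collected_problems
```

Including the module at the top level, as the diff does, makes the generated methods callable from any object, which is what lets the README call `check_site` directly inside a spec.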
@@ -0,0 +1,43 @@
+ module SiteChecker
+   module IO
+     class ContentFromFileSystem
+
+       def initialize(visit_references, root)
+         @visit_references = visit_references
+         @root = root
+       end
+
+       def get(link)
+         begin
+           location = create_absolute_reference(link.url)
+           if link.local_page?
+             content = File.read(add_index_html(location))
+           elsif link.local_image?
+             File.open(location).close
+           elsif @visit_references
+             open(link.url)
+           end
+         rescue Errno::ENOENT => e
+           raise "(404 Not Found)"
+         rescue => e
+           raise "(#{e.message.strip})"
+         end
+         content
+       end
+
+       private
+       def add_index_html(path)
+         path = $1 if path.match(/(.+)#/)
+         path.end_with?(".html") ? path : File.join(path, "index.html")
+       end
+
+       def create_absolute_reference(link)
+         if !link.eql?(@root)
+           File.join(@root, link)
+         else
+           @root
+         end
+       end
+     end
+   end
+ end
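Review note: the two private helpers in `ContentFromFileSystem` decide which file on disk stands for a link. The sketch below restates them as stand-alone functions to show the mapping. The function names and the sample paths are hypothetical; the logic is intended to mirror the diff, with the `$1` global replaced by an equivalent `String#[]` lookup.

```ruby
# Join the link onto the root folder, unless the link is the root itself.
def file_reference(root, link)
  link.eql?(root) ? root : File.join(root, link)
end

# Drop a "#fragment" suffix, then point folder-style URLs at their index.html.
def index_html_for(path)
  path = path[/(.+)#/, 1] || path
  path.end_with?(".html") ? path : File.join(path, "index.html")
end

root = "./public"
["./public", "/about", "/posts/a.html#top"].each do |link|
  puts "#{link} -> #{index_html_for(file_reference(root, link))}"
end
```

`File.join` collapses the duplicate separator between `./public` and `/about`, so links with a leading slash resolve inside the root folder.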
@@ -0,0 +1,36 @@
+ module SiteChecker
+   module IO
+     class ContentFromWeb
+
+       def initialize(visit_references, root)
+         @visit_references = visit_references
+         @root = root
+       end
+
+       def get(link)
+         begin
+           uri = create_absolute_reference(link.url)
+           if link.local_page?
+             content = open(uri)
+           elsif link.local_image?
+             open(uri)
+           elsif @visit_references
+             open(uri)
+           end
+         rescue => e
+           raise "(#{e.message.strip})"
+         end
+         content
+       end
+
+       private
+       def create_absolute_reference(link)
+         if link.start_with?(@root)
+           URI(link)
+         else
+           URI(@root).merge(link)
+         end
+       end
+     end
+   end
+ end
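Review note: `create_absolute_reference` in `ContentFromWeb` relies on `URI#merge` to resolve a link against the root. A small sketch of that resolution using only the standard library; `web_reference` is a hypothetical stand-alone copy of the private method, and the URLs are sample values:

```ruby
require 'uri'

# Links that already start with the root are used as-is;
# everything else is resolved against the root.
def web_reference(root, link)
  link.start_with?(root) ? URI(link) : URI(root).merge(link)
end

root = "http://localhost:3000"
puts web_reference(root, "/about")                    # root-relative link
puts web_reference(root, "http://localhost:3000/app") # already absolute and local
puts web_reference(root, "http://example.com/a")      # remote link stays remote
```

Because merging an absolute reference returns that reference unchanged, the same helper appears to serve remote links when `visit_references` is enabled.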
@@ -0,0 +1,60 @@
+ module SiteChecker
+   class Link
+     attr_accessor :url
+     attr_accessor :parent_url
+     attr_accessor :kind
+     attr_accessor :location
+     attr_accessor :problem
+
+     def eql?(other)
+       ignore_trailing_slash(@url).eql? ignore_trailing_slash(other.url)
+     end
+
+     def ==(other)
+       eql?(other)
+     end
+
+     def hash
+       ignore_trailing_slash(@url).hash
+     end
+
+     def self.create(attrs)
+       link = Link.new
+       attrs.each do |key, value|
+         if self.instance_methods.include?("#{key}=".to_sym)
+           link.send("#{key}=", value)
+         end
+       end
+       link
+     end
+
+     def has_problem?
+       @problem != nil
+     end
+
+     def local_page?
+       @location == :local && @kind == :page
+     end
+
+     def local_image?
+       @location == :local && @kind == :image
+     end
+
+     def anchor?
+       @kind == :anchor
+     end
+
+     def anchor_ref?
+       @kind == :anchor_ref
+     end
+
+     def anchor_related?
+       anchor? || anchor_ref?
+     end
+
+     private
+     def ignore_trailing_slash(url)
+       url.gsub(/^\//,"")
+     end
+   end
+ end
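Review note: `Link` overrides `eql?`, `==` and `hash` so that `Array#include?` and `Array#uniq` treat two links with the same normalized URL as one. The sketch below reproduces that contract with a reduced class; `MiniLink` is hypothetical. It also shows what the pattern `/^\//` does as written: despite the helper's name in the diff, it removes a leading slash and leaves a trailing slash in place.

```ruby
class MiniLink
  attr_reader :url

  def initialize(url)
    @url = url
  end

  # eql? and hash must agree, otherwise Array#uniq and Hash lookups misbehave.
  def eql?(other)
    normalize(url).eql?(normalize(other.url))
  end
  alias_method :==, :eql?

  def hash
    normalize(url).hash
  end

  private

  # Same pattern as in the diff: strips a slash at the start of the string.
  def normalize(url)
    url.gsub(/^\//, "")
  end
end

visited = [MiniLink.new("/about")]
puts visited.include?(MiniLink.new("about"))   # leading slash is ignored
puts visited.include?(MiniLink.new("about/"))  # trailing slash still matters

urls = [MiniLink.new("/about"), MiniLink.new("about"), MiniLink.new("about/")]
p urls.uniq.map(&:url)
```

If ignoring a trailing slash is the intent, a pattern such as `%r{/\z}` would do that; whether the gem wants that behavior is for the author to confirm.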
@@ -0,0 +1,153 @@
+ module SiteChecker
+   class LinkCollector
+     attr_accessor :ignore_list, :visit_references, :max_recursion_depth
+
+     def initialize
+       yield self if block_given?
+       @ignore_list ||= []
+       @visit_references ||= false
+       @max_recursion_depth ||= -1
+     end
+
+     def check(url, root)
+       @links = []
+       @recursion_depth = 0
+       @root = root
+
+       @content_reader = get_content_reader
+
+       link = Link.create({:url => url, :kind => :page, :location => :local})
+       register_visit(link)
+       process_local_page(link)
+       evaluate_anchors
+     end
+
+     def local_pages
+       get_urls(:local, :page)
+     end
+
+     def remote_pages
+       get_urls(:remote, :page)
+     end
+
+     def local_images
+       get_urls(:local, :image)
+     end
+
+     def remote_images
+       get_urls(:remote, :image)
+     end
+
+     def problems
+       problems = {}
+       @links.each do |link|
+         if link.has_problem?
+           problems[link.parent_url] ||= []
+           problems[link.parent_url] << "#{link.url} #{link.problem}"
+         end
+       end
+       problems
+     end
+
+     private
+     def get_content_reader
+       if URI(@root).absolute?
+         SiteChecker::IO::ContentFromWeb.new(@visit_references, @root)
+       else
+         SiteChecker::IO::ContentFromFileSystem.new(@visit_references, @root)
+       end
+     end
+
+     def get_urls(location, kind)
+       @links.find_all do |link|
+         if link.location == location && link.kind == kind
+           link
+         end
+       end.map do |link|
+         link.url
+       end
+     end
+
+     def process_local_page(parent)
+       links = collect_links(parent)
+
+       links.each do |link|
+         link.parent_url = parent.url
+         unless link.anchor_related?
+           visit(link) unless visited?(link)
+         else
+           @links << link
+         end
+       end
+     end
+
+     def register_visit(link)
+       @links << link unless visited?(link)
+     end
+
+     def visited?(link)
+       @links.include?(link)
+     end
+
+     def visit(link)
+       register_visit(link)
+       unless link.local_page?
+         open_reference(link)
+       else
+         unless stop_recursion?
+           @recursion_depth += 1
+           process_local_page(link)
+           @recursion_depth -= 1
+         end
+       end
+     end
+
+     def open_reference(link)
+       content = nil
+       begin
+         content = @content_reader.get(link)
+       rescue => e
+         link.problem = "#{e.message.strip}"
+       end
+       content
+     end
+
+     def collect_links(link)
+       content = open_reference(link)
+       return SiteChecker::Parse::Page.parse(content, @ignore_list, @root)
+     end
+
+     def stop_recursion?
+       if @max_recursion_depth == -1
+         false
+       elsif @max_recursion_depth > @recursion_depth
+         false
+       else
+         true
+       end
+     end
+
+     def evaluate_anchors
+       anchors = @links.find_all {|link| link.anchor?}
+       anchor_references = @links.find_all {|link| link.anchor_ref?}
+       anchor_references.each do |anchor_ref|
+         if find_matching_anchor(anchors, anchor_ref).empty?
+           anchor_ref.problem = "(404 Not Found)"
+         end
+       end
+     end
+
+     def find_matching_anchor(anchors, anchor_ref)
+       result = []
+       anchors.each do |anchor|
+         if (anchor.parent_url == anchor_ref.parent_url &&
+           anchor_ref.url == "##{anchor.url}") ||
+           (anchor.parent_url != anchor_ref.parent_url &&
+           anchor_ref.url == "#{anchor.parent_url}##{anchor.url}")
+           result << anchor
+         end
+       end
+       result
+     end
+   end
+ end
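Review note: the interplay of `visit`, `process_local_page` and `stop_recursion?` is easier to follow on a toy graph. The sketch below applies the same depth rule to an in-memory hash instead of fetched pages. `DepthLimitedWalker` and `GRAPH` are hypothetical, and only the traversal is modelled (no images, anchors or problem reporting).

```ruby
class DepthLimitedWalker
  def initialize(graph, max_depth = -1)
    @graph = graph          # page => list of linked pages
    @max_depth = max_depth  # -1 means unlimited, as in the diff
    @depth = 0
    @visited = []
  end

  def walk(start)
    @visited << start
    process(start)
    @visited
  end

  private

  # A child is registered first and only then expanded, so pages found
  # at the depth limit are recorded but their own links are not followed.
  def process(page)
    @graph.fetch(page, []).each do |child|
      next if @visited.include?(child)
      @visited << child
      next if stop_recursion?
      @depth += 1
      process(child)
      @depth -= 1
    end
  end

  def stop_recursion?
    @max_depth != -1 && @max_depth <= @depth
  end
end

GRAPH = {
  "/"  => ["/a"],
  "/a" => ["/b"],
  "/b" => ["/c"],
  "/c" => ["/d"]
}

p DepthLimitedWalker.new(GRAPH).walk("/")
p DepthLimitedWalker.new(GRAPH, 1).walk("/")
```

Under this rule a limit of `n` expands pages up to `n` links away from the start and still records the pages they link to, which may be worth stating in the README's description of `max_recursion_depth`.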
@@ -0,0 +1,82 @@
+ module SiteChecker
+   module Parse
+     class Page
+       def self.parse(content, ignore_list, root)
+         links = []
+         page = Nokogiri(content)
+
+         links.concat(get_links(page, ignore_list, root))
+         links.concat(get_images(page, ignore_list, root))
+         links.concat(get_anchors(page))
+         links.concat(local_pages_which_has_anchor_references(links, root))
+
+         links.uniq
+       end
+
+       private
+       def self.get_links(page, ignore_list, root)
+         links = []
+         page.xpath("//a").reject {|a| ignored?(ignore_list, a['href'])}.each do |a|
+           if a['href'].match(/(.*)#.+/) && !URI($1).absolute?
+             kind = :anchor_ref
+           else
+             kind = :page
+           end
+           links << Link.create({:url => a['href'], :kind => kind})
+         end
+         set_location(links, root)
+       end
+
+       def self.get_images(page, ignore_list, root)
+         links = []
+         page.xpath("//img").reject {|img| ignored?(ignore_list, img['src'])}.each do |img|
+           links << Link.create({:url => img['src'], :kind => :image})
+         end
+         set_location(links, root)
+       end
+
+       def self.set_location(links, root)
+         links.each do |link|
+           uri = URI(link.url)
+           if uri.to_s.start_with?(root)
+             link.problem = "(absolute path)"
+             link.location = :local
+           else
+             if uri.absolute?
+               link.location = :remote
+             else
+               link.location = :local
+             end
+           end
+         end
+       end
+
+       def self.ignored?(ignore_list, link)
+         if link
+           ignore_list.include? link
+         else
+           true
+         end
+       end
+
+       def self.get_anchors(page)
+         anchors = []
+         page.xpath("//a").reject {|a| !a['id']}.each do |a|
+           anchors << Link.create({:url => a['id'], :kind => :anchor})
+         end
+         anchors
+       end
+
+       def self.local_pages_which_has_anchor_references(links, root)
+         new_links = []
+         links.find_all {|link| link.anchor_ref?}.each do |link|
+           uri = URI(link.url)
+           if link.url.match(/(.+)#/)
+             new_links << Link.create({:url => $1, :kind => :page})
+           end
+         end
+         set_location(new_links, root)
+       end
+     end
+   end
+ end
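Review note: `set_location` sorts every collected link into local or remote and flags local links that spell out the full root. A stand-alone sketch of that decision, using only the standard library; `classify` is a hypothetical function that returns a hash instead of mutating a `Link`, and the URLs are sample values:

```ruby
require 'uri'

# Mirrors the branching in set_location:
# 1. starts with the root      -> local, flagged as "(absolute path)"
# 2. absolute (has a scheme)   -> remote
# 3. anything else             -> local
def classify(url, root)
  uri = URI(url)
  if uri.to_s.start_with?(root)
    { :location => :local, :problem => "(absolute path)" }
  elsif uri.absolute?
    { :location => :remote, :problem => nil }
  else
    { :location => :local, :problem => nil }
  end
end

root = "http://localhost:3000"
["http://localhost:3000/about", "http://example.com/", "/about"].each do |url|
  puts "#{url}: #{classify(url, root).inspect}"
end
```

The first case means a page that links to itself with a fully qualified URL is reported as a problem even though the target exists, which seems to be a deliberate style check in the diff.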