site_checker 0.1.1 → 0.2.0.pre

data/.rbenv-version ADDED
@@ -0,0 +1 @@
1
+ 1.9.3-p125
data/.rspec ADDED
@@ -0,0 +1 @@
1
+ --color
data/History.md ADDED
@@ -0,0 +1,9 @@
1
+ ## [v0.1.1](https://github.com/ZsoltFabok/site_checker/compare/v0.1.0...v0.1.1)
2
+
3
+ ### Fixes
4
+ * better dependency description in the gemspec (Zsolt Fabok)
5
+
6
+ ## [v0.1.0](https://github.com/ZsoltFabok/site_checker/tree/v0.1.0)
7
+
8
+ ### Notes
9
+ First version
data/LICENSE ADDED
@@ -0,0 +1,29 @@
1
+ New BSD License (3-clause license)
2
+
3
+ Copyright (c) 2012, Zsolt Fabok
4
+ All rights reserved.
5
+
6
+ Redistribution and use in source and binary forms, with or without
7
+ modification, are permitted provided that the following conditions are met:
8
+
9
+ * Redistributions of source code must retain the above copyright
10
+ notice, this list of conditions and the following disclaimer.
11
+
12
+ * Redistributions in binary form must reproduce the above copyright
13
+ notice, this list of conditions and the following disclaimer in the
14
+ documentation and/or other materials provided with the distribution.
15
+
16
+ * Neither the name of Zsolt Fabok nor the
17
+ names of its contributors may be used to endorse or promote products
18
+ derived from this software without specific prior written permission.
19
+
20
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
21
+ ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
22
+ WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER BE LIABLE FOR ANY
24
+ DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
25
+ (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
26
+ LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
27
+ ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
28
+ (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
29
+ SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
data/README.md ADDED
@@ -0,0 +1,102 @@
1
+ ### Site Checker [![Code Climate](https://codeclimate.com/badge.png)](https://codeclimate.com/github/ZsoltFabok/site_checker) [![Build Status](https://travis-ci.org/ZsoltFabok/site_checker.png)](https://travis-ci.org/ZsoltFabok/site_checker) [![Dependency Status](https://gemnasium.com/ZsoltFabok/site_checker.png)](https://gemnasium.com/ZsoltFabok/site_checker)
2
+
3
+
4
+ Site Checker is a simple Ruby gem that helps you check the integrity of your website by recursively visiting the referenced pages and images. I use it in my test environments to make sure that my websites don't have any dead links.
5
+
6
+ ### Install
7
+
8
+ gem install site_checker
9
+
10
+ ### Usage
11
+
12
+ #### Default
13
+
14
+ First, load `site_checker` by adding this line to the file where you would like to use it:
15
+
16
+ require 'site_checker'
17
+
18
+ If you want to use it for testing, the line should go in `test_helper.rb`.
19
+
20
+ The usage is quite simple:
21
+
22
+ check_site("http://localhost:3000/app", "http://localhost:3000")
23
+ puts collected_remote_pages.inspect
24
+ puts collected_local_pages.inspect
25
+ puts collected_remote_images.inspect
26
+ puts collected_local_images.inspect
27
+ puts collected_problems.inspect
28
+
29
+ The snippet above opens the `http://localhost:3000/app` link and looks for links and images. If it finds a link to a local page, it recursively checks that page, too. The second argument, `http://localhost:3000`, defines the root of your website.
30
+
31
+ If you don't want to use a DSL-like API, you can still do the following:
32
+
33
+ SiteChecker.check("http://localhost:3000/app", "http://localhost:3000")
34
+ puts SiteChecker.remote_pages.inspect
35
+ puts SiteChecker.local_pages.inspect
36
+ puts SiteChecker.remote_images.inspect
37
+ puts SiteChecker.local_images.inspect
38
+ puts SiteChecker.problems.inspect
39
+
40
+ ### Using on Generated Content
41
+ If you have a static website (e.g. generated by [octopress](https://github.com/imathis/octopress)), you can tell `site_checker` to use folders from the file system. With this approach, you don't need a web server to verify your website:
42
+
43
+ check_site("./public", "./public")
44
+ puts collected_problems.inspect
45
+
46
+ ### Configuration
47
+ You can instruct `site_checker` to ignore certain links:
48
+
49
+ SiteChecker.configure do |config|
50
+ config.ignore_list = ["/", "/atom.xml"]
51
+ end
52
+
53
+ By default, it won't check the status of remote links and images (e.g. 404 or 500), but you can change that like this:
54
+
55
+ SiteChecker.configure do |config|
56
+ config.visit_references = true
57
+ end
58
+
59
+ Deep recursion may be expensive, so you can configure the maximum recursion depth with the following attribute:
60
+
61
+ SiteChecker.configure do |config|
62
+ config.max_recursion_depth = 3
63
+ end
64
+
65
+ ### Examples
66
+ Make sure that there are no local dead links on the website (I'm using [rspec](https://github.com/rspec/rspec) syntax):
67
+
68
+ before(:each) do
69
+ SiteChecker.configure do |config|
70
+ config.ignore_list = ["/atom.xml", "/rss"]
71
+ end
72
+ end
73
+
74
+ it "should not have dead local links" do
75
+ check_site("http://localhost:3000", "http://localhost:3000")
76
+ # this will print out the difference and I don't have to re-run with print
77
+ collected_problems.should be_empty
78
+ end
79
+
80
+ Check that all the local pages can be reached in at most two steps:
81
+
82
+ before(:each) do
83
+ SiteChecker.configure do |config|
84
+ config.ignore_list = ["/atom.xml", "/rss"]
85
+ config.max_recursion_depth = 2
86
+ end
87
+
88
+ @number_of_local_pages = 100
89
+ end
90
+
91
+ it "all the local pages have to be visited" do
92
+ check_site("http://localhost:3000", "http://localhost:3000")
93
+ collected_local_pages.size.should eq @number_of_local_pages
94
+ end
95
+
96
+ ### Troubleshooting
97
+ #### undefined method 'new' for SiteChecker:Module
98
+ This error occurs when the test code calls v0.1.1 methods, but a newer version of the gem has already been installed. Update your test code following the examples above.
99
+
100
+ ### Copyright
101
+
102
+ Copyright (c) 2012 Zsolt Fabok and Contributors. See LICENSE for details.
@@ -0,0 +1,6 @@
1
+ require 'rspec/core/rake_task'
2
+
3
+ RSpec::Core::RakeTask.new(:spec)
4
+
5
+ desc "By default run the test cases"
6
+ task :default => :spec
@@ -0,0 +1,6 @@
1
+ require 'yard'
2
+ require 'yard/rake/yardoc_task'
3
+
4
+ YARD::Rake::YardocTask.new do |t|
5
+ t.files = ['lib/**/*.rb']
6
+ end
@@ -0,0 +1,17 @@
1
+ module SiteChecker
2
+ module DSL
3
+ { :check_site => :check,
4
+ :collected_local_pages => :local_pages,
5
+ :collected_remote_pages => :remote_pages,
6
+ :collected_local_images => :local_images,
7
+ :collected_remote_images => :remote_images,
8
+ :collected_problems => :problems
9
+ }.each do |dsl_method, method|
10
+ define_method dsl_method do |*args, &block|
11
+ SiteChecker.send method, *args, &block
12
+ end
13
+ end
14
+ end
15
+ end
16
+
17
+ include SiteChecker::DSL
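Note on the hunk above: the DSL methods are generated with `define_method`, and each one forwards its arguments to a module-level method of `SiteChecker`. The delegation pattern can be sketched in isolation as follows. `FakeChecker` and `SpecContext` are hypothetical stand-ins, not part of the gem, and the sketch mixes the DSL into a class instead of including it at the top level as the gem does.

```ruby
# Hypothetical stand-in for the gem's SiteChecker module (not the real API).
module FakeChecker
  def self.check(url, root)
    @checked = [url, root]
  end

  def self.problems
    @checked.nil? ? {} : { @checked.first => [] }
  end

  module DSL
    # Map each DSL name to the module-level method it forwards to.
    { :check_site => :check,
      :collected_problems => :problems
    }.each do |dsl_method, method|
      define_method dsl_method do |*args, &block|
        FakeChecker.send method, *args, &block
      end
    end
  end
end

# Assumption: a dedicated class mixes in the DSL, which keeps Object untouched.
class SpecContext
  include FakeChecker::DSL
end

context = SpecContext.new
context.check_site("./public", "./public")
puts context.collected_problems.inspect
```

Including the module at the top level, as the hunk does, makes the same methods available everywhere, at the cost of adding them to every object.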
@@ -0,0 +1,43 @@
1
+ module SiteChecker
2
+ module IO
3
+ class ContentFromFileSystem
4
+
5
+ def initialize(visit_references, root)
6
+ @visit_references = visit_references
7
+ @root = root
8
+ end
9
+
10
+ def get(link)
11
+ begin
12
+ location = create_absolute_reference(link.url)
13
+ if link.local_page?
14
+ content = File.read(add_index_html(location))
15
+ elsif link.local_image?
16
+ File.open(location).close
17
+ elsif @visit_references
18
+ open(link.url)
19
+ end
20
+ rescue Errno::ENOENT => e
21
+ raise "(404 Not Found)"
22
+ rescue => e
23
+ raise "(#{e.message.strip})"
24
+ end
25
+ content
26
+ end
27
+
28
+ private
29
+ def add_index_html(path)
30
+ path = $1 if path.match(/(.+)#/)
31
+ path.end_with?(".html") ? path : File.join(path, "index.html")
32
+ end
33
+
34
+ def create_absolute_reference(link)
35
+ if !link.eql?(@root)
36
+ File.join(@root, link)
37
+ else
38
+ @root
39
+ end
40
+ end
41
+ end
42
+ end
43
+ end
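Note on the hunk above: `add_index_html` decides which file on disk stands for a local page. A minimal sketch of that rule, using only the standard library, is shown here; `index_path` is a hypothetical name chosen for illustration.

```ruby
# Sketch of the path handling in add_index_html; index_path is a hypothetical name.
def index_path(path)
  # Drop a "#fragment" suffix so the anchor is not treated as part of the file name.
  path = $1 if path.match(/(.+)#/)
  # Anything that does not end in .html is treated as a directory holding index.html.
  path.end_with?(".html") ? path : File.join(path, "index.html")
end

puts index_path("public/blog")            # public/blog/index.html
puts index_path("public/about.html#team") # public/about.html
```

Under this rule a page link such as `public/atom.xml` would resolve to `public/atom.xml/index.html`, which may be one reason the README examples put feed URLs on the ignore list.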
@@ -0,0 +1,36 @@
1
+ module SiteChecker
2
+ module IO
3
+ class ContentFromWeb
4
+
5
+ def initialize(visit_references, root)
6
+ @visit_references = visit_references
7
+ @root = root
8
+ end
9
+
10
+ def get(link)
11
+ begin
12
+ uri = create_absolute_reference(link.url)
13
+ if link.local_page?
14
+ content = open(uri)
15
+ elsif link.local_image?
16
+ open(uri)
17
+ elsif @visit_references
18
+ open(uri)
19
+ end
20
+ rescue => e
21
+ raise "(#{e.message.strip})"
22
+ end
23
+ content
24
+ end
25
+
26
+ private
27
+ def create_absolute_reference(link)
28
+ if link.start_with?(@root)
29
+ URI(link)
30
+ else
31
+ URI(@root).merge(link)
32
+ end
33
+ end
34
+ end
35
+ end
36
+ end
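Note on the hunk above: `create_absolute_reference` relies on `URI#merge`, which resolves a relative link against the root the way a browser would. The sketch below reproduces that logic as a standalone helper (`absolute_reference` is a hypothetical name) to show how the shape of the root affects the result.

```ruby
require 'uri'

# Hypothetical helper mirroring create_absolute_reference in the hunk above.
def absolute_reference(root, link)
  link.start_with?(root) ? URI(link) : URI(root).merge(link)
end

puts absolute_reference("http://localhost:3000", "/about")
# http://localhost:3000/about
puts absolute_reference("http://localhost:3000/app/", "page.html")
# http://localhost:3000/app/page.html
puts absolute_reference("http://localhost:3000/app", "page.html")
# http://localhost:3000/page.html
```

The last case shows that a root without a trailing slash loses its final path segment when a relative link is merged, so a root such as `http://localhost:3000/app` probably needs the trailing slash if relative links should stay under `/app`.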
@@ -0,0 +1,60 @@
1
+ module SiteChecker
2
+ class Link
3
+ attr_accessor :url
4
+ attr_accessor :parent_url
5
+ attr_accessor :kind
6
+ attr_accessor :location
7
+ attr_accessor :problem
8
+
9
+ def eql?(other)
10
+ ignore_trailing_slash(@url).eql? ignore_trailing_slash(other.url)
11
+ end
12
+
13
+ def ==(other)
14
+ eql?(other)
15
+ end
16
+
17
+ def hash
18
+ ignore_trailing_slash(@url).hash
19
+ end
20
+
21
+ def self.create(attrs)
22
+ link = Link.new
23
+ attrs.each do |key, value|
24
+ if self.instance_methods.include?("#{key}=".to_sym)
25
+ link.send("#{key}=", value)
26
+ end
27
+ end
28
+ link
29
+ end
30
+
31
+ def has_problem?
32
+ @problem != nil
33
+ end
34
+
35
+ def local_page?
36
+ @location == :local && @kind == :page
37
+ end
38
+
39
+ def local_image?
40
+ @location == :local && @kind == :image
41
+ end
42
+
43
+ def anchor?
44
+ @kind == :anchor
45
+ end
46
+
47
+ def anchor_ref?
48
+ @kind == :anchor_ref
49
+ end
50
+
51
+ def anchor_related?
52
+ anchor? || anchor_ref?
53
+ end
54
+
55
+ private
56
+ def ignore_trailing_slash(url)
57
+ url.gsub(/^\//,"")
58
+ end
59
+ end
60
+ end
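Note on the hunk above: `Link` overrides `eql?`, `==`, and `hash` together, which is what lets `Array#include?` and `Array#uniq` treat two links with equivalent URLs as the same entry. The sketch below isolates that behaviour in a hypothetical `SketchLink` class. It also shows that the pattern `/^\//` matches a slash at the start of the URL, so despite the helper's name it is a leading slash that gets ignored.

```ruby
# Hypothetical, reduced version of Link that keeps only the equality logic.
class SketchLink
  attr_reader :url

  def initialize(url)
    @url = url
  end

  def eql?(other)
    normalize(@url).eql?(normalize(other.url))
  end

  def ==(other)
    eql?(other)
  end

  def hash
    normalize(@url).hash
  end

  private

  # Same pattern as the hunk above: removes a slash at the start of the URL.
  def normalize(url)
    url.gsub(/^\//, "")
  end
end

a = SketchLink.new("/about")
b = SketchLink.new("about")
c = SketchLink.new("about/")

puts(a == b)              # true
puts([a, b].uniq.size)    # 1
puts(a == c)              # false
```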
@@ -0,0 +1,153 @@
1
+ module SiteChecker
2
+ class LinkCollector
3
+ attr_accessor :ignore_list, :visit_references, :max_recursion_depth
4
+
5
+ def initialize
6
+ yield self if block_given?
7
+ @ignore_list ||= []
8
+ @visit_references ||= false
9
+ @max_recursion_depth ||= -1
10
+ end
11
+
12
+ def check(url, root)
13
+ @links = []
14
+ @recursion_depth = 0
15
+ @root = root
16
+
17
+ @content_reader = get_content_reader
18
+
19
+ link = Link.create({:url => url, :kind => :page, :location => :local})
20
+ register_visit(link)
21
+ process_local_page(link)
22
+ evaluate_anchors
23
+ end
24
+
25
+ def local_pages
26
+ get_urls(:local, :page)
27
+ end
28
+
29
+ def remote_pages
30
+ get_urls(:remote, :page)
31
+ end
32
+
33
+ def local_images
34
+ get_urls(:local, :image)
35
+ end
36
+
37
+ def remote_images
38
+ get_urls(:remote, :image)
39
+ end
40
+
41
+ def problems
42
+ problems = {}
43
+ @links.each do |link|
44
+ if link.has_problem?
45
+ problems[link.parent_url] ||= []
46
+ problems[link.parent_url] << "#{link.url} #{link.problem}"
47
+ end
48
+ end
49
+ problems
50
+ end
51
+
52
+ private
53
+ def get_content_reader
54
+ if URI(@root).absolute?
55
+ SiteChecker::IO::ContentFromWeb.new(@visit_references, @root)
56
+ else
57
+ SiteChecker::IO::ContentFromFileSystem.new(@visit_references, @root)
58
+ end
59
+ end
60
+
61
+ def get_urls(location, kind)
62
+ @links.find_all do |link|
63
+ if link.location == location && link.kind == kind
64
+ link
65
+ end
66
+ end.map do |link|
67
+ link.url
68
+ end
69
+ end
70
+
71
+ def process_local_page(parent)
72
+ links = collect_links(parent)
73
+
74
+ links.each do |link|
75
+ link.parent_url = parent.url
76
+ unless link.anchor_related?
77
+ visit(link) unless visited?(link)
78
+ else
79
+ @links << link
80
+ end
81
+ end
82
+ end
83
+
84
+ def register_visit(link)
85
+ @links << link unless visited?(link)
86
+ end
87
+
88
+ def visited?(link)
89
+ @links.include?(link)
90
+ end
91
+
92
+ def visit(link)
93
+ register_visit(link)
94
+ unless link.local_page?
95
+ open_reference(link)
96
+ else
97
+ unless stop_recursion?
98
+ @recursion_depth += 1
99
+ process_local_page(link)
100
+ @recursion_depth -= 1
101
+ end
102
+ end
103
+ end
104
+
105
+ def open_reference(link)
106
+ content = nil
107
+ begin
108
+ content = @content_reader.get(link)
109
+ rescue => e
110
+ link.problem = "#{e.message.strip}"
111
+ end
112
+ content
113
+ end
114
+
115
+ def collect_links(link)
116
+ content = open_reference(link)
117
+ return SiteChecker::Parse::Page.parse(content, @ignore_list, @root)
118
+ end
119
+
120
+ def stop_recursion?
121
+ if @max_recursion_depth == -1
122
+ false
123
+ elsif @max_recursion_depth > @recursion_depth
124
+ false
125
+ else
126
+ true
127
+ end
128
+ end
129
+
130
+ def evaluate_anchors
131
+ anchors = @links.find_all {|link| link.anchor?}
132
+ anchor_references = @links.find_all {|link| link.anchor_ref?}
133
+ anchor_references.each do |anchor_ref|
134
+ if find_matching_anchor(anchors, anchor_ref).empty?
135
+ anchor_ref.problem = "(404 Not Found)"
136
+ end
137
+ end
138
+ end
139
+
140
+ def find_matching_anchor(anchors, anchor_ref)
141
+ result = []
142
+ anchors.each do |anchor|
143
+ if (anchor.parent_url == anchor_ref.parent_url &&
144
+ anchor_ref.url == "##{anchor.url}") ||
145
+ (anchor.parent_url != anchor_ref.parent_url &&
146
+ anchor_ref.url == "#{anchor.parent_url}##{anchor.url}")
147
+ result << anchor
148
+ end
149
+ end
150
+ result
151
+ end
152
+ end
153
+ end
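Note on the hunk above: `visit` registers every link it meets, but only descends into a local page while `stop_recursion?` returns false. A simplified sketch of that depth-limited traversal over an in-memory link graph is given below. `crawl` and the `graph` hash are hypothetical; the real collector reads pages through a content reader and also records images and anchors.

```ruby
# graph maps a page to the pages it links to; max_depth of -1 means "no limit".
def crawl(graph, start, max_depth = -1)
  visited = [start]
  walk = lambda do |page, depth|
    graph.fetch(page, []).each do |child|
      next if visited.include?(child)
      visited << child
      # Mirrors stop_recursion?: descend only while the limit exceeds the depth.
      walk.call(child, depth + 1) if max_depth == -1 || max_depth > depth
    end
  end
  walk.call(start, 0)
  visited
end

graph = { "a" => ["b"], "b" => ["c"], "c" => ["d"], "d" => ["e"] }
p crawl(graph, "a")      # ["a", "b", "c", "d", "e"]
p crawl(graph, "a", 1)   # ["a", "b", "c"]
```

Because a link is registered before the depth check, pages one level past the limit still show up in the result even though their own links are never followed.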
@@ -0,0 +1,82 @@
1
+ module SiteChecker
2
+ module Parse
3
+ class Page
4
+ def self.parse(content, ignore_list, root)
5
+ links = []
6
+ page = Nokogiri(content)
7
+
8
+ links.concat(get_links(page, ignore_list, root))
9
+ links.concat(get_images(page, ignore_list, root))
10
+ links.concat(get_anchors(page))
11
+ links.concat(local_pages_which_has_anchor_references(links, root))
12
+
13
+ links.uniq
14
+ end
15
+
16
+ private
17
+ def self.get_links(page, ignore_list, root)
18
+ links = []
19
+ page.xpath("//a").reject {|a| ignored?(ignore_list, a['href'])}.each do |a|
20
+ if a['href'].match(/(.*)#.+/) && !URI($1).absolute?
21
+ kind = :anchor_ref
22
+ else
23
+ kind = :page
24
+ end
25
+ links << Link.create({:url => a['href'], :kind => kind})
26
+ end
27
+ set_location(links, root)
28
+ end
29
+
30
+ def self.get_images(page, ignore_list, root)
31
+ links = []
32
+ page.xpath("//img").reject {|img| ignored?(ignore_list, img['src'])}.each do |img|
33
+ links << Link.create({:url => img['src'], :kind => :image})
34
+ end
35
+ set_location(links, root)
36
+ end
37
+
38
+ def self.set_location(links, root)
39
+ links.each do |link|
40
+ uri = URI(link.url)
41
+ if uri.to_s.start_with?(root)
42
+ link.problem = "(absolute path)"
43
+ link.location = :local
44
+ else
45
+ if uri.absolute?
46
+ link.location = :remote
47
+ else
48
+ link.location = :local
49
+ end
50
+ end
51
+ end
52
+ end
53
+
54
+ def self.ignored?(ignore_list, link)
55
+ if link
56
+ ignore_list.include? link
57
+ else
58
+ true
59
+ end
60
+ end
61
+
62
+ def self.get_anchors(page)
63
+ anchors = []
64
+ page.xpath("//a").reject {|a| !a['id']}.each do |a|
65
+ anchors << Link.create({:url => a['id'], :kind => :anchor})
66
+ end
67
+ anchors
68
+ end
69
+
70
+ def self.local_pages_which_has_anchor_references(links, root)
71
+ new_links = []
72
+ links.find_all {|link| link.anchor_ref?}.each do |link|
73
+ uri = URI(link.url)
74
+ if link.url.match(/(.+)#/)
75
+ new_links << Link.create({:url => $1, :kind => :page})
76
+ end
77
+ end
78
+ set_location(new_links, root)
79
+ end
80
+ end
81
+ end
82
+ end
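Note on the hunk above: `set_location` decides whether a link is local or remote, and flags links that spell out the root as `(absolute path)` problems. The decision can be sketched with the standard library alone; `classify` is a hypothetical helper that returns the location together with the problem text, if any.

```ruby
require 'uri'

# Hypothetical helper mirroring the branches of set_location.
def classify(url, root)
  if url.start_with?(root)
    # A link that repeats the root is local, but is reported as a problem.
    [:local, "(absolute path)"]
  elsif URI(url).absolute?
    [:remote, nil]
  else
    [:local, nil]
  end
end

root = "http://localhost:3000"
p classify("/about", root)                       # [:local, nil]
p classify("http://example.com/logo.png", root)  # [:remote, nil]
p classify("http://localhost:3000/about", root)  # [:local, "(absolute path)"]
```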