robotstxt-parser 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: ab84cf493844dcbd92489c277344cf25746adff2
4
+ data.tar.gz: 285a77121e447cbec3f192eed1a3a7de03bc1bdc
5
+ SHA512:
6
+ metadata.gz: 567cff3966ac583e462b7e8ab337d27695f76a20a015ddb2ca42c8052823fb5cef03b80000974ce39b2bcbc42d4ad707a826dd87cd2b1468f02fa996a3a4dcee
7
+ data.tar.gz: 054054c786da1d87adc3f853c4cfa0fc6a24238c2dbd2cb47ce6d3e25babb2bb315e59625cfacfd2d97e2aaad55e39d116f24670ebb09eb1ccdc20f48af625cd
data/.gitignore ADDED
@@ -0,0 +1,26 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ coverage
6
+ InstalledFiles
7
+ lib/bundler/man
8
+ pkg
9
+ rdoc
10
+ spec/reports
11
+ test/tmp
12
+ test/version_tmp
13
+ tmp
14
+
15
+ Gemfile.lock
16
+ out/
17
+ sample.rb
18
+ run_sample.rb
19
+ src/
20
+ docs/
21
+
22
+ # YARD artifacts
23
+ .yardoc
24
+ _yardoc
25
+ doc/
26
+ .DS_Store
data/.travis.yml ADDED
@@ -0,0 +1,6 @@
1
+ language: ruby
2
+ rvm:
3
+ - 2.0
4
+
5
+ install:
6
+ - bundle install
data/Gemfile ADDED
@@ -0,0 +1,3 @@
1
+ source "http://rubygems.org"
2
+
3
+ gemspec
data/LICENSE.rdoc ADDED
@@ -0,0 +1,26 @@
1
+ = License
2
+
3
+ (The MIT License)
4
+
5
+ Copyright (c) 2010 Conrad Irwin <conrad@rapportive.com>
6
+ Copyright (c) 2009 Simone Rinzivillo <srinzivillo@gmail.com>
7
+
8
+ Permission is hereby granted, free of charge, to any person obtaining
9
+ a copy of this software and associated documentation files (the
10
+ "Software"), to deal in the Software without restriction, including
11
+ without limitation the rights to use, copy, modify, merge, publish,
12
+ distribute, sublicense, and/or sell copies of the Software, and to
13
+ permit persons to whom the Software is furnished to do so, subject to
14
+ the following conditions:
15
+
16
+ The above copyright notice and this permission notice shall be
17
+ included in all copies or substantial portions of the Software.
18
+
19
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
20
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
21
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
22
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
23
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
24
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
25
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
26
+
data/README.rdoc ADDED
@@ -0,0 +1,199 @@
1
+ = Robotstxt
2
+
3
+ Robotstxt is a Ruby robots.txt file parser.
4
+
5
+ The robots.txt exclusion protocol is a simple mechanism whereby site-owners can guide
6
+ any automated crawlers to relevant parts of their site, and prevent them accessing content
7
+ which is intended only for other eyes. For more information, see http://www.robotstxt.org/.
8
+
9
+ This library provides mechanisms for obtaining and parsing the robots.txt file from
10
+ websites. As there is no official "standard" it tries to do something sensible,
11
+ though inspiration was taken from:
12
+
13
+ - http://www.robotstxt.org/orig.html
14
+ - http://www.robotstxt.org/norobots-rfc.txt
15
+ - http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449&from=35237
16
+ - http://nikitathespider.com/articles/RobotsTxt.html
17
+
18
+ While the parsing semantics of this library are explained below, you should not
19
+ write robots.txt files that depend on all robots acting the same -- they simply won't.
20
+ Even the various Ruby libraries support very different subsets of
21
+ functionality.
22
+
23
+ This gem builds on the work of Simone Rinzivillo, and is released under the MIT
24
+ license -- see the LICENSE.rdoc file.
25
+
26
+ == Usage
27
+
28
+ There are two public points of interest, firstly the Robotstxt module, and
29
+ secondly the Robotstxt::Parser class.
30
+
31
+ The Robotstxt module has three public methods:
32
+
33
+ - Robotstxt.get source, user_agent, (options)
34
+ Returns a Robotstxt::Parser for the robots.txt obtained from source.
35
+
36
+ - Robotstxt.parse robots_txt, user_agent
37
+ Returns a Robotstxt::Parser for the robots.txt passed in.
38
+
39
+ - Robotstxt.get_allowed? urlish, user_agent, (options)
40
+ Returns true iff the robots.txt obtained from the host identified by the
41
+ urlish allows the given user agent access to the url.
42
+
43
+ The Robotstxt::Parser class contains two pieces of state, the user_agent and the
44
+ text of the robots.txt. In addition its instances have two public methods:
45
+
46
+ - Robotstxt::Parser#allowed? urlish
47
+ Returns true iff the robots.txt file allows this user_agent access to that
48
+ url.
49
+
50
+ - Robotstxt::Parser#sitemaps
51
+ Returns a list of the sitemaps listed in the robots.txt file.
52
+
53
+ In the above there are five kinds of parameter:
54
+
55
+ A "urlish" is either a String that represents a URL (suitable for passing to
56
+ URI.parse) or a URI object, i.e.
57
+
58
+ urlish = "http://www.example.com/"
59
+ urlish = "/index.html"
60
+ urlish = "https://compicat.ed/home?action=fire#joking"
61
+ urlish = URI.parse("http://example.co.uk")
62
+
63
+ A "source" is either a "urlish", or a Net::HTTP connection. This allows the
64
+ library to re-use the same connection when the server respects Keep-alive:
65
+ headers, i.e.
66
+
67
+ source = Net::HTTP.new("example.com", 80)
68
+ Net::HTTP.start("example.co.uk", 80) do |http|
69
+ source = http
70
+ end
71
+ source = "http://www.example.com/index.html"
72
+
73
+ When a "urlish" is provided, only the host and port sections are used, and
74
+ the path is forced to "/robots.txt".
75
+
76
+ A "robots_txt" is the textual content of a robots.txt file that is in the
77
+ same encoding as the urls you will be fetching (normally utf8).
78
+
79
+ A "user_agent" is the string value you use in your User-agent: header.
80
+
81
+ The "options" is an optional hash containing
82
+ :num_redirects (5) - the number of redirects to follow before giving up.
83
+ :http_timeout (10) - the length of time in seconds to wait for one http
84
+ request
85
+ :url_charset (utf8) - the charset which you will use to encode your urls.
86
+
87
+ I recommend not passing the options unless you have to.
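+
+ If you do need to override them, here is a minimal sketch (example.com and the
+ "Crawler" user-agent are placeholders; the keys are the options documented above):
+
+   robots = Robotstxt.get("http://example.com/", "Crawler",
+                          :num_redirects => 3, :http_timeout => 5)
+   robots.allowed?("/index.html")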
88
+
89
+ == Examples
90
+
91
+ url = "http://example.com/index.html"
92
+ if Robotstxt.get_allowed?(url, "Crawler")
93
+ open(url)
94
+ end
95
+
96
+
97
+ Net::HTTP.start("example.co.uk") do |http|
98
+ robots = Robotstxt.get(http, "Crawler")
99
+
100
+ if robots.allowed? "/index.html"
101
+ http.get("/index.html")
102
+ elsif robots.allowed? "/index.php"
103
+ http.get("/index.php")
104
+ end
105
+ end
106
+
107
+ == Details
108
+
109
+ === Request level
110
+
111
+ This library handles different HTTP status codes according to the specifications
112
+ on robotstxt.org, in particular:
113
+
114
+ If an HTTPUnauthorized or an HTTPForbidden is returned when trying to access
115
+ /robots.txt, then the entire site should be considered "Disallowed".
116
+
117
+ If an HTTPRedirection is returned, it should be followed (though we give up
118
+ after five redirects, to avoid infinite loops).
119
+
120
+ If an HTTPSuccess is returned, the body is converted into utf8, and then parsed.
121
+
122
+ Any other response, or no response, indicates that there are no Disallowed urls
123
+ on the site.
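+
+ For illustration (a sketch only -- assume example.com is a host whose /robots.txt
+ returns 403 Forbidden, and "Crawler" is your user-agent):
+
+   Robotstxt.get_allowed?("http://example.com/index.html", "Crawler")
+   # => false -- the 403 means the whole site is treated as disallowed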
124
+
125
+ === User-agent matching
126
+
127
+ This is case-insensitive substring matching, i.e. equivalent to matching the
128
+ user agent with /.*thing.*/i.
129
+
130
+ Additionally, * characters are interpreted as meaning any number of any character (in
131
+ regular expression idiom: /.*/). Google implies that it does this, at least for
132
+ trailing *s, and the standard implies that "*" is a special user agent meaning
133
+ "everything not referred to so far".
134
+
135
+ There can be multiple User-agent: lines for each section of Allow: and Disallow:
136
+ lines in the robots.txt file:
137
+
138
+ User-agent: Google
139
+ User-agent: Bing
140
+ Disallow: /secret
141
+
142
+ In cases like this, all user-agents inherit the same set of rules.
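+
+ A minimal sketch of this behaviour, using Robotstxt.parse directly (the
+ user-agent names are placeholders):
+
+   robots_txt = "User-agent: Google\nUser-agent: Bing\nDisallow: /secret\n"
+   Robotstxt.parse(robots_txt, "Bing").allowed?("/secret/page")   # => false
+   Robotstxt.parse(robots_txt, "Other").allowed?("/secret/page")  # => true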
143
+
144
+ === Path matching
145
+
146
+ This is case-sensitive prefix matching, i.e. equivalent to matching the
147
+ requested path (or path + '?' + query) against /^thing.*/. As with user-agents,
148
+ * is interpreted as any number of any character.
149
+
150
+ Additionally, when the pattern ends with a $, it forces the pattern to match the
151
+ entire path (or path + '?' + query).
152
+
153
+ In order to get consistent results, before the globs are matched, the %-encoding
154
+ is normalised so that only /?&= remain %-encoded. For example, /h%65llo/ is the
155
+ same as /hello/, but /ac%2fdc is not the same as /ac/dc -- this is due to the
156
+ significance of / as a path separator in urls.
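+
+ A small sketch of that normalisation, again using Robotstxt.parse (the rules and
+ the "Crawler" user-agent are made up for illustration):
+
+   robots = Robotstxt.parse("User-agent: *\nDisallow: /hello/\nDisallow: /ac%2Fdc\n", "Crawler")
+   robots.allowed?("/h%65llo/world")  # => false ("%65" is unescaped to "e")
+   robots.allowed?("/ac/dc")          # => true  ("%2F" does not act as a path separator)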
157
+
158
+ The rules of the first section whose User-agent: line matches our user-agent (by
159
+ order of appearance in the file) are checked in order of appearance. The first Allow: or
160
+ Disallow: rule that matches the url is accepted. This is prescribed by
161
+ robotstxt.org, but other parsers take wildly different strategies:
162
+ - Google checks all Allow: rules, then all Disallow: rules
163
+ - Bing checks the most specific rule first
164
+ - Others check all Disallow: rules, then all Allow: rules
165
+
166
+ As is conventional, a "Disallow: " line with no path given is treated as
167
+ "Allow: *", and if a URL didn't match any path specifiers (or the user-agent
168
+ didn't match any user-agent sections) then that is implicit permission to crawl.
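+
+ A minimal sketch combining the rules above ("Crawler" and the Disallow: pattern
+ are made up for illustration):
+
+   robots = Robotstxt.parse("User-agent: *\nDisallow: /*.pdf$\n", "Crawler")
+   robots.allowed?("/guide.pdf")         # => false
+   robots.allowed?("/guide.pdf?view=1")  # => false (the $ also matches just before the query string)
+   robots.allowed?("/pdfs/index.html")   # => true
+   robots.sitemaps                       # => []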
169
+
170
+ == TODO
171
+
172
+ I would like to add support for the Crawl-delay directive, and indeed any other
173
+ parameters in use.
174
+
175
+ == Requirements
176
+
177
+ * Ruby >= 1.8.7
178
+ * iconv, net/http and uri
179
+
180
+ == Installation
181
+
182
+ This library is intended to be installed via the
183
+ RubyGems[http://rubyforge.org/projects/rubygems/] system.
184
+
185
+ $ gem install robotstxt-parser
186
+
187
+ You might need administrator privileges on your system to install it.
188
+
189
+ == Author
190
+
191
+ Author:: Conrad Irwin <conrad@rapportive.com>
192
+ Author:: {Simone Rinzivillo}[http://www.simonerinzivillo.it/] <srinzivillo@gmail.com>
193
+
194
+ == License
195
+
196
+ Robotstxt is released under the MIT license.
197
+ Copyright (c) 2010 Conrad Irwin
198
+ Copyright (c) 2009 Simone Rinzivillo
199
+
data/Rakefile ADDED
@@ -0,0 +1,12 @@
1
+ require 'rake/testtask'
2
+
3
+ require 'bundler'
4
+ Bundler::GemHelper.install_tasks
5
+
6
+ Rake::TestTask.new do |t|
7
+ t.libs << "test"
8
+ t.test_files = FileList['test/*_test.rb']
9
+ t.verbose = true
10
+ end
11
+
12
+ task :default => [:test]
data/lib/robotstxt.rb ADDED
@@ -0,0 +1,93 @@
1
+ #
2
+ # = Ruby Robotstxt
3
+ #
4
+ # A Ruby robots.txt parser.
5
+ #
6
+ #
7
+ # Category:: Net
8
+ # Package:: Robotstxt
9
+ # Author:: Conrad Irwin <conrad@rapportive.com>, Simone Rinzivillo <srinzivillo@gmail.com>
10
+ # License:: MIT License
11
+ #
12
+ #--
13
+ #
14
+ #++
15
+
16
+ require 'robotstxt/common'
17
+ require 'robotstxt/parser'
18
+ require 'robotstxt/getter'
19
+
20
+ # Provides a flexible interface to help authors of web-crawlers
21
+ # respect the robots.txt exclusion standard.
22
+ #
23
+ module Robotstxt
24
+
25
+ NAME = 'Robotstxt'
26
+ GEM = 'robotstxt'
27
+ AUTHORS = ['Conrad Irwin <conrad@rapportive.com>', 'Simone Rinzivillo <srinzivillo@gmail.com>']
28
+ VERSION = '1.0'
29
+
30
+ # Obtains and parses a robotstxt file from the host identified by source,
31
+ # source can either be a URI, a string representing a URI, or a Net::HTTP
32
+ # connection associated with a host.
33
+ #
34
+ # The second parameter should be the user-agent header for your robot.
35
+ #
36
+ # There are currently two options:
37
+ # :num_redirects (default 5) is the maximum number of HTTP 3** responses
38
+ # the get() method will accept and follow the Location: header before
39
+ # giving up.
40
+ # :http_timeout (default 10) is the number of seconds to wait for each
41
+ # request before giving up.
42
+ # :url_charset (default "utf8") the character encoding you will use to
43
+ # encode urls.
44
+ #
45
+ # As indicated by robotstxt.org, this library treats HTTPUnauthorized and
46
+ # HTTPForbidden as though the robots.txt file denied access to the entire
47
+ # site; all other HTTP responses or errors are treated as though the site
48
+ # allowed all access.
49
+ #
50
+ # The return value is a Robotstxt::Parser, which you can then interact with
51
+ # by calling .allowed? or .sitemaps. i.e.
52
+ #
53
+ # Robotstxt.get("http://example.com/", "SuperRobot").allowed? "/index.html"
54
+ #
55
+ # Net::HTTP.start("example.com") do |http|
56
+ # if Robotstxt.get(http, "SuperRobot").allowed? "/index.html"
57
+ # http.get("/index.html")
58
+ # end
59
+ # end
60
+ #
61
+ def self.get(source, robot_id, options={})
62
+ self.parse(Getter.new.obtain(source, robot_id, options), robot_id)
63
+ end
64
+
65
+ # Parses the contents of a robots.txt file for the given robot_id
66
+ #
67
+ # Returns a Robotstxt::Parser object with methods .allowed? and
68
+ # .sitemaps, i.e.
69
+ #
70
+ # Robotstxt.parse("User-agent: *\nDisallow: /a", "SuperRobot").allowed? "/b"
71
+ #
72
+ def self.parse(robotstxt, robot_id)
73
+ Parser.new(robot_id, robotstxt)
74
+ end
75
+
76
+ # Gets a robotstxt file from the host identified by the uri
77
+ # (which can be a URI object or a string)
78
+ #
79
+ # Parses it for the given robot_id
80
+ # (which should be your user-agent)
81
+ #
82
+ # Returns true iff your robot can access said uri.
83
+ #
84
+ # Robotstxt.get_allowed? "http://www.example.com/good", "SuperRobot"
85
+ #
86
+ def self.get_allowed?(uri, robot_id)
87
+ self.get(uri, robot_id).allowed? uri
88
+ end
89
+
90
+ def self.ultimate_scrubber(str)
91
+ str.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => '')
92
+ end
93
+ end
data/lib/robotstxt/common.rb ADDED
@@ -0,0 +1,25 @@
1
+ require 'uri'
2
+ require 'net/http'
3
+
4
+ module Robotstxt
5
+ module CommonMethods
6
+
7
+ protected
8
+ # Convert a URI or a String into a URI
9
+ def objectify_uri(uri)
10
+
11
+ if uri.is_a? String
12
+ # URI.parse will explode when given a character that it thinks
13
+ # shouldn't appear in uris. We thus escape them before passing the
14
+ # string into the function. Unfortunately URI.escape does not respect
15
+ # all characters that have meaning in HTTP (esp. #), so we are forced
16
+ # to state exactly which characters we would like to escape.
17
+ uri = URI.escape(uri, %r{[^!$#%&'()*+,\-./0-9:;=?@A-Z_a-z~]})
18
+ uri = URI.parse(uri)
19
+ else
20
+ uri
21
+ end
22
+
23
+ end
24
+ end
25
+ end
data/lib/robotstxt/getter.rb ADDED
@@ -0,0 +1,79 @@
1
+ module Robotstxt
2
+ class Getter
3
+ include CommonMethods
4
+
5
+ # Get the text of a robots.txt file from the given source, see #get.
6
+ def obtain(source, robot_id, options)
7
+ options = {
8
+ :num_redirects => 5,
9
+ :http_timeout => 10
10
+ }.merge(options)
11
+
12
+ robotstxt = if source.is_a? Net::HTTP
13
+ obtain_via_http(source, "/robots.txt", robot_id, options)
14
+ else
15
+ uri = objectify_uri(source)
16
+ http = Net::HTTP.new(uri.host, uri.port)
17
+ http.read_timeout = options[:http_timeout]
18
+ if uri.scheme == 'https'
19
+ http.use_ssl = true
20
+ http.verify_mode = OpenSSL::SSL::VERIFY_NONE
21
+ end
22
+ obtain_via_http(http, "/robots.txt", robot_id, options)
23
+ end
24
+ end
25
+
26
+ protected
27
+
28
+ # Recursively try to obtain robots.txt following redirects and handling the
29
+ # various HTTP response codes as indicated on robotstxt.org
30
+ def obtain_via_http(http, uri, robot_id, options)
31
+ response = http.get(uri, {'User-Agent' => robot_id})
32
+
33
+ begin
34
+ case response
35
+ when Net::HTTPSuccess
36
+ decode_body(response)
37
+ when Net::HTTPRedirection
38
+ if options[:num_redirects] > 0 && response['location']
39
+ options[:num_redirects] -= 1
40
+ obtain(response['location'], robot_id, options)
41
+ else
42
+ all_allowed
43
+ end
44
+ when Net::HTTPUnauthorized
45
+ all_forbidden
46
+ when Net::HTTPForbidden
47
+ all_forbidden
48
+ else
49
+ all_allowed
50
+ end
51
+ rescue Timeout::Error #, StandardError
52
+ all_allowed
53
+ end
54
+
55
+ end
56
+
57
+ # A robots.txt body that forbids access to everywhere
58
+ def all_forbidden
59
+ "User-agent: *\nDisallow: /\n"
60
+ end
61
+
62
+ # A robots.txt body that allows access to everywhere
63
+ def all_allowed
64
+ "User-agent: *\nDisallow:\n"
65
+ end
66
+
67
+ # Decode the response's body according to the character encoding in the HTTP
68
+ # headers.
69
+ # In the case that we can't decode, Ruby's laissez faire attitude to encoding
70
+ # should mean that we have a reasonable chance of working anyway.
71
+ def decode_body(response)
72
+ return nil if response.body.nil?
73
+ Robotstxt.ultimate_scrubber(response.body)
74
+ end
75
+
76
+
77
+ end
78
+
79
+ end
data/lib/robotstxt/parser.rb ADDED
@@ -0,0 +1,256 @@
1
+
2
+ module Robotstxt
3
+ # Parses robots.txt files for the perusal of a single user-agent.
4
+ #
5
+ # The behaviour implemented is guided by the following sources, though
6
+ # as there is no widely accepted standard, it may differ from other implementations.
7
+ # If you consider its behaviour to be in error, please contact the author.
8
+ #
9
+ # http://www.robotstxt.org/orig.html
10
+ # - the original, now imprecise and outdated version
11
+ # http://www.robotstxt.org/norobots-rfc.txt
12
+ # - a much more precise, outdated version
13
+ # http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449&from=35237
14
+ # - a few hints at modern protocol extensions.
15
+ #
16
+ # This parser only considers lines starting with (case-insensitively:)
17
+ # Useragent: User-agent: Allow: Disallow: Sitemap:
18
+ #
19
+ # The file is divided into sections, each of which contains one or more User-agent:
20
+ # lines, followed by one or more Allow: or Disallow: rules.
21
+ #
22
+ # The first section that contains a User-agent: line that matches the robot's
23
+ # user-agent, is the only section that relevent to that robot. The sections are checked
24
+ # in the same order as they appear in the file.
25
+ #
26
+ # (The * character is taken to mean "any number of any characters" during matching of
27
+ # user-agents)
28
+ #
29
+ # Within that section, the first Allow: or Disallow: rule that matches the expression
30
+ # is taken as authoritative. If no rule in a section matches, the access is Allowed.
31
+ #
32
+ # (The order of matching is as in the RFC, Google matches all Allows and then all Disallows,
33
+ # while Bing matches the most specific rule, I'm sure there are other interpretations)
34
+ #
35
+ # When matching urls, all % encodings are normalised (except for /?=& which have meaning)
36
+ # and "*"s match any number of any character.
37
+ #
38
+ # If a pattern ends with a $, then the pattern must match the entire path, or the entire
39
+ # path with query string.
40
+ #
41
+ class Parser
42
+ include CommonMethods
43
+
44
+ # Gets every Sitemap mentioned in the body of the robots.txt file.
45
+ #
46
+ attr_reader :sitemaps
47
+
48
+ # Create a new parser for this user_agent and this robots.txt contents.
49
+ #
50
+ # This assumes that the robots.txt is ready-to-parse, in particular that
51
+ # it has been decoded as necessary, including removal of byte-order-marks et.al.
52
+ #
53
+ # Not passing a body is deprecated, but retained for compatibility with clients
54
+ # written for version 0.5.4.
55
+ #
56
+ def initialize(user_agent, body)
57
+ @robot_id = user_agent
58
+ @found = true
59
+ parse(body) # set @body, @rules and @sitemaps
60
+ end
61
+
62
+ # Given a URI object, or a string representing one, determine whether this
63
+ # robots.txt would allow access to the path.
64
+ def allowed?(uri)
65
+
66
+ uri = objectify_uri(uri)
67
+ path = (uri.path || "/") + (uri.query ? '?' + uri.query : '')
68
+ path_allowed?(@robot_id, path)
69
+
70
+ end
71
+
72
+ protected
73
+
74
+ # Check whether the relative path (a string of the url's path and query
75
+ # string) is allowed by the rules we have for the given user_agent.
76
+ #
77
+ def path_allowed?(user_agent, path)
78
+
79
+ @rules.each do |(ua_glob, path_globs)|
80
+
81
+ if match_ua_glob user_agent, ua_glob
82
+ path_globs.each do |(path_glob, allowed)|
83
+ return allowed if match_path_glob path, path_glob
84
+ end
85
+ return true
86
+ end
87
+
88
+ end
89
+ true
90
+ end
91
+
92
+
93
+ # This does a case-insensitive substring match such that if the user agent
94
+ # is contained within the glob, or vice-versa, we will match.
95
+ #
96
+ # According to the standard, *s shouldn't appear in the user-agent field
97
+ # except in the case of "*" meaning all user agents. Google however imply
98
+ # that the * will work, at least at the end of a string.
99
+ #
100
+ # For consistency, and because it seems expected behaviour, and because
101
+ # a glob * will match a literal * we use glob matching not string matching.
102
+ #
103
+ # The standard also advocates a substring match of the robot's user-agent
104
+ # within the user-agent field. From observation, it seems much more likely
105
+ # that the match will be the other way about, though we check for both.
106
+ #
107
+ def match_ua_glob(user_agent, glob)
108
+
109
+ glob =~ Regexp.new(Regexp.escape(user_agent), "i") ||
110
+ user_agent =~ Regexp.new(reify(glob), "i")
111
+
112
+ end
113
+
114
+ # This does case-sensitive prefix matching, such that if the path starts
115
+ # with the glob, we will match.
116
+ #
117
+ # According to the standard, that's it. However, it seems reasonably common
118
+ # for asterkisks to be interpreted as though they were globs.
119
+ #
120
+ # Additionally, some search engines, like Google, will treat a trailing $
121
+ # sign as forcing the glob to match the entire path - whether including
122
+ # or excluding the query string is not clear, so we check both.
123
+ #
124
+ # (i.e. it seems likely that a site owner who has Disallow: *.pdf$ expects
125
+ # to disallow requests to *.pdf?i_can_haz_pdf, which the robot could, if
126
+ # it were feeling malicious, construe.)
127
+ #
128
+ # With URLs there is the additional complication that %-encoding can give
129
+ # multiple representations for identical URLs, this is handled by
130
+ # normalize_percent_encoding.
131
+ #
132
+ def match_path_glob(path, glob)
133
+
134
+ if glob =~ /\$$/
135
+ end_marker = '(?:\?|$)'
136
+ glob = glob.gsub /\$$/, ""
137
+ else
138
+ end_marker = ""
139
+ end
140
+
141
+ glob = Robotstxt.ultimate_scrubber normalize_percent_encoding(glob)
142
+ path = Robotstxt.ultimate_scrubber normalize_percent_encoding(path)
143
+
144
+ path =~ Regexp.new("^" + reify(glob) + end_marker)
145
+
146
+ # Some people encode bad UTF-8 in their robots.txt files, let us not behave badly.
147
+ rescue RegexpError
148
+ false
149
+ end
150
+
151
+ # As a general rule, we want to ignore different representations of the
152
+ # same URL. Naively we could just unescape, or escape, everything, however
153
+ # the standard implies that a / is a HTTP path separator, while a %2F is an
154
+ # encoded / that does not act as a path separator. Similar issues with ?, &
155
+ # and =, though all other characters are fine. (While : also has a special
156
+ # meaning in HTTP, most implementations ignore this in the path)
157
+ #
158
+ # It's also worth noting that %-encoding is case-insensitive, so we
159
+ # explicitly upcase the few that we want to keep.
160
+ #
161
+ def normalize_percent_encoding(path)
162
+
163
+ # First double-escape any characters we don't want to unescape
164
+ # & / = ?
165
+ path = path.gsub(/%(26|2F|3D|3F)/i) do |code|
166
+ "%25#{code.upcase}"
167
+ end
168
+
169
+ URI.unescape(path)
170
+
171
+ end
172
+
173
+ # Convert the asterisks in a glob into (.*)s for regular expressions,
174
+ # and at the same time, escape any other characters that would have
175
+ # a significance in a regex.
176
+ #
177
+ def reify(glob)
178
+ glob = Robotstxt.ultimate_scrubber(glob)
179
+
180
+ # -1 on a split prevents trailing empty strings from being deleted.
181
+ glob.split("*", -1).map{ |part| Regexp.escape(part) }.join(".*")
182
+
183
+ end
184
+
185
+ # Convert the @body into a set of @rules so that our parsing mechanism
186
+ # becomes easier.
187
+ #
188
+ # @rules is an array of pairs. The first in the pair is the glob for the
189
+ # user-agent and the second another array of pairs. The first of the new
190
+ # pair is a glob for the path, and the second whether it appears in an
191
+ # Allow: or a Disallow: rule.
192
+ #
193
+ # For example:
194
+ #
195
+ # User-agent: *
196
+ # Disallow: /secret/
197
+ # Allow: / # allow everything...
198
+ #
199
+ # Would be parsed so that:
200
+ #
201
+ # @rules = [["*", [ ["/secret/", false], ["/", true] ]]]
202
+ #
203
+ #
204
+ # The order of the arrays is maintained so that the first match in the file
205
+ # is obeyed as indicated by the pseudo-RFC on http://robotstxt.org/. There
206
+ # are alternative interpretations, some parse by speicifity of glob, and
207
+ # some check Allow lines for any match before Disallow lines. All are
208
+ # justifiable, but we could only pick one.
209
+ #
210
+ # Note that a blank Disallow: should be treated as an Allow: * and multiple
211
+ # user-agents may share the same set of rules.
212
+ #
213
+ def parse(body)
214
+
215
+ @body = Robotstxt.ultimate_scrubber(body)
216
+ @rules = []
217
+ @sitemaps = []
218
+
219
+ body.split(/[\r\n]+/).each do |line|
220
+ prefix, value = line.delete("\000").split(":", 2).map(&:strip)
221
+ value.sub! /\s+#.*/, '' if value
222
+ parser_mode = :begin
223
+
224
+ if prefix && value
225
+
226
+ case prefix.downcase
227
+ when /^user-?agent$/
228
+ if parser_mode == :user_agent
229
+ @rules << [value, @rules.last[1]]
230
+ else
231
+ parser_mode = :user_agent
232
+ @rules << [value, []]
233
+ end
234
+ when "disallow"
235
+ parser_mode = :rules
236
+ @rules << ["*", []] if @rules.empty?
237
+
238
+ if value == ""
239
+ @rules.last[1] << ["*", true]
240
+ else
241
+ @rules.last[1] << [value, false]
242
+ end
243
+ when "allow"
244
+ parser_mode = :rules
245
+ @rules << ["*", []] if @rules.empty?
246
+ @rules.last[1] << [value, true]
247
+ when "sitemap"
248
+ @sitemaps << value
249
+ else
250
+ # Ignore comments, Crawl-delay: and badly formed lines.
251
+ end
252
+ end
253
+ end
254
+ end
255
+ end
256
+ end
data/robotstxt.gemspec ADDED
@@ -0,0 +1,19 @@
1
+ # -*- encoding: utf-8 -*-
2
+ $:.push File.expand_path("../lib", __FILE__)
3
+
4
+ Gem::Specification.new do |gem|
5
+ gem.name = "robotstxt-parser"
6
+ gem.version = "0.1.0"
7
+ gem.authors = ["Garen Torikian"]
8
+ gem.email = ["gjtorikian@gmail.com"]
9
+ gem.description = %q{Robotstxt-Parser allows you to the check the accessibility of URLs and get other data. Full support for the robots.txt RFC, wildcards and Sitemap: rules.}
10
+ gem.summary = %q{Robotstxt-parser is an Ruby robots.txt file parser.}
11
+ gem.homepage = "https://github.com/gjtorikian/robotstxt-parser"
12
+ gem.license = "MIT"
13
+ gem.files = `git ls-files`.split($/)
14
+ gem.test_files = gem.files.grep(%r{^(text)/})
15
+ gem.require_paths = ["lib"]
16
+
17
+ gem.add_development_dependency "rake"
18
+ gem.add_development_dependency "fakeweb", '~> 1.3'
19
+ end
data/test/getter_test.rb ADDED
@@ -0,0 +1,74 @@
1
+ # -*- encoding: utf-8 -*-
2
+
3
+ $:.unshift(File.dirname(__FILE__) + '/../lib')
4
+
5
+ require 'rubygems'
6
+ require 'test/unit'
7
+ require 'robotstxt'
8
+ require 'fakeweb'
9
+
10
+ FakeWeb.allow_net_connect = false
11
+
12
+ class TestRobotstxt < Test::Unit::TestCase
13
+
14
+ def test_absense
15
+ FakeWeb.register_uri(:get, "http://example.com/robots.txt", :status => ["404", "Not found"])
16
+ assert true == Robotstxt.get_allowed?("http://example.com/index.html", "Google")
17
+ end
18
+
19
+ def test_error
20
+ FakeWeb.register_uri(:get, "http://example.com/robots.txt", :status => ["500", "Internal Server Error"])
21
+ assert true == Robotstxt.get_allowed?("http://example.com/index.html", "Google")
22
+ end
23
+
24
+ def test_unauthorized
25
+ FakeWeb.register_uri(:get, "http://example.com/robots.txt", :status => ["401", "Unauthorized"])
26
+ assert false == Robotstxt.get_allowed?("http://example.com/index.html", "Google")
27
+ end
28
+
29
+ def test_forbidden
30
+ FakeWeb.register_uri(:get, "http://example.com/robots.txt", :status => ["403", "Forbidden"])
31
+ assert false == Robotstxt.get_allowed?("http://example.com/index.html", "Google")
32
+ end
33
+
34
+ def test_uri_object
35
+ FakeWeb.register_uri(:get, "http://example.com/robots.txt", :body => "User-agent:*\nDisallow: /test")
36
+
37
+ robotstxt = Robotstxt.get(URI.parse("http://example.com/index.html"), "Google")
38
+
39
+ assert true == robotstxt.allowed?("/index.html")
40
+ assert false == robotstxt.allowed?("/test/index.html")
41
+ end
42
+
43
+ def test_existing_http_connection
44
+ FakeWeb.register_uri(:get, "http://example.com/robots.txt", :body => "User-agent:*\nDisallow: /test")
45
+
46
+ http = Net::HTTP.start("example.com", 80) do |http|
47
+ robotstxt = Robotstxt.get(http, "Google")
48
+ assert true == robotstxt.allowed?("/index.html")
49
+ assert false == robotstxt.allowed?("/test/index.html")
50
+ end
51
+ end
52
+
53
+ def test_redirects
54
+ FakeWeb.register_uri(:get, "http://example.com/robots.txt", :response => "HTTP/1.1 303 See Other\nLocation: http://www.exemplar.com/robots.txt\n\n")
55
+ FakeWeb.register_uri(:get, "http://www.exemplar.com/robots.txt", :body => "User-agent:*\nDisallow: /private")
56
+
57
+ robotstxt = Robotstxt.get("http://example.com/", "Google")
58
+
59
+ assert true == robotstxt.allowed?("/index.html")
60
+ assert false == robotstxt.allowed?("/private/index.html")
61
+ end
62
+
63
+ def test_encoding
64
+ # "User-agent: *\n Disallow: /encyclop@dia" where @ is the ae ligature (U+00E6)
65
+ FakeWeb.register_uri(:get, "http://example.com/robots.txt", :response => "HTTP/1.1 200 OK\nContent-type: text/plain; charset=utf-16\n\n" +
66
+ "\xff\xfeU\x00s\x00e\x00r\x00-\x00a\x00g\x00e\x00n\x00t\x00:\x00 \x00*\x00\n\x00D\x00i\x00s\x00a\x00l\x00l\x00o\x00w\x00:\x00 \x00/\x00e\x00n\x00c\x00y\x00c\x00l\x00o\x00p\x00\xe6\x00d\x00i\x00a\x00")
67
+ robotstxt = Robotstxt.get("http://example.com/#index", "Google")
68
+
69
+ assert true == robotstxt.allowed?("/index.html")
70
+ assert false == robotstxt.allowed?("/encyclop%c3%a6dia/index.html")
71
+
72
+ end
73
+
74
+ end
data/test/parser_test.rb ADDED
@@ -0,0 +1,114 @@
1
+ # -*- encoding: utf-8 -*-
2
+
3
+ $:.unshift(File.dirname(__FILE__) + '/../lib')
4
+
5
+ require 'test/unit'
6
+ require 'robotstxt'
7
+ require 'cgi'
8
+
9
+ class TestParser < Test::Unit::TestCase
10
+
11
+ def test_basics
12
+ client = Robotstxt::Parser.new("Test", <<-ROBOTS
13
+ User-agent: *
14
+ Disallow: /?*\t\t\t#comment
15
+ Disallow: /home
16
+ Disallow: /dashboard
17
+ Disallow: /terms-conditions
18
+ Disallow: /privacy-policy
19
+ Disallow: /index.php
20
+ Disallow: /chargify_system
21
+ Disallow: /test*
22
+ Disallow: /team* # comment
23
+ Disallow: /index
24
+ Allow: / # comment
25
+ Sitemap: http://example.com/sitemap.xml
26
+ ROBOTS
27
+ )
28
+ assert true == client.allowed?("/")
29
+ assert false == client.allowed?("/?")
30
+ assert false == client.allowed?("/?key=value")
31
+ assert true == client.allowed?("/example")
32
+ assert true == client.allowed?("/example/index.php")
33
+ assert false == client.allowed?("/test")
34
+ assert false == client.allowed?("/test/example")
35
+ assert false == client.allowed?("/team-game")
36
+ assert false == client.allowed?("/team-game/example")
37
+ assert ["http://example.com/sitemap.xml"] == client.sitemaps
38
+
39
+ end
40
+
41
+ def test_blank_disallow
42
+ google = Robotstxt::Parser.new("Google", <<-ROBOTSTXT
43
+ User-agent: *
44
+ Disallow:
45
+ ROBOTSTXT
46
+ )
47
+ assert true == google.allowed?("/")
48
+ assert true == google.allowed?("/index.html")
49
+ end
50
+
51
+ def test_url_escaping
52
+ google = Robotstxt::Parser.new("Google", <<-ROBOTSTXT
53
+ User-agent: *
54
+ Disallow: /test/
55
+ Disallow: /secret%2Fgarden/
56
+ Disallow: /%61lpha/
57
+ ROBOTSTXT
58
+ )
59
+ assert true == google.allowed?("/allowed/")
60
+ assert false == google.allowed?("/test/")
61
+ assert true == google.allowed?("/test%2Fetc/")
62
+ assert false == google.allowed?("/secret%2fgarden/")
63
+ assert true == google.allowed?("/secret/garden/")
64
+ assert false == google.allowed?("/alph%61/")
65
+ end
66
+
67
+ def test_trail_matching
68
+ google = Robotstxt::Parser.new("Google", <<-ROBOTSTXT
69
+ User-agent: *
70
+ #comments
71
+ Disallow: /*.pdf$
72
+ ROBOTSTXT
73
+ )
74
+ assert true == google.allowed?("/.pdfs/index.html")
75
+ assert false == google.allowed?("/.pdfs/index.pdf")
76
+ assert false == google.allowed?("/.pdfs/index.pdf?action=view")
77
+ assert false == google.allowed?("/.pdfs/index.html?download_as=.pdf")
78
+ end
79
+
80
+ def test_useragents
81
+ robotstxt = <<-ROBOTS
82
+ User-agent: Google
83
+ User-agent: Yahoo
84
+ Disallow:
85
+
86
+ User-agent: *
87
+ Disallow: /
88
+ ROBOTS
89
+ assert true == Robotstxt::Parser.new("Google", robotstxt).allowed?("/hello")
90
+ assert true == Robotstxt::Parser.new("Yahoo", robotstxt).allowed?("/hello")
91
+ assert false == Robotstxt::Parser.new("Bing", robotstxt).allowed?("/hello")
92
+ end
93
+
94
+ def test_missing_useragent
95
+ robotstxt = <<-ROBOTS
96
+ Disallow: /index
97
+ ROBOTS
98
+ assert true === Robotstxt::Parser.new("Google", robotstxt).allowed?("/hello")
99
+ assert false === Robotstxt::Parser.new("Google", robotstxt).allowed?("/index/wold")
100
+ end
101
+
102
+ def test_strange_newlines
103
+ robotstxt = "User-agent: *\r\r\rDisallow: *"
104
+ assert false === Robotstxt::Parser.new("Google", robotstxt).allowed?("/index/wold")
105
+ end
106
+
107
+ def test_bad_unicode
108
+ unless ENV['TRAVIS']
109
+ robotstxt = "User-agent: *\ndisallow: /?id=%C3%CB%D1%CA%A4%C5%D4%BB%C7%D5%B4%D5%E2%CD\n"
110
+ assert true === Robotstxt::Parser.new("Google", robotstxt).allowed?("/index/wold")
111
+ end
112
+ end
113
+
114
+ end
metadata ADDED
@@ -0,0 +1,86 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: robotstxt-parser
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Garen Torikian
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2014-09-18 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: rake
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ">="
18
+ - !ruby/object:Gem::Version
19
+ version: '0'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ">="
25
+ - !ruby/object:Gem::Version
26
+ version: '0'
27
+ - !ruby/object:Gem::Dependency
28
+ name: fakeweb
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '1.3'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '1.3'
41
+ description: 'Robotstxt-Parser allows you to the check the accessibility of URLs and
42
+ get other data. Full support for the robots.txt RFC, wildcards and Sitemap: rules.'
43
+ email:
44
+ - gjtorikian@gmail.com
45
+ executables: []
46
+ extensions: []
47
+ extra_rdoc_files: []
48
+ files:
49
+ - ".gitignore"
50
+ - ".travis.yml"
51
+ - Gemfile
52
+ - LICENSE.rdoc
53
+ - README.rdoc
54
+ - Rakefile
55
+ - lib/robotstxt.rb
56
+ - lib/robotstxt/common.rb
57
+ - lib/robotstxt/getter.rb
58
+ - lib/robotstxt/parser.rb
59
+ - robotstxt.gemspec
60
+ - test/getter_test.rb
61
+ - test/parser_test.rb
62
+ homepage: https://github.com/gjtorikian/robotstxt-parser
63
+ licenses:
64
+ - MIT
65
+ metadata: {}
66
+ post_install_message:
67
+ rdoc_options: []
68
+ require_paths:
69
+ - lib
70
+ required_ruby_version: !ruby/object:Gem::Requirement
71
+ requirements:
72
+ - - ">="
73
+ - !ruby/object:Gem::Version
74
+ version: '0'
75
+ required_rubygems_version: !ruby/object:Gem::Requirement
76
+ requirements:
77
+ - - ">="
78
+ - !ruby/object:Gem::Version
79
+ version: '0'
80
+ requirements: []
81
+ rubyforge_project:
82
+ rubygems_version: 2.2.2
83
+ signing_key:
84
+ specification_version: 4
85
+ summary: Robotstxt-parser is a Ruby robots.txt file parser.
86
+ test_files: []