robotstxt-parser 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: ab84cf493844dcbd92489c277344cf25746adff2
4
+ data.tar.gz: 285a77121e447cbec3f192eed1a3a7de03bc1bdc
5
+ SHA512:
6
+ metadata.gz: 567cff3966ac583e462b7e8ab337d27695f76a20a015ddb2ca42c8052823fb5cef03b80000974ce39b2bcbc42d4ad707a826dd87cd2b1468f02fa996a3a4dcee
7
+ data.tar.gz: 054054c786da1d87adc3f853c4cfa0fc6a24238c2dbd2cb47ce6d3e25babb2bb315e59625cfacfd2d97e2aaad55e39d116f24670ebb09eb1ccdc20f48af625cd
data/.gitignore ADDED
@@ -0,0 +1,26 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ coverage
6
+ InstalledFiles
7
+ lib/bundler/man
8
+ pkg
9
+ rdoc
10
+ spec/reports
11
+ test/tmp
12
+ test/version_tmp
13
+ tmp
14
+
15
+ Gemfile.lock
16
+ out/
17
+ sample.rb
18
+ run_sample.rb
19
+ src/
20
+ docs/
21
+
22
+ # YARD artifacts
23
+ .yardoc
24
+ _yardoc
25
+ doc/
26
+ .DS_Store
data/.travis.yml ADDED
@@ -0,0 +1,6 @@
1
+ language: ruby
2
+ rvm:
3
+ - 2.0
4
+
5
+ install:
6
+ - bundle install
data/Gemfile ADDED
@@ -0,0 +1,3 @@
1
+ source "http://rubygems.org"
2
+
3
+ gemspec
data/LICENSE.rdoc ADDED
@@ -0,0 +1,26 @@
1
+ = License
2
+
3
+ (The MIT License)
4
+
5
+ Copyright (c) 2010 Conrad Irwin <conrad@rapportive.com>
6
+ Copyright (c) 2009 Simone Rinzivillo <srinzivillo@gmail.com>
7
+
8
+ Permission is hereby granted, free of charge, to any person obtaining
9
+ a copy of this software and associated documentation files (the
10
+ "Software"), to deal in the Software without restriction, including
11
+ without limitation the rights to use, copy, modify, merge, publish,
12
+ distribute, sublicense, and/or sell copies of the Software, and to
13
+ permit persons to whom the Software is furnished to do so, subject to
14
+ the following conditions:
15
+
16
+ The above copyright notice and this permission notice shall be
17
+ included in all copies or substantial portions of the Software.
18
+
19
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
20
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
21
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
22
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
23
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
24
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
25
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
26
+
data/README.rdoc ADDED
@@ -0,0 +1,199 @@
1
+ = Robotstxt
2
+
3
+ Robotstxt is a Ruby robots.txt file parser.
4
+
5
+ The robots.txt exclusion protocol is a simple mechanism whereby site-owners can guide
6
+ any automated crawlers to relevant parts of their site, and prevent them accessing content
7
+ which is intended only for other eyes. For more information, see http://www.robotstxt.org/.
8
+
9
+ This library provides mechanisms for obtaining and parsing the robots.txt file from
10
+ websites. As there is no official "standard" it tries to do something sensible,
11
+ though inspiration was taken from:
12
+
13
+ - http://www.robotstxt.org/orig.html
14
+ - http://www.robotstxt.org/norobots-rfc.txt
15
+ - http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449&from=35237
16
+ - http://nikitathespider.com/articles/RobotsTxt.html
17
+
18
+ While the parsing semantics of this library are explained below, you should not
19
+ write robots.txt files that depend on all robots acting the same -- they simply won't.
20
+ Even the various Ruby libraries support very different subsets of
21
+ functionality.
22
+
23
+ This gem builds on the work of Simone Rinzivillo, and is released under the MIT
24
+ license -- see the LICENSE.rdoc file.
25
+
26
+ == Usage
27
+
28
+ There are two public points of interest, firstly the Robotstxt module, and
29
+ secondly the Robotstxt::Parser class.
30
+
31
+ The Robotstxt module has three public methods:
32
+
33
+ - Robotstxt.get source, user_agent, (options)
34
+ Returns a Robotstxt::Parser for the robots.txt obtained from source.
35
+
36
+ - Robotstxt.parse robots_txt, user_agent
37
+ Returns a Robotstxt::Parser for the robots.txt passed in.
38
+
39
+ - Robotstxt.get_allowed? urlish, user_agent, (options)
40
+ Returns true iff the robots.txt obtained from the host identified by the
41
+ urlish allows the given user agent access to the url.
42
+
43
+ The Robotstxt::Parser class contains two pieces of state, the user_agent and the
44
+ text of the robots.txt. In addition its instances have two public methods:
45
+
46
+ - Robotstxt::Parser#allowed? urlish
47
+ Returns true iff the robots.txt file allows this user_agent access to that
48
+ url.
49
+
50
+ - Robotstxt::Parser#sitemaps
51
+ Returns a list of the sitemaps listed in the robots.txt file.
52
+
53
+ In the above there are five kinds of parameter:
54
+
55
+ A "urlish" is either a String that represents a URL (suitable for passing to
56
+ URI.parse) or a URI object, i.e.
57
+
58
+ urlish = "http://www.example.com/"
59
+ urlish = "/index.html"
60
+ urlish = "https://compicat.ed/home?action=fire#joking"
61
+ urlish = URI.parse("http://example.co.uk")
62
+
63
+ A "source" is either a "urlish", or a Net::HTTP connection. This allows the
64
+ library to re-use the same connection when the server respects Keep-alive:
65
+ headers, i.e.
66
+
67
+ source = Net::HTTP.new("example.com", 80)
68
+ Net::HTTP.start("example.co.uk", 80) do |http|
69
+ source = http
70
+ end
71
+ source = "http://www.example.com/index.html"
72
+
73
+ When a "urlish" is provided, only the host and port sections are used, and
74
+ the path is forced to "/robots.txt".
75
+
76
+ A "robots_txt" is the textual content of a robots.txt file that is in the
77
+ same encoding as the urls you will be fetching (normally utf8).
78
+
79
+ A "user_agent" is the string value you use in your User-agent: header.
80
+
81
+ The "options" is an optional hash containing
82
+ :num_redirects (5) - the number of redirects to follow before giving up.
83
+ :http_timeout (10) - the length of time in seconds to wait for one http
84
+ request
85
+ :url_charset (utf8) - the charset which you will use to encode your urls.
86
+
87
+ I recommend not passing the options unless you have to.
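+
+ If you do need to override them, here is a minimal sketch (example.com and the
+ "Crawler" user-agent are placeholders; the keys are the options documented above):
+
+   robots = Robotstxt.get("http://example.com/", "Crawler",
+                          :num_redirects => 3, :http_timeout => 5)
+   robots.allowed?("/index.html")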
88
+
89
+ == Examples
90
+
91
+ url = "http://example.com/index.html"
92
+ if Robotstxt.get_allowed?(url, "Crawler")
93
+ open(url)
94
+ end
95
+
96
+
97
+ Net::HTTP.start("example.co.uk") do |http|
98
+ robots = Robotstxt.get(http, "Crawler")
99
+
100
+ if robots.allowed? "/index.html"
101
+ http.get("/index.html")
102
+ elsif robots.allowed? "/index.php"
103
+ http.get("/index.php")
104
+ end
105
+ end
106
+
107
+ == Details
108
+
109
+ === Request level
110
+
111
+ This library handles different HTTP status codes according to the specifications
112
+ on robotstxt.org, in particular:
113
+
114
+ If an HTTPUnauthorized or an HTTPForbidden is returned when trying to access
115
+ /robots.txt, then the entire site should be considered "Disallowed".
116
+
117
+ If an HTTPRedirection is returned, it should be followed (though we give up
118
+ after five redirects, to avoid infinite loops).
119
+
120
+ If an HTTPSuccess is returned, the body is converted into utf8, and then parsed.
121
+
122
+ Any other response, or no response, indicates that there are no Disallowed urls
123
+ on the site.
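+
+ For illustration (a sketch only -- assume example.com is a host whose /robots.txt
+ returns 403 Forbidden, and "Crawler" is your user-agent):
+
+   Robotstxt.get_allowed?("http://example.com/index.html", "Crawler")
+   # => false -- the 403 means the whole site is treated as disallowed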
124
+
125
+ === User-agent matching
126
+
127
+ This is case-insensitive substring matching, i.e. equivalent to matching the
128
+ user agent with /.*thing.*/i.
129
+
130
+ Additionally, * characters are interpreted as meaning any number of any character (in
131
+ regular expression idiom: /.*/). Google implies that it does this, at least for
132
+ trailing *s, and the standard implies that "*" is a special user agent meaning
133
+ "everything not referred to so far".
134
+
135
+ There can be multiple User-agent: lines for each section of Allow: and Disallow:
136
+ lines in the robots.txt file:
137
+
138
+ User-agent: Google
139
+ User-agent: Bing
140
+ Disallow: /secret
141
+
142
+ In cases like this, all user-agents inherit the same set of rules.
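+
+ A minimal sketch of this behaviour, using Robotstxt.parse directly (the
+ user-agent names are placeholders):
+
+   robots_txt = "User-agent: Google\nUser-agent: Bing\nDisallow: /secret\n"
+   Robotstxt.parse(robots_txt, "Bing").allowed?("/secret/page")   # => false
+   Robotstxt.parse(robots_txt, "Other").allowed?("/secret/page")  # => true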
143
+
144
+ === Path matching
145
+
146
+ This is case-sensitive prefix matching, i.e. equivalent to matching the
147
+ requested path (or path + '?' + query) against /^thing.*/. As with user-agents,
148
+ * is interpreted as any number of any character.
149
+
150
+ Additionally, when the pattern ends with a $, it forces the pattern to match the
151
+ entire path (or path + '?' + query).
152
+
153
+ In order to get consistent results, before the globs are matched, the %-encoding
154
+ is normalised so that only /?&= remain %-encoded. For example, /h%65llo/ is the
155
+ same as /hello/, but /ac%2fdc is not the same as /ac/dc -- this is due to the
156
+ significance of / as a path separator in urls.
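+
+ A small sketch of that normalisation, again using Robotstxt.parse (the rules and
+ the "Crawler" user-agent are made up for illustration):
+
+   robots = Robotstxt.parse("User-agent: *\nDisallow: /hello/\nDisallow: /ac%2Fdc\n", "Crawler")
+   robots.allowed?("/h%65llo/world")  # => false ("%65" is unescaped to "e")
+   robots.allowed?("/ac/dc")          # => true  ("%2F" does not act as a path separator)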
157
+
158
+ The rules of the first section whose User-agent: line matches our user-agent (by
159
+ order of appearance in the file) are checked in order of appearance. The first Allow: or
160
+ Disallow: rule that matches the url is accepted. This is prescribed by
161
+ robotstxt.org, but other parsers take wildly different strategies:
162
+ - Google checks all Allow: rules, then all Disallow: rules
163
+ - Bing checks the most specific rule first
164
+ - Others check all Disallow: rules, then all Allow: rules
165
+
166
+ As is conventional, a "Disallow: " line with no path given is treated as
167
+ "Allow: *", and if a URL didn't match any path specifiers (or the user-agent
168
+ didn't match any user-agent sections) then that is implicit permission to crawl.
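+
+ A minimal sketch combining the rules above ("Crawler" and the Disallow: pattern
+ are made up for illustration):
+
+   robots = Robotstxt.parse("User-agent: *\nDisallow: /*.pdf$\n", "Crawler")
+   robots.allowed?("/guide.pdf")         # => false
+   robots.allowed?("/guide.pdf?view=1")  # => false (the $ also matches just before the query string)
+   robots.allowed?("/pdfs/index.html")   # => true
+   robots.sitemaps                       # => []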
169
+
170
+ == TODO
171
+
172
+ I would like to add support for the Crawl-delay directive, and indeed any other
173
+ parameters in use.
174
+
175
+ == Requirements
176
+
177
+ * Ruby >= 1.8.7
178
+ * iconv, net/http and uri
179
+
180
+ == Installation
181
+
182
+ This library is intended to be installed via the
183
+ RubyGems[http://rubyforge.org/projects/rubygems/] system.
184
+
185
+ $ gem install robotstxt-parser
186
+
187
+ You might need administrator privileges on your system to install it.
188
+
189
+ == Author
190
+
191
+ Author:: Conrad Irwin <conrad@rapportive.com>
192
+ Author:: {Simone Rinzivillo}[http://www.simonerinzivillo.it/] <srinzivillo@gmail.com>
193
+
194
+ == License
195
+
196
+ Robotstxt is released under the MIT license.
197
+ Copyright (c) 2010 Conrad Irwin
198
+ Copyright (c) 2009 Simone Rinzivillo
199
+
data/Rakefile ADDED
@@ -0,0 +1,12 @@
1
+ require 'rake/testtask'
2
+
3
+ require 'bundler'
4
+ Bundler::GemHelper.install_tasks
5
+
6
+ Rake::TestTask.new do |t|
7
+ t.libs << "test"
8
+ t.test_files = FileList['test/*_test.rb']
9
+ t.verbose = true
10
+ end
11
+
12
+ task :default => [:test]
data/lib/robotstxt.rb ADDED
@@ -0,0 +1,93 @@
1
+ #
2
+ # = Ruby Robotstxt
3
+ #
4
+ # A Ruby robots.txt parser.
5
+ #
6
+ #
7
+ # Category:: Net
8
+ # Package:: Robotstxt
9
+ # Author:: Conrad Irwin <conrad@rapportive.com>, Simone Rinzivillo <srinzivillo@gmail.com>
10
+ # License:: MIT License
11
+ #
12
+ #--
13
+ #
14
+ #++
15
+
16
+ require 'robotstxt/common'
17
+ require 'robotstxt/parser'
18
+ require 'robotstxt/getter'
19
+
20
+ # Provides a flexible interface to help authors of web-crawlers
21
+ # respect the robots.txt exclusion standard.
22
+ #
23
+ module Robotstxt
24
+
25
+ NAME = 'Robotstxt'
26
+ GEM = 'robotstxt'
27
+ AUTHORS = ['Conrad Irwin <conrad@rapportive.com>', 'Simone Rinzivillo <srinzivillo@gmail.com>']
28
+ VERSION = '1.0'
29
+
30
+ # Obtains and parses a robotstxt file from the host identified by source,
31
+ # source can either be a URI, a string representing a URI, or a Net::HTTP
32
+ # connection associated with a host.
33
+ #
34
+ # The second parameter should be the user-agent header for your robot.
35
+ #
36
+ # There are currently two options:
37
+ # :num_redirects (default 5) is the maximum number of HTTP 3** responses
38
+ # the get() method will accept and follow the Location: header before
39
+ # giving up.
40
+ # :http_timeout (default 10) is the number of seconds to wait for each
41
+ # request before giving up.
42
+ # :url_charset (default "utf8") the character encoding you will use to
43
+ # encode urls.
44
+ #
45
+ # As indicated by robotstxt.org, this library treats HTTPUnauthorized and
46
+ # HTTPForbidden as though the robots.txt file denied access to the entire
47
+ # site; all other HTTP responses or errors are treated as though the site
48
+ # allowed all access.
49
+ #
50
+ # The return value is a Robotstxt::Parser, which you can then interact with
51
+ # by calling .allowed? or .sitemaps. i.e.
52
+ #
53
+ # Robotstxt.get("http://example.com/", "SuperRobot").allowed? "/index.html"
54
+ #
55
+ # Net::HTTP.start("example.com") do |http|
56
+ # if Robotstxt.get(http, "SuperRobot").allowed? "/index.html"
57
+ # http.get("/index.html")
58
+ # end
59
+ # end
60
+ #
61
+ def self.get(source, robot_id, options={})
62
+ self.parse(Getter.new.obtain(source, robot_id, options), robot_id)
63
+ end
64
+
65
+ # Parses the contents of a robots.txt file for the given robot_id
66
+ #
67
+ # Returns a Robotstxt::Parser object with methods .allowed? and
68
+ # .sitemaps, i.e.
69
+ #
70
+ # Robotstxt.parse("User-agent: *\nDisallow: /a", "SuperRobot").allowed? "/b"
71
+ #
72
+ def self.parse(robotstxt, robot_id)
73
+ Parser.new(robot_id, robotstxt)
74
+ end
75
+
76
+ # Gets a robotstxt file from the host identified by the uri
77
+ # (which can be a URI object or a string)
78
+ #
79
+ # Parses it for the given robot_id
80
+ # (which should be your user-agent)
81
+ #
82
+ # Returns true iff your robot can access said uri.
83
+ #
84
+ # Robotstxt.get_allowed? "http://www.example.com/good", "SuperRobot"
85
+ #
86
+ def self.get_allowed?(uri, robot_id)
87
+ self.get(uri, robot_id).allowed? uri
88
+ end
89
+
90
+ def self.ultimate_scrubber(str)
91
+ str.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => '')
92
+ end
93
+ end
data/lib/robotstxt/common.rb ADDED
@@ -0,0 +1,25 @@
1
+ require 'uri'
2
+ require 'net/http'
3
+
4
+ module Robotstxt
5
+ module CommonMethods
6
+
7
+ protected
8
+ # Convert a URI or a String into a URI
9
+ def objectify_uri(uri)
10
+
11
+ if uri.is_a? String
12
+ # URI.parse will explode when given a character that it thinks
13
+ # shouldn't appear in uris. We thus escape them before passing the
14
+ # string into the function. Unfortunately URI.escape does not respect
15
+ # all characters that have meaning in HTTP (esp. #), so we are forced
16
+ # to state exactly which characters we would like to escape.
17
+ uri = URI.escape(uri, %r{[^!$#%&'()*+,\-./0-9:;=?@A-Z_a-z~]})
18
+ uri = URI.parse(uri)
19
+ else
20
+ uri
21
+ end
22
+
23
+ end
24
+ end
25
+ end
data/lib/robotstxt/getter.rb ADDED
@@ -0,0 +1,79 @@
1
+ module Robotstxt
2
+ class Getter
3
+ include CommonMethods
4
+
5
+ # Get the text of a robots.txt file from the given source, see #get.
6
+ def obtain(source, robot_id, options)
7
+ options = {
8
+ :num_redirects => 5,
9
+ :http_timeout => 10
10
+ }.merge(options)
11
+
12
+ robotstxt = if source.is_a? Net::HTTP
13
+ obtain_via_http(source, "/robots.txt", robot_id, options)
14
+ else
15
+ uri = objectify_uri(source)
16
+ http = Net::HTTP.new(uri.host, uri.port)
17
+ http.read_timeout = options[:http_timeout]
18
+ if uri.scheme == 'https'
19
+ http.use_ssl = true
20
+ http.verify_mode = OpenSSL::SSL::VERIFY_NONE
21
+ end
22
+ obtain_via_http(http, "/robots.txt", robot_id, options)
23
+ end
24
+ end
25
+
26
+ protected
27
+
28
+ # Recursively try to obtain robots.txt following redirects and handling the
29
+ # various HTTP response codes as indicated on robotstxt.org
30
+ def obtain_via_http(http, uri, robot_id, options)
31
+ response = http.get(uri, {'User-Agent' => robot_id})
32
+
33
+ begin
34
+ case response
35
+ when Net::HTTPSuccess
36
+ decode_body(response)
37
+ when Net::HTTPRedirection
38
+ if options[:num_redirects] > 0 && response['location']
39
+ options[:num_redirects] -= 1
40
+ obtain(response['location'], robot_id, options)
41
+ else
42
+ all_allowed
43
+ end
44
+ when Net::HTTPUnauthorized
45
+ all_forbidden
46
+ when Net::HTTPForbidden
47
+ all_forbidden
48
+ else
49
+ all_allowed
50
+ end
51
+ rescue Timeout::Error #, StandardError
52
+ all_allowed
53
+ end
54
+
55
+ end
56
+
57
+ # A robots.txt body that forbids access to everywhere
58
+ def all_forbidden
59
+ "User-agent: *\nDisallow: /\n"
60
+ end
61
+
62
+ # A robots.txt body that allows access to everywhere
63
+ def all_allowed
64
+ "User-agent: *\nDisallow:\n"
65
+ end
66
+
67
+ # Decode the response's body according to the character encoding in the HTTP
68
+ # headers.
69
+ # In the case that we can't decode, Ruby's laissez faire attitude to encoding
70
+ # should mean that we have a reasonable chance of working anyway.
71
+ def decode_body(response)
72
+ return nil if response.body.nil?
73
+ Robotstxt.ultimate_scrubber(response.body)
74
+ end
75
+
76
+
77
+ end
78
+
79
+ end
data/lib/robotstxt/parser.rb ADDED
@@ -0,0 +1,256 @@
1
+
2
+ module Robotstxt
3
+ # Parses robots.txt files for the perusal of a single user-agent.
4
+ #
5
+ # The behaviour implemented is guided by the following sources, though
6
+ # as there is no widely accepted standard, it may differ from other implementations.
7
+ # If you consider its behaviour to be in error, please contact the author.
8
+ #
9
+ # http://www.robotstxt.org/orig.html
10
+ # - the original, now imprecise and outdated version
11
+ # http://www.robotstxt.org/norobots-rfc.txt
12
+ # - a much more precise, outdated version
13
+ # http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449&from=35237
14
+ # - a few hints at modern protocol extensions.
15
+ #
16
+ # This parser only considers lines starting with (case-insensitively:)
17
+ # Useragent: User-agent: Allow: Disallow: Sitemap:
18
+ #
19
+ # The file is divided into sections, each of which contains one or more User-agent:
20
+ # lines, followed by one or more Allow: or Disallow: rules.
21
+ #
22
+ # The first section that contains a User-agent: line that matches the robot's
23
+ # user-agent, is the only section that relevent to that robot. The sections are checked
24
+ # in the same order as they appear in the file.
25
+ #
26
+ # (The * character is taken to mean "any number of any characters" during matching of
27
+ # user-agents)
28
+ #
29
+ # Within that section, the first Allow: or Disallow: rule that matches the expression
30
+ # is taken as authoritative. If no rule in a section matches, the access is Allowed.
31
+ #
32
+ # (The order of matching is as in the RFC, Google matches all Allows and then all Disallows,
33
+ # while Bing matches the most specific rule, I'm sure there are other interpretations)
34
+ #
35
+ # When matching urls, all % encodings are normalised (except for /?=& which have meaning)
36
+ # and "*"s match any number of any character.
37
+ #
38
+ # If a pattern ends with a $, then the pattern must match the entire path, or the entire
39
+ # path with query string.
40
+ #
41
+ class Parser
42
+ include CommonMethods
43
+
44
+ # Gets every Sitemap mentioned in the body of the robots.txt file.
45
+ #
46
+ attr_reader :sitemaps
47
+
48
+ # Create a new parser for this user_agent and this robots.txt contents.
49
+ #
50
+ # This assumes that the robots.txt is ready-to-parse, in particular that
51
+ # it has been decoded as necessary, including removal of byte-order-marks et.al.
52
+ #
53
+ # Not passing a body is deprecated, but retained for compatibility with clients
54
+ # written for version 0.5.4.
55
+ #
56
+ def initialize(user_agent, body)
57
+ @robot_id = user_agent
58
+ @found = true
59
+ parse(body) # set @body, @rules and @sitemaps
60
+ end
61
+
62
+ # Given a URI object, or a string representing one, determine whether this
63
+ # robots.txt would allow access to the path.
64
+ def allowed?(uri)
65
+
66
+ uri = objectify_uri(uri)
67
+ path = (uri.path || "/") + (uri.query ? '?' + uri.query : '')
68
+ path_allowed?(@robot_id, path)
69
+
70
+ end
71
+
72
+ protected
73
+
74
+ # Check whether the relative path (a string of the url's path and query
75
+ # string) is allowed by the rules we have for the given user_agent.
76
+ #
77
+ def path_allowed?(user_agent, path)
78
+
79
+ @rules.each do |(ua_glob, path_globs)|
80
+
81
+ if match_ua_glob user_agent, ua_glob
82
+ path_globs.each do |(path_glob, allowed)|
83
+ return allowed if match_path_glob path, path_glob
84
+ end
85
+ return true
86
+ end
87
+
88
+ end
89
+ true
90
+ end
91
+
92
+
93
+ # This does a case-insensitive substring match such that if the user agent
94
+ # is contained within the glob, or vice-versa, we will match.
95
+ #
96
+ # According to the standard, *s shouldn't appear in the user-agent field
97
+ # except in the case of "*" meaning all user agents. Google however imply
98
+ # that the * will work, at least at the end of a string.
99
+ #
100
+ # For consistency, and because it seems expected behaviour, and because
101
+ # a glob * will match a literal * we use glob matching not string matching.
102
+ #
103
+ # The standard also advocates a substring match of the robot's user-agent
104
+ # within the user-agent field. From observation, it seems much more likely
105
+ # that the match will be the other way about, though we check for both.
106
+ #
107
+ def match_ua_glob(user_agent, glob)
108
+
109
+ glob =~ Regexp.new(Regexp.escape(user_agent), "i") ||
110
+ user_agent =~ Regexp.new(reify(glob), "i")
111
+
112
+ end
113
+
114
+ # This does case-sensitive prefix matching, such that if the path starts
115
+ # with the glob, we will match.
116
+ #
117
+ # According to the standard, that's it. However, it seems reasonably common
118
+ # for asterkisks to be interpreted as though they were globs.
119
+ #
120
+ # Additionally, some search engines, like Google, will treat a trailing $
121
+ # sign as forcing the glob to match the entire path - whether including
122
+ # or excluding the query string is not clear, so we check both.
123
+ #
124
+ # (i.e. it seems likely that a site owner who has Disallow: *.pdf$ expects
125
+ # to disallow requests to *.pdf?i_can_haz_pdf, which the robot could, if
126
+ # it were feeling malicious, construe.)
127
+ #
128
+ # With URLs there is the additional complication that %-encoding can give
129
+ # multiple representations for identical URLs, this is handled by
130
+ # normalize_percent_encoding.
131
+ #
132
+ def match_path_glob(path, glob)
133
+
134
+ if glob =~ /\$$/
135
+ end_marker = '(?:\?|$)'
136
+ glob = glob.gsub /\$$/, ""
137
+ else
138
+ end_marker = ""
139
+ end
140
+
141
+ glob = Robotstxt.ultimate_scrubber normalize_percent_encoding(glob)
142
+ path = Robotstxt.ultimate_scrubber normalize_percent_encoding(path)
143
+
144
+ path =~ Regexp.new("^" + reify(glob) + end_marker)
145
+
146
+ # Some people encode bad UTF-8 in their robots.txt files, let us not behave badly.
147
+ rescue RegexpError
148
+ false
149
+ end
150
+
151
+ # As a general rule, we want to ignore different representations of the
152
+ # same URL. Naively we could just unescape, or escape, everything, however
153
+ # the standard implies that a / is a HTTP path separator, while a %2F is an
154
+ # encoded / that does not act as a path separator. Similar issues with ?, &
155
+ # and =, though all other characters are fine. (While : also has a special
156
+ # meaning in HTTP, most implementations ignore this in the path)
157
+ #
158
+ # It's also worth noting that %-encoding is case-insensitive, so we
159
+ # explicitly upcase the few that we want to keep.
160
+ #
161
+ def normalize_percent_encoding(path)
162
+
163
+ # First double-escape any characters we don't want to unescape
164
+ # & / = ?
165
+ path = path.gsub(/%(26|2F|3D|3F)/i) do |code|
166
+ "%25#{code.upcase}"
167
+ end
168
+
169
+ URI.unescape(path)
170
+
171
+ end
172
+
173
+ # Convert the asterisks in a glob into (.*)s for regular expressions,
174
+ # and at the same time, escape any other characters that would have
175
+ # a significance in a regex.
176
+ #
177
+ def reify(glob)
178
+ glob = Robotstxt.ultimate_scrubber(glob)
179
+
180
+ # -1 on a split prevents trailing empty strings from being deleted.
181
+ glob.split("*", -1).map{ |part| Regexp.escape(part) }.join(".*")
182
+
183
+ end
184
+
185
+ # Convert the @body into a set of @rules so that our parsing mechanism
186
+ # becomes easier.
187
+ #
188
+ # @rules is an array of pairs. The first in the pair is the glob for the
189
+ # user-agent and the second another array of pairs. The first of the new
190
+ # pair is a glob for the path, and the second whether it appears in an
191
+ # Allow: or a Disallow: rule.
192
+ #
193
+ # For example:
194
+ #
195
+ # User-agent: *
196
+ # Disallow: /secret/
197
+ # Allow: / # allow everything...
198
+ #
199
+ # Would be parsed so that:
200
+ #
201
+ # @rules = [["*", [ ["/secret/", false], ["/", true] ]]]
202
+ #
203
+ #
204
+ # The order of the arrays is maintained so that the first match in the file
205
+ # is obeyed as indicated by the pseudo-RFC on http://robotstxt.org/. There
206
+ # are alternative interpretations, some parse by speicifity of glob, and
207
+ # some check Allow lines for any match before Disallow lines. All are
208
+ # justifiable, but we could only pick one.
209
+ #
210
+ # Note that a blank Disallow: should be treated as an Allow: * and multiple
211
+ # user-agents may share the same set of rules.
212
+ #
213
+ def parse(body)
214
+
215
+ @body = Robotstxt.ultimate_scrubber(body)
216
+ @rules = []
217
+ @sitemaps = []
218
+
219
+ body.split(/[\r\n]+/).each do |line|
220
+ prefix, value = line.delete("\000").split(":", 2).map(&:strip)
221
+ value.sub! /\s+#.*/, '' if value
222
+ parser_mode = :begin
223
+
224
+ if prefix && value
225
+
226
+ case prefix.downcase
227
+ when /^user-?agent$/
228
+ if parser_mode == :user_agent
229
+ @rules << [value, @rules.last[1]]
230
+ else
231
+ parser_mode = :user_agent
232
+ @rules << [value, []]
233
+ end
234
+ when "disallow"
235
+ parser_mode = :rules
236
+ @rules << ["*", []] if @rules.empty?
237
+
238
+ if value == ""
239
+ @rules.last[1] << ["*", true]
240
+ else
241
+ @rules.last[1] << [value, false]
242
+ end
243
+ when "allow"
244
+ parser_mode = :rules
245
+ @rules << ["*", []] if @rules.empty?
246
+ @rules.last[1] << [value, true]
247
+ when "sitemap"
248
+ @sitemaps << value
249
+ else
250
+ # Ignore comments, Crawl-delay: and badly formed lines.
251
+ end
252
+ end
253
+ end
254
+ end
255
+ end
256
+ end
data/robotstxt.gemspec ADDED
@@ -0,0 +1,19 @@
1
+ # -*- encoding: utf-8 -*-
2
+ $:.push File.expand_path("../lib", __FILE__)
3
+
4
+ Gem::Specification.new do |gem|
5
+ gem.name = "robotstxt-parser"
6
+ gem.version = "0.1.0"
7
+ gem.authors = ["Garen Torikian"]
8
+ gem.email = ["gjtorikian@gmail.com"]
9
+ gem.description = %q{Robotstxt-Parser allows you to the check the accessibility of URLs and get other data. Full support for the robots.txt RFC, wildcards and Sitemap: rules.}
10
+ gem.summary = %q{Robotstxt-parser is an Ruby robots.txt file parser.}
11
+ gem.homepage = "https://github.com/gjtorikian/robotstxt-parser"
12
+ gem.license = "MIT"
13
+ gem.files = `git ls-files`.split($/)
14
+ gem.test_files = gem.files.grep(%r{^(text)/})
15
+ gem.require_paths = ["lib"]
16
+
17
+ gem.add_development_dependency "rake"
18
+ gem.add_development_dependency "fakeweb", '~> 1.3'
19
+ end
data/test/getter_test.rb ADDED
@@ -0,0 +1,74 @@
1
+ # -*- encoding: utf-8 -*-
2
+
3
+ $:.unshift(File.dirname(__FILE__) + '/../lib')
4
+
5
+ require 'rubygems'
6
+ require 'test/unit'
7
+ require 'robotstxt'
8
+ require 'fakeweb'
9
+
10
+ FakeWeb.allow_net_connect = false
11
+
12
+ class TestRobotstxt < Test::Unit::TestCase
13
+
14
+ def test_absense
15
+ FakeWeb.register_uri(:get, "http://example.com/robots.txt", :status => ["404", "Not found"])
16
+ assert true == Robotstxt.get_allowed?("http://example.com/index.html", "Google")
17
+ end
18
+
19
+ def test_error
20
+ FakeWeb.register_uri(:get, "http://example.com/robots.txt", :status => ["500", "Internal Server Error"])
21
+ assert true == Robotstxt.get_allowed?("http://example.com/index.html", "Google")
22
+ end
23
+
24
+ def test_unauthorized
25
+ FakeWeb.register_uri(:get, "http://example.com/robots.txt", :status => ["401", "Unauthorized"])
26
+ assert false == Robotstxt.get_allowed?("http://example.com/index.html", "Google")
27
+ end
28
+
29
+ def test_forbidden
30
+ FakeWeb.register_uri(:get, "http://example.com/robots.txt", :status => ["403", "Forbidden"])
31
+ assert false == Robotstxt.get_allowed?("http://example.com/index.html", "Google")
32
+ end
33
+
34
+ def test_uri_object
35
+ FakeWeb.register_uri(:get, "http://example.com/robots.txt", :body => "User-agent:*\nDisallow: /test")
36
+
37
+ robotstxt = Robotstxt.get(URI.parse("http://example.com/index.html"), "Google")
38
+
39
+ assert true == robotstxt.allowed?("/index.html")
40
+ assert false == robotstxt.allowed?("/test/index.html")
41
+ end
42
+
43
+ def test_existing_http_connection
44
+ FakeWeb.register_uri(:get, "http://example.com/robots.txt", :body => "User-agent:*\nDisallow: /test")
45
+
46
+ http = Net::HTTP.start("example.com", 80) do |http|
47
+ robotstxt = Robotstxt.get(http, "Google")
48
+ assert true == robotstxt.allowed?("/index.html")
49
+ assert false == robotstxt.allowed?("/test/index.html")
50
+ end
51
+ end
52
+
53
+ def test_redirects
54
+ FakeWeb.register_uri(:get, "http://example.com/robots.txt", :response => "HTTP/1.1 303 See Other\nLocation: http://www.exemplar.com/robots.txt\n\n")
55
+ FakeWeb.register_uri(:get, "http://www.exemplar.com/robots.txt", :body => "User-agent:*\nDisallow: /private")
56
+
57
+ robotstxt = Robotstxt.get("http://example.com/", "Google")
58
+
59
+ assert true == robotstxt.allowed?("/index.html")
60
+ assert false == robotstxt.allowed?("/private/index.html")
61
+ end
62
+
63
+ def test_encoding
64
+ # "User-agent: *\n Disallow: /encyclop@dia" where @ is the ae ligature (U+00E6)
65
+ FakeWeb.register_uri(:get, "http://example.com/robots.txt", :response => "HTTP/1.1 200 OK\nContent-type: text/plain; charset=utf-16\n\n" +
66
+ "\xff\xfeU\x00s\x00e\x00r\x00-\x00a\x00g\x00e\x00n\x00t\x00:\x00 \x00*\x00\n\x00D\x00i\x00s\x00a\x00l\x00l\x00o\x00w\x00:\x00 \x00/\x00e\x00n\x00c\x00y\x00c\x00l\x00o\x00p\x00\xe6\x00d\x00i\x00a\x00")
67
+ robotstxt = Robotstxt.get("http://example.com/#index", "Google")
68
+
69
+ assert true == robotstxt.allowed?("/index.html")
70
+ assert false == robotstxt.allowed?("/encyclop%c3%a6dia/index.html")
71
+
72
+ end
73
+
74
+ end
data/test/parser_test.rb ADDED
@@ -0,0 +1,114 @@
1
+ # -*- encoding: utf-8 -*-
2
+
3
+ $:.unshift(File.dirname(__FILE__) + '/../lib')
4
+
5
+ require 'test/unit'
6
+ require 'robotstxt'
7
+ require 'cgi'
8
+
9
+ class TestParser < Test::Unit::TestCase
10
+
11
+ def test_basics
12
+ client = Robotstxt::Parser.new("Test", <<-ROBOTS
13
+ User-agent: *
14
+ Disallow: /?*\t\t\t#comment
15
+ Disallow: /home
16
+ Disallow: /dashboard
17
+ Disallow: /terms-conditions
18
+ Disallow: /privacy-policy
19
+ Disallow: /index.php
20
+ Disallow: /chargify_system
21
+ Disallow: /test*
22
+ Disallow: /team* # comment
23
+ Disallow: /index
24
+ Allow: / # comment
25
+ Sitemap: http://example.com/sitemap.xml
26
+ ROBOTS
27
+ )
28
+ assert true == client.allowed?("/")
29
+ assert false == client.allowed?("/?")
30
+ assert false == client.allowed?("/?key=value")
31
+ assert true == client.allowed?("/example")
32
+ assert true == client.allowed?("/example/index.php")
33
+ assert false == client.allowed?("/test")
34
+ assert false == client.allowed?("/test/example")
35
+ assert false == client.allowed?("/team-game")
36
+ assert false == client.allowed?("/team-game/example")
37
+ assert ["http://example.com/sitemap.xml"] == client.sitemaps
38
+
39
+ end
40
+
41
+ def test_blank_disallow
42
+ google = Robotstxt::Parser.new("Google", <<-ROBOTSTXT
43
+ User-agent: *
44
+ Disallow:
45
+ ROBOTSTXT
46
+ )
47
+ assert true == google.allowed?("/")
48
+ assert true == google.allowed?("/index.html")
49
+ end
50
+
51
+ def test_url_escaping
52
+ google = Robotstxt::Parser.new("Google", <<-ROBOTSTXT
53
+ User-agent: *
54
+ Disallow: /test/
55
+ Disallow: /secret%2Fgarden/
56
+ Disallow: /%61lpha/
57
+ ROBOTSTXT
58
+ )
59
+ assert true == google.allowed?("/allowed/")
60
+ assert false == google.allowed?("/test/")
61
+ assert true == google.allowed?("/test%2Fetc/")
62
+ assert false == google.allowed?("/secret%2fgarden/")
63
+ assert true == google.allowed?("/secret/garden/")
64
+ assert false == google.allowed?("/alph%61/")
65
+ end
66
+
67
+ def test_trail_matching
68
+ google = Robotstxt::Parser.new("Google", <<-ROBOTSTXT
69
+ User-agent: *
70
+ #comments
71
+ Disallow: /*.pdf$
72
+ ROBOTSTXT
73
+ )
74
+ assert true == google.allowed?("/.pdfs/index.html")
75
+ assert false == google.allowed?("/.pdfs/index.pdf")
76
+ assert false == google.allowed?("/.pdfs/index.pdf?action=view")
77
+ assert false == google.allowed?("/.pdfs/index.html?download_as=.pdf")
78
+ end
79
+
80
+ def test_useragents
81
+ robotstxt = <<-ROBOTS
82
+ User-agent: Google
83
+ User-agent: Yahoo
84
+ Disallow:
85
+
86
+ User-agent: *
87
+ Disallow: /
88
+ ROBOTS
89
+ assert true == Robotstxt::Parser.new("Google", robotstxt).allowed?("/hello")
90
+ assert true == Robotstxt::Parser.new("Yahoo", robotstxt).allowed?("/hello")
91
+ assert false == Robotstxt::Parser.new("Bing", robotstxt).allowed?("/hello")
92
+ end
93
+
94
+ def test_missing_useragent
95
+ robotstxt = <<-ROBOTS
96
+ Disallow: /index
97
+ ROBOTS
98
+ assert true === Robotstxt::Parser.new("Google", robotstxt).allowed?("/hello")
99
+ assert false === Robotstxt::Parser.new("Google", robotstxt).allowed?("/index/wold")
100
+ end
101
+
102
+ def test_strange_newlines
103
+ robotstxt = "User-agent: *\r\r\rDisallow: *"
104
+ assert false === Robotstxt::Parser.new("Google", robotstxt).allowed?("/index/wold")
105
+ end
106
+
107
+ def test_bad_unicode
108
+ unless ENV['TRAVIS']
109
+ robotstxt = "User-agent: *\ndisallow: /?id=%C3%CB%D1%CA%A4%C5%D4%BB%C7%D5%B4%D5%E2%CD\n"
110
+ assert true === Robotstxt::Parser.new("Google", robotstxt).allowed?("/index/wold")
111
+ end
112
+ end
113
+
114
+ end
metadata ADDED
@@ -0,0 +1,86 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: robotstxt-parser
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Garen Torikian
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2014-09-18 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: rake
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ">="
18
+ - !ruby/object:Gem::Version
19
+ version: '0'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ">="
25
+ - !ruby/object:Gem::Version
26
+ version: '0'
27
+ - !ruby/object:Gem::Dependency
28
+ name: fakeweb
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '1.3'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '1.3'
41
+ description: 'Robotstxt-Parser allows you to the check the accessibility of URLs and
42
+ get other data. Full support for the robots.txt RFC, wildcards and Sitemap: rules.'
43
+ email:
44
+ - gjtorikian@gmail.com
45
+ executables: []
46
+ extensions: []
47
+ extra_rdoc_files: []
48
+ files:
49
+ - ".gitignore"
50
+ - ".travis.yml"
51
+ - Gemfile
52
+ - LICENSE.rdoc
53
+ - README.rdoc
54
+ - Rakefile
55
+ - lib/robotstxt.rb
56
+ - lib/robotstxt/common.rb
57
+ - lib/robotstxt/getter.rb
58
+ - lib/robotstxt/parser.rb
59
+ - robotstxt.gemspec
60
+ - test/getter_test.rb
61
+ - test/parser_test.rb
62
+ homepage: https://github.com/gjtorikian/robotstxt-parser
63
+ licenses:
64
+ - MIT
65
+ metadata: {}
66
+ post_install_message:
67
+ rdoc_options: []
68
+ require_paths:
69
+ - lib
70
+ required_ruby_version: !ruby/object:Gem::Requirement
71
+ requirements:
72
+ - - ">="
73
+ - !ruby/object:Gem::Version
74
+ version: '0'
75
+ required_rubygems_version: !ruby/object:Gem::Requirement
76
+ requirements:
77
+ - - ">="
78
+ - !ruby/object:Gem::Version
79
+ version: '0'
80
+ requirements: []
81
+ rubyforge_project:
82
+ rubygems_version: 2.2.2
83
+ signing_key:
84
+ specification_version: 4
85
+ summary: Robotstxt-parser is a Ruby robots.txt file parser.
86
+ test_files: []