robotstxt-parser 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: ab84cf493844dcbd92489c277344cf25746adff2
+   data.tar.gz: 285a77121e447cbec3f192eed1a3a7de03bc1bdc
+ SHA512:
+   metadata.gz: 567cff3966ac583e462b7e8ab337d27695f76a20a015ddb2ca42c8052823fb5cef03b80000974ce39b2bcbc42d4ad707a826dd87cd2b1468f02fa996a3a4dcee
+   data.tar.gz: 054054c786da1d87adc3f853c4cfa0fc6a24238c2dbd2cb47ce6d3e25babb2bb315e59625cfacfd2d97e2aaad55e39d116f24670ebb09eb1ccdc20f48af625cd
data/.gitignore ADDED
@@ -0,0 +1,26 @@
+ *.gem
+ *.rbc
+ .bundle
+ .config
+ coverage
+ InstalledFiles
+ lib/bundler/man
+ pkg
+ rdoc
+ spec/reports
+ test/tmp
+ test/version_tmp
+ tmp
+
+ Gemfile.lock
+ out/
+ sample.rb
+ run_sample.rb
+ src/
+ docs/
+
+ # YARD artifacts
+ .yardoc
+ _yardoc
+ doc/
+ .DS_Store
data/.travis.yml ADDED
@@ -0,0 +1,6 @@
+ language: ruby
+ rvm:
+   - 2.0
+
+ install:
+   - bundle install
data/Gemfile ADDED
@@ -0,0 +1,3 @@
+ source "http://rubygems.org"
+
+ gemspec
data/LICENSE.rdoc ADDED
@@ -0,0 +1,26 @@
+ = License
+
+ (The MIT License)
+
+ Copyright (c) 2010 Conrad Irwin <conrad@rapportive.com>
+ Copyright (c) 2009 Simone Rinzivillo <srinzivillo@gmail.com>
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
data/README.rdoc ADDED
@@ -0,0 +1,199 @@
+ = Robotstxt
+
+ Robotstxt is a Ruby robots.txt file parser.
+
+ The robots.txt exclusion protocol is a simple mechanism whereby site-owners can
+ guide any automated crawlers to relevant parts of their site, and prevent them
+ from accessing content which is intended only for other eyes. For more
+ information, see http://www.robotstxt.org/.
+
+ This library provides mechanisms for obtaining and parsing the robots.txt file
+ from websites. As there is no official "standard", it tries to do something
+ sensible, though inspiration was taken from:
+
+ - http://www.robotstxt.org/orig.html
+ - http://www.robotstxt.org/norobots-rfc.txt
+ - http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449&from=35237
+ - http://nikitathespider.com/articles/RobotsTxt.html
+
+ While the parsing semantics of this library are explained below, you should not
+ write robots.txt files that depend on all robots acting the same -- they simply
+ won't. Even the various Ruby libraries support very different subsets of
+ functionality.
+
+ This gem builds on the work of Simone Rinzivillo, and is released under the MIT
+ license -- see the LICENSE file.
+
+ == Usage
+
+ There are two public points of interest: the Robotstxt module and the
+ Robotstxt::Parser class.
+
+ The Robotstxt module has three public methods:
+
+ - Robotstxt.get source, user_agent, (options)
+   Returns a Robotstxt::Parser for the robots.txt obtained from source.
+
+ - Robotstxt.parse robots_txt, user_agent
+   Returns a Robotstxt::Parser for the robots.txt passed in.
+
+ - Robotstxt.get_allowed? urlish, user_agent, (options)
+   Returns true iff the robots.txt obtained from the host identified by the
+   urlish allows the given user agent access to the url.
+
+ The Robotstxt::Parser class contains two pieces of state: the user_agent and
+ the text of the robots.txt. In addition its instances have two public methods:
+
+ - Robotstxt::Parser#allowed? urlish
+   Returns true iff the robots.txt file allows this user_agent access to that
+   url.
+
+ - Robotstxt::Parser#sitemaps
+   Returns a list of the sitemaps listed in the robots.txt file.
+
+ In the above there are five kinds of parameter:
+
+ A "urlish" is either a String that represents a URL (suitable for passing to
+ URI.parse) or a URI object, i.e.
+
+   urlish = "http://www.example.com/"
+   urlish = "/index.html"
+   urlish = "https://compicat.ed/home?action=fire#joking"
+   urlish = URI.parse("http://example.co.uk")
+
+ A "source" is either a "urlish" or a Net::HTTP connection. This allows the
+ library to re-use the same connection when the server respects Keep-alive:
+ headers, i.e.
+
+   source = Net::HTTP.new("example.com", 80)
+   Net::HTTP.start("example.co.uk", 80) do |http|
+     source = http
+   end
+   source = "http://www.example.com/index.html"
+
+ When a "urlish" is provided, only the host and port sections are used, and
+ the path is forced to "/robots.txt".
+
+ A "robots_txt" is the textual content of a robots.txt file that is in the
+ same encoding as the urls you will be fetching (normally UTF-8).
+
+ A "user_agent" is the string value you use in your User-agent: header.
+
+ The "options" is an optional hash containing:
+
+   :num_redirects (5)  - the number of redirects to follow before giving up.
+   :http_timeout (10)  - the length of time in seconds to wait for one http
+                         request.
+   :url_charset (utf8) - the charset which you will use to encode your urls.
+
+ I recommend not passing the options unless you have to.
+
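+ If you do need them, they are passed as a trailing hash (an illustrative
+ sketch; the host, user-agent and option values here are arbitrary):
+
+   robots = Robotstxt.get("http://example.com/", "Crawler",
+                          :num_redirects => 2, :http_timeout => 5)
+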
+ == Examples
+
+   url = "http://example.com/index.html"
+   if Robotstxt.get_allowed?(url, "Crawler")
+     open(url)
+   end
+
+   Net::HTTP.start("example.co.uk") do |http|
+     robots = Robotstxt.get(http, "Crawler")
+
+     if robots.allowed? "/index.html"
+       http.get("/index.html")
+     elsif robots.allowed? "/index.php"
+       http.get("/index.php")
+     end
+   end
+
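+ You can also parse a robots.txt body you already have, and read any Sitemap:
+ entries it declares (a small sketch using Robotstxt.parse and
+ Robotstxt::Parser#sitemaps, as documented above):
+
+   robots_txt = "User-agent: *\nDisallow: /private\nSitemap: http://example.com/sitemap.xml"
+   robots = Robotstxt.parse(robots_txt, "Crawler")
+   robots.allowed?("/index.html")  # => true
+   robots.allowed?("/private/a")   # => false
+   robots.sitemaps                 # => ["http://example.com/sitemap.xml"]
+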
+ == Details
+
+ === Request level
+
+ This library handles different HTTP status codes according to the
+ specifications on robotstxt.org, in particular:
+
+ If an HTTPUnauthorized or an HTTPForbidden is returned when trying to access
+ /robots.txt, then the entire site should be considered "Disallowed".
+
+ If an HTTPRedirection is returned, it should be followed (though we give up
+ after five redirects, to avoid infinite loops).
+
+ If an HTTPSuccess is returned, the body is converted into UTF-8 and then parsed.
+
+ Any other response, or no response, indicates that there are no Disallowed urls
+ on the site.
+
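+ In other words, a 401 or 403 on /robots.txt is treated as though the file had
+ read "User-agent: *\nDisallow: /", and any other failure as though it had read
+ "User-agent: *\nDisallow:" (a sketch of the equivalence; "Crawler" is just a
+ placeholder user-agent):
+
+   Robotstxt.parse("User-agent: *\nDisallow: /\n", "Crawler").allowed?("/")  # => false
+   Robotstxt.parse("User-agent: *\nDisallow:\n", "Crawler").allowed?("/")    # => true
+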
+ === User-agent matching
+
+ This is case-insensitive, substring matching, i.e. equivalent to matching the
+ user agent with /.*thing.*/i.
+
+ Additionally, * characters are interpreted as meaning any number of any
+ character (in regular expression idiom: /.*/). Google implies that it does
+ this, at least for trailing *s, and the standard implies that "*" is a special
+ user agent meaning "everything not referred to so far".
+
+ There can be multiple User-agent: lines for each section of Allow: and
+ Disallow: lines in the robots.txt file:
+
+   User-agent: Google
+   User-agent: Bing
+   Disallow: /secret
+
+ In cases like this, all user-agents inherit the same set of rules.
+
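+ For example (an illustrative sketch; the user-agent strings are arbitrary):
+
+   robots_txt = "User-agent: SuperBot\nDisallow: /private\n"
+   Robotstxt.parse(robots_txt, "superbot/1.0").allowed?("/private/x")  # => false
+   Robotstxt.parse(robots_txt, "OtherBot").allowed?("/private/x")      # => true
+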
+ === Path matching
+
+ This is case-sensitive prefix matching, i.e. equivalent to matching the
+ requested path (or path + '?' + query) against /^thing.*/. As with user-agents,
+ * is interpreted as any number of any character.
+
+ Additionally, when the pattern ends with a $, it forces the pattern to match
+ the entire path (or path + '?' + query).
+
+ In order to get consistent results, before the globs are matched, the
+ %-encoding is normalised so that only /?&= remain %-encoded. For example,
+ /h%65llo/ is the same as /hello/, but /ac%2fdc is not the same as /ac/dc --
+ this is due to the significance granted to the / character in urls.
+
+ The paths of the first section that matched our user-agent (by order of
+ appearance in the file) are parsed in order of appearance. The first Allow: or
+ Disallow: rule that matches the url is accepted. This is prescribed by
+ robotstxt.org, but other parsers take wildly different strategies:
+
+ - Google checks all Allows: then all Disallows:
+ - Bing checks the most-specific rule first
+ - Others check all Disallows: then all Allows:
+
+ As is conventional, a "Disallow: " line with no path given is treated as
+ "Allow: *", and if a URL didn't match any path specifiers (or the user-agent
+ didn't match any user-agent sections) then that is implicit permission to
+ crawl.
+
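+ For example (an illustrative sketch; the behaviour shown matches the bundled
+ parser tests):
+
+   robots_txt = "User-agent: *\nDisallow: /*.pdf$\nDisallow: /%61lpha/\n"
+   robots = Robotstxt.parse(robots_txt, "Crawler")
+   robots.allowed?("/docs/manual.pdf")       # => false ($ anchors the match to the end)
+   robots.allowed?("/docs/manual.pdf.html")  # => true
+   robots.allowed?("/alpha/index.html")      # => false (%61 is normalised to "a")
+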
+ == TODO
+
+ I would like to add support for the Crawl-delay directive, and indeed any other
+ parameters in use.
+
+ == Requirements
+
+ * Ruby >= 1.8.7
+ * iconv, net/http and uri
+
+ == Installation
+
+ This library is intended to be installed via the
+ RubyGems[http://rubyforge.org/projects/rubygems/] system.
+
+   $ gem install robotstxt-parser
+
+ You might need administrator privileges on your system to install it.
+
+ == Author
+
+ Author:: Conrad Irwin <conrad@rapportive.com>
+ Author:: {Simone Rinzivillo}[http://www.simonerinzivillo.it/] <srinzivillo@gmail.com>
+
+ == License
+
+ Robotstxt is released under the MIT license.
+ Copyright (c) 2010 Conrad Irwin
+ Copyright (c) 2009 Simone Rinzivillo
+
data/Rakefile ADDED
@@ -0,0 +1,12 @@
+ require 'rake/testtask'
+
+ require 'bundler'
+ Bundler::GemHelper.install_tasks
+
+ Rake::TestTask.new do |t|
+   t.libs << "test"
+   t.test_files = FileList['test/*_test.rb']
+   t.verbose = true
+ end
+
+ task :default => [:test]
data/lib/robotstxt.rb ADDED
@@ -0,0 +1,93 @@
+ #
+ # = Ruby Robotstxt
+ #
+ # A Ruby robots.txt parser.
+ #
+ #
+ # Category:: Net
+ # Package::  Robotstxt
+ # Author::   Conrad Irwin <conrad@rapportive.com>, Simone Rinzivillo <srinzivillo@gmail.com>
+ # License::  MIT License
+ #
+ #--
+ #
+ #++
+
+ require 'robotstxt/common'
+ require 'robotstxt/parser'
+ require 'robotstxt/getter'
+
+ # Provides a flexible interface to help authors of web-crawlers
+ # respect the robots.txt exclusion standard.
+ #
+ module Robotstxt
+
+   NAME    = 'Robotstxt'
+   GEM     = 'robotstxt'
+   AUTHORS = ['Conrad Irwin <conrad@rapportive.com>', 'Simone Rinzivillo <srinzivillo@gmail.com>']
+   VERSION = '1.0'
+
+   # Obtains and parses a robots.txt file from the host identified by source;
+   # source can either be a URI, a string representing a URI, or a Net::HTTP
+   # connection associated with a host.
+   #
+   # The second parameter should be the user-agent header for your robot.
+   #
+   # There are currently three options:
+   #   :num_redirects (default 5) is the maximum number of HTTP 3** responses
+   #    the get() method will accept and follow the Location: header before
+   #    giving up.
+   #   :http_timeout (default 10) is the number of seconds to wait for each
+   #    request before giving up.
+   #   :url_charset (default "utf8") the character encoding you will use to
+   #    encode urls.
+   #
+   # As indicated by robotstxt.org, this library treats HTTPUnauthorized and
+   # HTTPForbidden as though the robots.txt file denied access to the entire
+   # site; all other HTTP responses or errors are treated as though the site
+   # allowed all access.
+   #
+   # The return value is a Robotstxt::Parser, which you can then interact with
+   # by calling .allowed? or .sitemaps, i.e.
+   #
+   #   Robotstxt.get("http://example.com/", "SuperRobot").allowed? "/index.html"
+   #
+   #   Net::HTTP.start("example.com") do |http|
+   #     if Robotstxt.get(http, "SuperRobot").allowed? "/index.html"
+   #       http.get("/index.html")
+   #     end
+   #   end
+   #
+   def self.get(source, robot_id, options={})
+     self.parse(Getter.new.obtain(source, robot_id, options), robot_id)
+   end
+
+   # Parses the contents of a robots.txt file for the given robot_id
+   #
+   # Returns a Robotstxt::Parser object with methods .allowed? and
+   # .sitemaps, i.e.
+   #
+   #   Robotstxt.parse("User-agent: *\nDisallow: /a", "SuperRobot").allowed? "/b"
+   #
+   def self.parse(robotstxt, robot_id)
+     Parser.new(robot_id, robotstxt)
+   end
+
+   # Gets a robots.txt file from the host identified by the uri
+   # (which can be a URI object or a string)
+   #
+   # Parses it for the given robot_id
+   # (which should be your user-agent)
+   #
+   # Returns true iff your robot can access said uri.
+   #
+   #   Robotstxt.get_allowed? "http://www.example.com/good", "SuperRobot"
+   #
+   def self.get_allowed?(uri, robot_id)
+     self.get(uri, robot_id).allowed? uri
+   end
+
+   def self.ultimate_scrubber(str)
+     str.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => '')
+   end
+ end
data/lib/robotstxt/common.rb ADDED
@@ -0,0 +1,25 @@
+ require 'uri'
+ require 'net/http'
+
+ module Robotstxt
+   module CommonMethods
+
+     protected
+
+     # Convert a URI or a String into a URI
+     def objectify_uri(uri)
+
+       if uri.is_a? String
+         # URI.parse will explode when given a character that it thinks
+         # shouldn't appear in uris. We thus escape them before passing the
+         # string into the function. Unfortunately URI.escape does not respect
+         # all characters that have meaning in HTTP (esp. #), so we are forced
+         # to state exactly which characters we would like to escape.
+         uri = URI.escape(uri, %r{[^!$#%&'()*+,\-./0-9:;=?@A-Z_a-z~]})
+         uri = URI.parse(uri)
+       else
+         uri
+       end
+
+     end
+   end
+ end
data/lib/robotstxt/getter.rb ADDED
@@ -0,0 +1,79 @@
+ module Robotstxt
+   class Getter
+     include CommonMethods
+
+     # Get the text of a robots.txt file from the given source, see #get.
+     def obtain(source, robot_id, options)
+       options = {
+         :num_redirects => 5,
+         :http_timeout => 10
+       }.merge(options)
+
+       robotstxt = if source.is_a? Net::HTTP
+         obtain_via_http(source, "/robots.txt", robot_id, options)
+       else
+         uri = objectify_uri(source)
+         http = Net::HTTP.new(uri.host, uri.port)
+         http.read_timeout = options[:http_timeout]
+         if uri.scheme == 'https'
+           http.use_ssl = true
+           http.verify_mode = OpenSSL::SSL::VERIFY_NONE
+         end
+         obtain_via_http(http, "/robots.txt", robot_id, options)
+       end
+     end
+
+     protected
+
+     # Recursively try to obtain robots.txt, following redirects and handling the
+     # various HTTP response codes as indicated on robotstxt.org
+     def obtain_via_http(http, uri, robot_id, options)
+       response = http.get(uri, {'User-Agent' => robot_id})
+
+       begin
+         case response
+         when Net::HTTPSuccess
+           decode_body(response)
+         when Net::HTTPRedirection
+           if options[:num_redirects] > 0 && response['location']
+             options[:num_redirects] -= 1
+             obtain(response['location'], robot_id, options)
+           else
+             all_allowed
+           end
+         when Net::HTTPUnauthorized
+           all_forbidden
+         when Net::HTTPForbidden
+           all_forbidden
+         else
+           all_allowed
+         end
+       rescue Timeout::Error #, StandardError
+         all_allowed
+       end
+
+     end
+
+     # A robots.txt body that forbids access to everywhere
+     def all_forbidden
+       "User-agent: *\nDisallow: /\n"
+     end
+
+     # A robots.txt body that allows access to everywhere
+     def all_allowed
+       "User-agent: *\nDisallow:\n"
+     end
+
+     # Decode the response's body according to the character encoding in the HTTP
+     # headers.
+     # In the case that we can't decode, Ruby's laissez-faire attitude to encoding
+     # should mean that we have a reasonable chance of working anyway.
+     def decode_body(response)
+       return nil if response.body.nil?
+       Robotstxt.ultimate_scrubber(response.body)
+     end
+
+   end
+ end
data/lib/robotstxt/parser.rb ADDED
@@ -0,0 +1,256 @@
+
+ module Robotstxt
+   # Parses robots.txt files for the perusal of a single user-agent.
+   #
+   # The behaviour implemented is guided by the following sources, though
+   # as there is no widely accepted standard, it may differ from other implementations.
+   # If you consider its behaviour to be in error, please contact the author.
+   #
+   # http://www.robotstxt.org/orig.html
+   #  - the original, now imprecise and outdated version
+   # http://www.robotstxt.org/norobots-rfc.txt
+   #  - a much more precise, but still outdated version
+   # http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449&from=35237
+   #  - a few hints at modern protocol extensions.
+   #
+   # This parser only considers lines starting with (case-insensitively):
+   #   Useragent: User-agent: Allow: Disallow: Sitemap:
+   #
+   # The file is divided into sections, each of which contains one or more User-agent:
+   # lines, followed by one or more Allow: or Disallow: rules.
+   #
+   # The first section that contains a User-agent: line that matches the robot's
+   # user-agent is the only section that is relevant to that robot. The sections are
+   # checked in the same order as they appear in the file.
+   #
+   # (The * character is taken to mean "any number of any characters" during matching of
+   # user-agents)
+   #
+   # Within that section, the first Allow: or Disallow: rule that matches the expression
+   # is taken as authoritative. If no rule in a section matches, the access is Allowed.
+   #
+   # (The order of matching is as in the RFC; Google matches all Allows and then all
+   # Disallows, while Bing matches the most specific rule; I'm sure there are other
+   # interpretations)
+   #
+   # When matching urls, all % encodings are normalised (except for /?=& which have meaning)
+   # and "*"s match any number of any character.
+   #
+   # If a pattern ends with a $, then the pattern must match the entire path, or the entire
+   # path with query string.
+   #
+   class Parser
+     include CommonMethods
+
+     # Gets every Sitemap mentioned in the body of the robots.txt file.
+     #
+     attr_reader :sitemaps
+
+     # Create a new parser for this user_agent and this robots.txt contents.
+     #
+     # This assumes that the robots.txt is ready-to-parse, in particular that
+     # it has been decoded as necessary, including removal of byte-order-marks et al.
+     #
+     # Not passing a body is deprecated, but retained for compatibility with clients
+     # written for version 0.5.4.
+     #
+     def initialize(user_agent, body)
+       @robot_id = user_agent
+       @found = true
+       parse(body) # set @body, @rules and @sitemaps
+     end
+
+     # Given a URI object, or a string representing one, determine whether this
+     # robots.txt would allow access to the path.
+     def allowed?(uri)
+
+       uri = objectify_uri(uri)
+       path = (uri.path || "/") + (uri.query ? '?' + uri.query : '')
+       path_allowed?(@robot_id, path)
+
+     end
+
+     protected
+
+     # Check whether the relative path (a string of the url's path and query
+     # string) is allowed by the rules we have for the given user_agent.
+     #
+     def path_allowed?(user_agent, path)
+
+       @rules.each do |(ua_glob, path_globs)|
+
+         if match_ua_glob user_agent, ua_glob
+           path_globs.each do |(path_glob, allowed)|
+             return allowed if match_path_glob path, path_glob
+           end
+           return true
+         end
+
+       end
+       true
+     end
+
+     # This does a case-insensitive substring match such that if the user agent
+     # is contained within the glob, or vice-versa, we will match.
+     #
+     # According to the standard, *s shouldn't appear in the user-agent field
+     # except in the case of "*" meaning all user agents. Google however imply
+     # that the * will work, at least at the end of a string.
+     #
+     # For consistency, and because it seems expected behaviour, and because
+     # a glob * will match a literal *, we use glob matching not string matching.
+     #
+     # The standard also advocates a substring match of the robot's user-agent
+     # within the user-agent field. From observation, it seems much more likely
+     # that the match will be the other way about, though we check for both.
+     #
+     def match_ua_glob(user_agent, glob)
+
+       glob =~ Regexp.new(Regexp.escape(user_agent), "i") ||
+         user_agent =~ Regexp.new(reify(glob), "i")
+
+     end
+
+     # This does case-sensitive prefix matching, such that if the path starts
+     # with the glob, we will match.
+     #
+     # According to the standard, that's it. However, it seems reasonably common
+     # for asterisks to be interpreted as though they were globs.
+     #
+     # Additionally, some search engines, like Google, will treat a trailing $
+     # sign as forcing the glob to match the entire path - whether including
+     # or excluding the query string is not clear, so we check both.
+     #
+     # (i.e. it seems likely that a site owner who has Disallow: *.pdf$ expects
+     # to disallow requests to *.pdf?i_can_haz_pdf, which the robot could, if
+     # it were feeling malicious, construe.)
+     #
+     # With URLs there is the additional complication that %-encoding can give
+     # multiple representations for identical URLs; this is handled by
+     # normalize_percent_encoding.
+     #
+     def match_path_glob(path, glob)
+
+       if glob =~ /\$$/
+         end_marker = '(?:\?|$)'
+         glob = glob.gsub /\$$/, ""
+       else
+         end_marker = ""
+       end
+
+       glob = Robotstxt.ultimate_scrubber normalize_percent_encoding(glob)
+       path = Robotstxt.ultimate_scrubber normalize_percent_encoding(path)
+
+       path =~ Regexp.new("^" + reify(glob) + end_marker)
+
+     # Some people encode bad UTF-8 in their robots.txt files, let us not behave badly.
+     rescue RegexpError
+       false
+     end
+
+     # As a general rule, we want to ignore different representations of the
+     # same URL. Naively we could just unescape, or escape, everything, however
+     # the standard implies that a / is a HTTP path separator, while a %2F is an
+     # encoded / that does not act as a path separator. Similar issues with ?, &
+     # and =, though all other characters are fine. (While : also has a special
+     # meaning in HTTP, most implementations ignore this in the path)
+     #
+     # It's also worth noting that %-encoding is case-insensitive, so we
+     # explicitly upcase the few that we want to keep.
+     #
+     def normalize_percent_encoding(path)
+
+       # First double-escape any characters we don't want to unescape
+       # & / = ?
+       path = path.gsub(/%(26|2F|3D|3F)/i) do |code|
+         "%25#{code.upcase}"
+       end
+
+       URI.unescape(path)
+
+     end
+
+     # Convert the asterisks in a glob into (.*)s for regular expressions,
+     # and at the same time, escape any other characters that would have
+     # a significance in a regex.
+     #
+     def reify(glob)
+       glob = Robotstxt.ultimate_scrubber(glob)
+
+       # -1 on a split prevents trailing empty strings from being deleted.
+       glob.split("*", -1).map{ |part| Regexp.escape(part) }.join(".*")
+
+     end
+
+     # Convert the @body into a set of @rules so that our parsing mechanism
+     # becomes easier.
+     #
+     # @rules is an array of pairs. The first in the pair is the glob for the
+     # user-agent and the second another array of pairs. The first of the new
+     # pair is a glob for the path, and the second whether it appears in an
+     # Allow: or a Disallow: rule.
+     #
+     # For example:
+     #
+     #   User-agent: *
+     #   Disallow: /secret/
+     #   Allow: / # allow everything...
+     #
+     # Would be parsed so that:
+     #
+     #   @rules = [["*", [ ["/secret/", false], ["/", true] ]]]
+     #
+     # The order of the arrays is maintained so that the first match in the file
+     # is obeyed as indicated by the pseudo-RFC on http://robotstxt.org/. There
+     # are alternative interpretations, some parse by specificity of glob, and
+     # some check Allow lines for any match before Disallow lines. All are
+     # justifiable, but we could only pick one.
+     #
+     # Note that a blank Disallow: should be treated as an Allow: * and multiple
+     # user-agents may share the same set of rules.
+     #
+     def parse(body)
+
+       @body = Robotstxt.ultimate_scrubber(body)
+       @rules = []
+       @sitemaps = []
+       parser_mode = :begin
+
+       @body.split(/[\r\n]+/).each do |line|
+         prefix, value = line.delete("\000").split(":", 2).map(&:strip)
+         value.sub! /\s+#.*/, '' if value
+
+         if prefix && value
+
+           case prefix.downcase
+           when /^user-?agent$/
+             if parser_mode == :user_agent
+               @rules << [value, @rules.last[1]]
+             else
+               parser_mode = :user_agent
+               @rules << [value, []]
+             end
+           when "disallow"
+             parser_mode = :rules
+             @rules << ["*", []] if @rules.empty?
+
+             if value == ""
+               @rules.last[1] << ["*", true]
+             else
+               @rules.last[1] << [value, false]
+             end
+           when "allow"
+             parser_mode = :rules
+             @rules << ["*", []] if @rules.empty?
+             @rules.last[1] << [value, true]
+           when "sitemap"
+             @sitemaps << value
+           else
+             # Ignore comments, Crawl-delay: and badly formed lines.
+           end
+         end
+       end
+     end
+   end
+ end
data/robotstxt.gemspec ADDED
@@ -0,0 +1,19 @@
+ # -*- encoding: utf-8 -*-
+ $:.push File.expand_path("../lib", __FILE__)
+
+ Gem::Specification.new do |gem|
+   gem.name          = "robotstxt-parser"
+   gem.version       = "0.1.0"
+   gem.authors       = ["Garen Torikian"]
+   gem.email         = ["gjtorikian@gmail.com"]
+   gem.description   = %q{Robotstxt-Parser allows you to check the accessibility of URLs and get other data. Full support for the robots.txt RFC, wildcards and Sitemap: rules.}
+   gem.summary       = %q{Robotstxt-parser is a Ruby robots.txt file parser.}
+   gem.homepage      = "https://github.com/gjtorikian/robotstxt-parser"
+   gem.license       = "MIT"
+   gem.files         = `git ls-files`.split($/)
+   gem.test_files    = gem.files.grep(%r{^(text)/})
+   gem.require_paths = ["lib"]
+
+   gem.add_development_dependency "rake"
+   gem.add_development_dependency "fakeweb", '~> 1.3'
+ end
data/test/getter_test.rb ADDED
@@ -0,0 +1,74 @@
+ # -*- encoding: utf-8 -*-
+
+ $:.unshift(File.dirname(__FILE__) + '/../lib')
+
+ require 'rubygems'
+ require 'test/unit'
+ require 'robotstxt'
+ require 'fakeweb'
+
+ FakeWeb.allow_net_connect = false
+
+ class TestRobotstxt < Test::Unit::TestCase
+
+   def test_absence
+     FakeWeb.register_uri(:get, "http://example.com/robots.txt", :status => ["404", "Not found"])
+     assert true == Robotstxt.get_allowed?("http://example.com/index.html", "Google")
+   end
+
+   def test_error
+     FakeWeb.register_uri(:get, "http://example.com/robots.txt", :status => ["500", "Internal Server Error"])
+     assert true == Robotstxt.get_allowed?("http://example.com/index.html", "Google")
+   end
+
+   def test_unauthorized
+     FakeWeb.register_uri(:get, "http://example.com/robots.txt", :status => ["401", "Unauthorized"])
+     assert false == Robotstxt.get_allowed?("http://example.com/index.html", "Google")
+   end
+
+   def test_forbidden
+     FakeWeb.register_uri(:get, "http://example.com/robots.txt", :status => ["403", "Forbidden"])
+     assert false == Robotstxt.get_allowed?("http://example.com/index.html", "Google")
+   end
+
+   def test_uri_object
+     FakeWeb.register_uri(:get, "http://example.com/robots.txt", :body => "User-agent:*\nDisallow: /test")
+
+     robotstxt = Robotstxt.get(URI.parse("http://example.com/index.html"), "Google")
+
+     assert true == robotstxt.allowed?("/index.html")
+     assert false == robotstxt.allowed?("/test/index.html")
+   end
+
+   def test_existing_http_connection
+     FakeWeb.register_uri(:get, "http://example.com/robots.txt", :body => "User-agent:*\nDisallow: /test")
+
+     Net::HTTP.start("example.com", 80) do |http|
+       robotstxt = Robotstxt.get(http, "Google")
+       assert true == robotstxt.allowed?("/index.html")
+       assert false == robotstxt.allowed?("/test/index.html")
+     end
+   end
+
+   def test_redirects
+     FakeWeb.register_uri(:get, "http://example.com/robots.txt", :response => "HTTP/1.1 303 See Other\nLocation: http://www.exemplar.com/robots.txt\n\n")
+     FakeWeb.register_uri(:get, "http://www.exemplar.com/robots.txt", :body => "User-agent:*\nDisallow: /private")
+
+     robotstxt = Robotstxt.get("http://example.com/", "Google")
+
+     assert true == robotstxt.allowed?("/index.html")
+     assert false == robotstxt.allowed?("/private/index.html")
+   end
+
+   def test_encoding
+     # "User-agent: *\n Disallow: /encyclop@dia" where @ is the ae ligature (U+00E6)
+     FakeWeb.register_uri(:get, "http://example.com/robots.txt", :response => "HTTP/1.1 200 OK\nContent-type: text/plain; charset=utf-16\n\n" +
+       "\xff\xfeU\x00s\x00e\x00r\x00-\x00a\x00g\x00e\x00n\x00t\x00:\x00 \x00*\x00\n\x00D\x00i\x00s\x00a\x00l\x00l\x00o\x00w\x00:\x00 \x00/\x00e\x00n\x00c\x00y\x00c\x00l\x00o\x00p\x00\xe6\x00d\x00i\x00a\x00")
+     robotstxt = Robotstxt.get("http://example.com/#index", "Google")
+
+     assert true == robotstxt.allowed?("/index.html")
+     assert false == robotstxt.allowed?("/encyclop%c3%a6dia/index.html")
+   end
+
+ end
data/test/parser_test.rb ADDED
@@ -0,0 +1,114 @@
+ # -*- encoding: utf-8 -*-
+
+ $:.unshift(File.dirname(__FILE__) + '/../lib')
+
+ require 'test/unit'
+ require 'robotstxt'
+ require 'cgi'
+
+ class TestParser < Test::Unit::TestCase
+
+   def test_basics
+     client = Robotstxt::Parser.new("Test", <<-ROBOTS
+       User-agent: *
+       Disallow: /?*\t\t\t#comment
+       Disallow: /home
+       Disallow: /dashboard
+       Disallow: /terms-conditions
+       Disallow: /privacy-policy
+       Disallow: /index.php
+       Disallow: /chargify_system
+       Disallow: /test*
+       Disallow: /team* # comment
+       Disallow: /index
+       Allow: / # comment
+       Sitemap: http://example.com/sitemap.xml
+     ROBOTS
+     )
+     assert true == client.allowed?("/")
+     assert false == client.allowed?("/?")
+     assert false == client.allowed?("/?key=value")
+     assert true == client.allowed?("/example")
+     assert true == client.allowed?("/example/index.php")
+     assert false == client.allowed?("/test")
+     assert false == client.allowed?("/test/example")
+     assert false == client.allowed?("/team-game")
+     assert false == client.allowed?("/team-game/example")
+     assert ["http://example.com/sitemap.xml"] == client.sitemaps
+   end
+
+   def test_blank_disallow
+     google = Robotstxt::Parser.new("Google", <<-ROBOTSTXT
+       User-agent: *
+       Disallow:
+     ROBOTSTXT
+     )
+     assert true == google.allowed?("/")
+     assert true == google.allowed?("/index.html")
+   end
+
+   def test_url_escaping
+     google = Robotstxt::Parser.new("Google", <<-ROBOTSTXT
+       User-agent: *
+       Disallow: /test/
+       Disallow: /secret%2Fgarden/
+       Disallow: /%61lpha/
+     ROBOTSTXT
+     )
+     assert true == google.allowed?("/allowed/")
+     assert false == google.allowed?("/test/")
+     assert true == google.allowed?("/test%2Fetc/")
+     assert false == google.allowed?("/secret%2fgarden/")
+     assert true == google.allowed?("/secret/garden/")
+     assert false == google.allowed?("/alph%61/")
+   end
+
+   def test_trail_matching
+     google = Robotstxt::Parser.new("Google", <<-ROBOTSTXT
+       User-agent: *
+       #comments
+       Disallow: /*.pdf$
+     ROBOTSTXT
+     )
+     assert true == google.allowed?("/.pdfs/index.html")
+     assert false == google.allowed?("/.pdfs/index.pdf")
+     assert false == google.allowed?("/.pdfs/index.pdf?action=view")
+     assert false == google.allowed?("/.pdfs/index.html?download_as=.pdf")
+   end
+
+   def test_useragents
+     robotstxt = <<-ROBOTS
+       User-agent: Google
+       User-agent: Yahoo
+       Disallow:
+
+       User-agent: *
+       Disallow: /
+     ROBOTS
+     assert true == Robotstxt::Parser.new("Google", robotstxt).allowed?("/hello")
+     assert true == Robotstxt::Parser.new("Yahoo", robotstxt).allowed?("/hello")
+     assert false == Robotstxt::Parser.new("Bing", robotstxt).allowed?("/hello")
+   end
+
+   def test_missing_useragent
+     robotstxt = <<-ROBOTS
+       Disallow: /index
+     ROBOTS
+     assert true === Robotstxt::Parser.new("Google", robotstxt).allowed?("/hello")
+     assert false === Robotstxt::Parser.new("Google", robotstxt).allowed?("/index/wold")
+   end
+
+   def test_strange_newlines
+     robotstxt = "User-agent: *\r\r\rDisallow: *"
+     assert false === Robotstxt::Parser.new("Google", robotstxt).allowed?("/index/wold")
+   end
+
+   def test_bad_unicode
+     unless ENV['TRAVIS']
+       robotstxt = "User-agent: *\ndisallow: /?id=%C3%CB%D1%CA%A4%C5%D4%BB%C7%D5%B4%D5%E2%CD\n"
+       assert true === Robotstxt::Parser.new("Google", robotstxt).allowed?("/index/wold")
+     end
+   end
+
+ end
metadata ADDED
@@ -0,0 +1,86 @@
+ --- !ruby/object:Gem::Specification
+ name: robotstxt-parser
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ platform: ruby
+ authors:
+ - Garen Torikian
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2014-09-18 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: fakeweb
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.3'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.3'
+ description: 'Robotstxt-Parser allows you to check the accessibility of URLs and get
+   other data. Full support for the robots.txt RFC, wildcards and Sitemap: rules.'
+ email:
+ - gjtorikian@gmail.com
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - ".gitignore"
+ - ".travis.yml"
+ - Gemfile
+ - LICENSE.rdoc
+ - README.rdoc
+ - Rakefile
+ - lib/robotstxt.rb
+ - lib/robotstxt/common.rb
+ - lib/robotstxt/getter.rb
+ - lib/robotstxt/parser.rb
+ - robotstxt.gemspec
+ - test/getter_test.rb
+ - test/parser_test.rb
+ homepage: https://github.com/gjtorikian/robotstxt-parser
+ licenses:
+ - MIT
+ metadata: {}
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 2.2.2
+ signing_key:
+ specification_version: 4
+ summary: Robotstxt-parser is a Ruby robots.txt file parser.
+ test_files: []