google_robotstxt_parser 0.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Network Working Group                                          M. Koster
Internet-Draft                                Stalworthy Computing, Ltd.
Intended status: Draft Standard                                G. Illyes
Expires: January 9, 2020                                       H. Zeller
                                                               L. Harvey
                                                                  Google
                                                           July 07, 2019


                       Robots Exclusion Protocol
                          draft-koster-rep-00

Abstract

   This document standardizes and extends the "Robots Exclusion
   Protocol" <http://www.robotstxt.org/> method originally defined by
   Martijn Koster in 1996 for service owners to control how content
   served by their services may be accessed, if at all, by automatic
   clients known as crawlers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This document may not be modified, and derivative works of it may not
   be created, except to format it for publication as an RFC or to
   translate it into languages other than English.

   This Internet-Draft will expire on January 9, 2020.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.
Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Specification
     2.1.  Protocol definition
     2.2.  Formal syntax
       2.2.1.  The user-agent line
       2.2.2.  The Allow and Disallow lines
       2.2.3.  Special characters
       2.2.4.  Other records
     2.3.  Access method
       2.3.1.  Access results
     2.4.  Caching
     2.5.  Limits
     2.6.  Security Considerations
     2.7.  IANA Considerations
   3.  Examples
     3.1.  Simple example
     3.2.  Longest Match
   4.  References
     4.1.  Normative References
     4.2.  URIs
   Authors' Addresses
1.  Introduction

   This document applies to services that provide resources that clients
   can access through URIs as defined in RFC3986 [1].  For example, in
   the context of HTTP, a browser is a client that displays the content
   of a web page.

   Crawlers are automated clients.  Search engines, for instance, have
   crawlers to recursively traverse links for indexing, as defined in
   RFC8288 [2].

   It may be inconvenient for service owners if crawlers visit the
   entirety of their URI space.  This document specifies the rules that
   crawlers MUST obey when accessing URIs.

   These rules are not a form of access authorization.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.
2.  Specification

2.1.  Protocol definition

   The protocol language consists of rule(s) and group(s):

   o  *Rule*: A line with a key-value pair that defines how a crawler
      may access URIs.  See section The Allow and Disallow lines.

   o  *Group*: One or more user-agent lines that are followed by one or
      more rules.  The group is terminated by a user-agent line or end
      of file.  See The user-agent line.  The last group may have no
      rules, which means it implicitly allows everything.
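
   As a non-normative illustration, these two notions map onto two
   small record types.  The sketch below is Python; the type names are
   ours, not part of the protocol.

   <CODE BEGINS>
   # Non-normative sketch of the protocol data model.
   from dataclasses import dataclass, field
   from typing import List

   @dataclass
   class Rule:
       directive: str     # "allow" or "disallow"
       path_pattern: str  # e.g. "/example/", or "" for an empty pattern

   @dataclass
   class Group:
       # One or more user-agent lines followed by zero or more rules.
       user_agents: List[str] = field(default_factory=list)
       rules: List[Rule] = field(default_factory=list)  # empty: allows
                                                        # everything
   <CODE ENDS>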

2.2.  Formal syntax

   Below is an Augmented Backus-Naur Form (ABNF) description, as
   described in RFC5234 [3].

   robotstxt = *(group / emptyline)
   group = startgroupline                ; We start with a user-agent
           *(startgroupline / emptyline) ; ... and possibly more
                                         ; user-agents
           *(rule / emptyline)           ; followed by rules relevant
                                         ; for UAs

   startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

   rule = *WS ("allow" / "disallow") *WS ":"
          *WS (path-pattern / empty-pattern) EOL

   ; parser implementors: add additional lines you need (for
   ; example Sitemaps), and be lenient when reading lines that don't
   ; conform.  Apply Postel's law.

   product-token = identifier / "*"
   path-pattern = "/" *(UTF8-char-noctl) ; valid URI path pattern
   empty-pattern = *WS

   identifier = 1*(%x2d / %x41-5a / %x5f / %x61-7a)
   comment = "#" *(UTF8-char-noctl / WS / "#")
   emptyline = EOL
   EOL = *WS [comment] NL ; end-of-line may have
                          ; optional trailing comment
   NL = %x0D / %x0A / %x0D.0A
   WS = %x20 / %x09

   ; UTF8 derived from RFC3629, but excluding control characters

   UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
   UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#'
   UTF8-2 = %xC2-DF UTF8-tail
   UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
            %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
   UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
            %xF4 %x80-8F 2( UTF8-tail )

   UTF8-tail = %x80-BF
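
   A parser can process the file one line at a time against this
   grammar.  The following non-normative sketch (Python; the function
   name parse_robotstxt_line is ours) tokenizes a single line
   leniently, in the spirit of the comment above: it strips trailing
   comments, ignores whitespace around the key and the ":", lower-cases
   the key, and skips lines it cannot interpret.

   <CODE BEGINS>
   # Non-normative sketch: lenient single-line tokenizer.
   from typing import Optional, Tuple

   def parse_robotstxt_line(line: str) -> Optional[Tuple[str, str]]:
       line = line.split("#", 1)[0]     # drop an end-of-line comment
       if ":" not in line:
           return None                  # not a key-value line; skip it
       key, _, value = line.partition(":")
       key = key.strip().lower()        # keys match case-insensitively
       value = value.strip()
       if key in ("user-agent", "allow", "disallow"):
           return (key, value)
       return None                      # unknown record; a lenient
                                        # parser may also surface these

   # Both lines below tokenize to ("disallow", "/tmp/"):
   assert parse_robotstxt_line("Disallow: /tmp/ # scratch") == \
       ("disallow", "/tmp/")
   assert parse_robotstxt_line("  disallow :/tmp/") == \
       ("disallow", "/tmp/")
   <CODE ENDS>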

2.2.1.  The user-agent line

   Crawlers set a product token to find relevant groups.  The product
   token MUST contain only "a-zA-Z_-" characters.  The product token
   SHOULD be part of the identification string that the crawler sends
   to the service (for example, in the case of HTTP, the product name
   SHOULD be in the user-agent header).  The identification string
   SHOULD describe the purpose of the crawler.  Here's an example of an
   HTTP header for the ExampleBot crawler, with a link pointing to a
   page describing its purpose; the product token appears both in the
   HTTP header and in the robots.txt user-agent line:

   +-------------------------------------------------+-----------------+
   | HTTP header                                     | robots.txt      |
   |                                                 | user-agent line |
   +-------------------------------------------------+-----------------+
   | user-agent: Mozilla/5.0 (compatible;            | user-agent:     |
   | ExampleBot/0.1;                                 | ExampleBot      |
   | https://www.example.com/bot.html)               |                 |
   +-------------------------------------------------+-----------------+

   Crawlers MUST find the group that matches the product token exactly,
   and then obey the rules of the group.  If there is more than one
   group matching the user-agent, the matching groups' rules MUST be
   combined into one group.  The matching MUST be case-insensitive.  If
   no matching group exists, crawlers MUST obey the first group with a
   user-agent line with a "*" value, if present.  If no group satisfies
   either condition, or no groups are present at all, no rules apply.
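
   These selection steps can be summarized in code.  The sketch below
   is non-normative Python; find_applicable_rules and the
   (user-agents, rules) pairs are illustrative representations, not
   part of the protocol.

   <CODE BEGINS>
   # Non-normative sketch: selecting the rules for a product token.
   # Each group is a pair (user_agents, rules).
   def find_applicable_rules(groups, product_token):
       token = product_token.lower()
       matched, rules = False, []
       for user_agents, group_rules in groups:
           if any(ua.lower() == token for ua in user_agents):
               matched = True
               rules.extend(group_rules)  # combine all matching groups
       if matched:
           return rules                   # may be empty: allow all
       for user_agents, group_rules in groups:
           if "*" in user_agents:
               return group_rules         # first "*" group, if present
       return []                          # no rules apply

   groups = [(["ExampleBot"], [("disallow", "/private/")]),
             (["*"], [("disallow", "/")])]
   # Matching is case-insensitive:
   assert find_applicable_rules(groups, "examplebot") == \
       [("disallow", "/private/")]
   <CODE ENDS>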

2.2.2.  The Allow and Disallow lines

   These lines indicate whether accessing a URI that matches the
   corresponding path is allowed or disallowed.

   To evaluate if access to a URI is allowed, a robot MUST match the
   paths in allow and disallow rules against the URI.  The matching
   SHOULD be case sensitive.  The most specific match found MUST be
   used.  The most specific match is the match that has the most octets.
   If an allow rule and a disallow rule are equivalent, the allow SHOULD
   be used.  If no match is found amongst the rules in a group for a
   matching user-agent, or there are no rules in the group, the URI is
   allowed.  The /robots.txt URI is implicitly allowed.
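
   Non-normatively, this evaluation order can be written as follows
   (Python; is_allowed is an illustrative name).  The sketch uses plain
   prefix matching; the special characters of section 2.2.3 and the
   percent-encoding normalization below are out of scope here.

   <CODE BEGINS>
   # Non-normative sketch: longest-match rule evaluation.
   def is_allowed(rules, path: str) -> bool:
       if path == "/robots.txt":
           return True                      # implicitly allowed
       best_len, best_allow = -1, True      # no match: allowed
       for directive, pattern in rules:
           if pattern and path.startswith(pattern):
               n = len(pattern)             # specificity = octet count
               allow = (directive == "allow")
               if n > best_len or (n == best_len and allow):
                   best_len, best_allow = n, allow  # ties favor allow
       return best_allow

   rules = [("disallow", "/"), ("allow", "/public/")]
   assert is_allowed(rules, "/public/doc.html")      # longer allow wins
   assert not is_allowed(rules, "/private/doc.html")
   <CODE ENDS>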

   Octets in the URI and robots.txt paths outside the range of the US-
   ASCII coded character set, and those in the reserved range defined by
   RFC3986 [1], MUST be percent-encoded as defined by RFC3986 [1] prior
   to comparison.

   If a percent-encoded US-ASCII octet is encountered in the URI, it
   MUST be decoded prior to comparison, unless it is a reserved
   character in the URI as defined by RFC3986 [1] or the character is
   outside the unreserved character range.  The match evaluates
   positively if and only if the end of the path from the rule is
   reached before a difference in octets is encountered.

   For example:

   +--------------------+----------------------+----------------------+
   | Path               | Encoded Path         | Path to match        |
   +--------------------+----------------------+----------------------+
   | /foo/bar?baz=quz   | /foo/bar?baz=quz     | /foo/bar?baz=quz     |
   |                    |                      |                      |
   | /foo/bar?baz=      | /foo/bar?baz=        | /foo/bar?baz=        |
   | http://foo.bar     | http%3A%2F%2Ffoo.bar | http%3A%2F%2Ffoo.bar |
   |                    |                      |                      |
   | /foo/bar/U+E38384  | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   |
   |                    |                      |                      |
   | /foo/bar/%E3%83%84 | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   |
   |                    |                      |                      |
   | /foo/bar/%62%61%7A | /foo/bar/%62%61%7A   | /foo/bar/baz         |
   +--------------------+----------------------+----------------------+
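
   The following non-normative Python sketch implements the two
   encoding steps for the cases shown in the table.  The function name
   normalize is ours, and raw reserved characters (compare the first
   table row, where "?" and "=" stay as-is) are left untouched here.

   <CODE BEGINS>
   # Non-normative sketch: normalizing a path before comparison.
   import re

   UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                 "abcdefghijklmnopqrstuvwxyz0123456789-._~")

   def normalize(path: str) -> str:
       # 1. Percent-encode raw octets outside US-ASCII (UTF-8 text).
       out = []
       for ch in path:
           if ord(ch) > 0x7F:
               out.append("".join("%%%02X" % b
                                  for b in ch.encode("utf-8")))
           else:
               out.append(ch)
       path = "".join(out)
       # 2. Decode %XX escapes of unreserved US-ASCII characters only;
       #    reserved and non-ASCII escapes stay encoded.
       def maybe_decode(m):
           c = chr(int(m.group(1), 16))
           return c if c in UNRESERVED else m.group(0).upper()
       return re.sub(r"%([0-9A-Fa-f]{2})", maybe_decode, path)

   assert normalize("/foo/bar/\u30c4") == "/foo/bar/%E3%83%84"
   assert normalize("/foo/bar/%62%61%7A") == "/foo/bar/baz"
   <CODE ENDS>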

   The crawler SHOULD ignore "disallow" and "allow" rules that are not
   in any group (for example, any rule that precedes the first user-
   agent line).

   Implementers MAY bridge encoding mismatches if they detect that the
   robots.txt file is not UTF-8 encoded.

2.2.3.  Special characters

   Crawlers SHOULD allow the following special characters:

   +-----------+--------------------------------+----------------------+
   | Character | Description                    | Example              |
   +-----------+--------------------------------+----------------------+
   | "#"       | Designates an end of line      | "allow: / # comment  |
   |           | comment.                       | in line"             |
   |           |                                |                      |
   |           |                                | "# comment at the    |
   |           |                                | end"                 |
   |           |                                |                      |
   | "$"       | Designates the end of the      | "allow:              |
   |           | match pattern.  The URI MUST   | /this/path/exactly$" |
   |           | end where the pattern ends     |                      |
   |           | for the rule to match.         |                      |
   |           |                                |                      |
   | "*"       | Designates 0 or more instances | "allow:              |
   |           | of any character.              | /this/*/exactly"     |
   +-----------+--------------------------------+----------------------+

   To match the special characters "$" or "*" verbatim in the URI,
   crawlers SHOULD use "%" encoding in the pattern.  For example:

   +------------------------+------------------------------------------+
   | Pattern                | URI                                      |
   +------------------------+------------------------------------------+
   | /path/file-with-a-     | https://www.example.com/path/file-       |
   | %2A.html               | with-a-*.html                            |
   |                        |                                          |
   | /path/foo-%24          | https://www.example.com/path/foo-$       |
   +------------------------+------------------------------------------+
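
   Non-normative sketch of this matching (Python; pattern_matches is an
   illustrative name).  "*" is translated to a regular-expression
   wildcard and "$" anchors the match at the end of the URI:

   <CODE BEGINS>
   # Non-normative sketch: matching with "*" and "$".
   import re

   def pattern_matches(pattern: str, path: str) -> bool:
       anchored = pattern.endswith("$")   # "$": URI must end here
       if anchored:
           pattern = pattern[:-1]
       # "*" matches zero or more characters; the rest is literal.
       parts = [re.escape(p) for p in pattern.split("*")]
       regex = "^" + ".*".join(parts) + ("$" if anchored else "")
       return re.match(regex, path) is not None

   assert pattern_matches("/this/*/exactly", "/this/and/that/exactly")
   assert pattern_matches("/this/path/exactly$", "/this/path/exactly")
   assert not pattern_matches("/this/path/exactly$",
                              "/this/path/exactly.html")
   <CODE ENDS>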

2.2.4.  Other records

   Clients MAY interpret other records that are not part of the
   robots.txt protocol.  For example, 'sitemap' [4].

2.3.  Access method

   The rules MUST be accessible in a file named "/robots.txt" (all lower
   case) in the top level path of the service.  The file MUST be UTF-8
   encoded (as defined in RFC3629 [5]) and Internet Media Type "text/
   plain" (as defined in RFC2046 [6]).

   As per RFC3986 [1], the URI of the robots.txt is:

   "scheme:[//authority]/robots.txt"

   For example, in the context of HTTP or FTP, the URI is:

   http://www.example.com/robots.txt

   https://www.example.com/robots.txt

   ftp://ftp.example.com/robots.txt
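
   Non-normatively, deriving the robots.txt URI from any resource URI
   keeps the scheme and authority and replaces the rest (Python sketch;
   robots_txt_uri is our name):

   <CODE BEGINS>
   # Non-normative sketch: deriving the robots.txt URI.
   from urllib.parse import urlsplit, urlunsplit

   def robots_txt_uri(resource_uri: str) -> str:
       scheme, authority, _, _, _ = urlsplit(resource_uri)
       # Same scheme and authority, fixed path, no query or fragment.
       return urlunsplit((scheme, authority, "/robots.txt", "", ""))

   assert (robots_txt_uri("https://www.example.com/a/b?q=1")
           == "https://www.example.com/robots.txt")
   <CODE ENDS>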

2.3.1.  Access results

2.3.1.1.  Successful access

   If the crawler successfully downloads the robots.txt, the crawler
   MUST follow the parseable rules.

2.3.1.2.  Redirects

   The server may respond to a robots.txt fetch request with a redirect,
   such as HTTP 301 or HTTP 302.  Crawlers SHOULD follow at least five
   consecutive redirects, even across authorities (for example, hosts in
   the case of HTTP), as defined in RFC1945 [7].

   If a robots.txt file is reached within five consecutive redirects,
   the robots.txt file MUST be fetched, parsed, and its rules followed
   in the context of the initial authority.

   If there are more than five consecutive redirects, crawlers MAY
   assume that the robots.txt is unavailable.
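
   A non-normative sketch of the redirect handling (Python): fetch is a
   stand-in for whatever HTTP client the crawler uses, assumed here to
   return a (status, location, body) triple; the name and signature are
   ours.

   <CODE BEGINS>
   # Non-normative sketch: fetching robots.txt with a redirect cap.
   MAX_REDIRECTS = 5

   def fetch_robots_txt(uri: str, fetch):
       # fetch(uri) is assumed to return (status, location, body).
       for _ in range(MAX_REDIRECTS + 1):
           status, location, body = fetch(uri)
           if status in (301, 302) and location:
               uri = location     # redirects may cross authorities
               continue
           return status, body    # rules apply to the INITIAL authority
       return None, None          # more than five redirects: treat
                                  # robots.txt as unavailable
   <CODE ENDS>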

2.3.1.3.  Unavailable status

   Unavailable means the crawler tries to fetch the robots.txt and the
   server responds with an unavailable status code.  For example, in the
   context of HTTP, unavailable status codes are in the 400-499 range.

   If a server status code indicates that the robots.txt file is
   unavailable to the client, then crawlers MAY access any resources on
   the server or MAY use a cached version of a robots.txt file for up to
   24 hours.

2.3.1.4.  Unreachable status

   If the robots.txt is unreachable due to server or network errors,
   this means the robots.txt is undefined and the crawler MUST assume
   complete disallow.  For example, in the context of HTTP, an
   unreachable robots.txt has a response code in the 500-599 range.  For
   other undefined status codes, the crawler MUST assume the robots.txt
   is unreachable.

   If the robots.txt is undefined for a reasonably long period of time
   (for example, 30 days), clients MAY assume the robots.txt is
   unavailable or continue to use a cached copy.
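
   The status handling of sections 2.3.1.3 and 2.3.1.4 amounts to a
   small decision table, sketched non-normatively below (Python; the
   policy strings are illustrative, and redirects are assumed to be
   resolved before this point per section 2.3.1.2):

   <CODE BEGINS>
   # Non-normative sketch: status code to crawl policy.
   from typing import Optional

   def access_policy(status: Optional[int]) -> str:
       if status is None:            # network error: undefined,
           return "disallow-all"     # assume complete disallow
       if 200 <= status < 300:
           return "use-rules"        # parse and follow the rules
       if 400 <= status < 500:
           return "allow-all"        # unavailable: access MAY be
                                     # unrestricted (section 2.3.1.3)
       return "disallow-all"         # 5xx and other undefined codes:
                                     # unreachable (section 2.3.1.4)
   <CODE ENDS>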

2.3.1.5.  Parsing errors

   Crawlers SHOULD try to parse each line of the robots.txt file.
   Crawlers MUST use the parseable rules.

2.4.  Caching

   Crawlers MAY cache the fetched robots.txt file's contents.  Crawlers
   MAY use standard cache control as defined in RFC2616 [8].  Crawlers
   SHOULD NOT use the cached version for more than 24 hours, unless the
   robots.txt is unreachable.

2.5.  Limits

   Crawlers MAY impose a parsing limit that MUST be at least 500
   kibibytes (KiB).
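
   Non-normatively, the two numeric requirements above can be pinned
   down as constants (Python sketch; the names are ours):

   <CODE BEGINS>
   # Non-normative sketch: caching and parsing limits as constants.
   MAX_CACHE_AGE_SECONDS = 24 * 60 * 60  # SHOULD NOT serve a cached
                                         # copy older than this, unless
                                         # robots.txt is unreachable
   MIN_PARSE_LIMIT_BYTES = 500 * 1024    # a parsing limit MUST be at
                                         # least 500 kibibytes

   def bytes_to_parse(body: bytes) -> bytes:
       # A crawler imposing the minimum limit parses only this prefix.
       return body[:MIN_PARSE_LIMIT_BYTES]
   <CODE ENDS>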

2.6.  Security Considerations

   The Robots Exclusion Protocol MUST NOT be used as a form of security
   measure.  Listing URIs in the robots.txt file exposes them publicly,
   thus making them discoverable.

2.7.  IANA Considerations

   This document has no actions for IANA.

3.  Examples

3.1.  Simple example

   The following example shows:

   o  *foobot*: A regular case.  A single user-agent token followed by
      rules.

   o  *barbot and bazbot*: A group that's relevant for more than one
      user-agent.

   o  *quxbot*: An empty group at the end of the file.

   <CODE BEGINS>
   User-Agent : foobot
   Disallow : /example/page.html
   Disallow : /example/disallowed.gif

   User-Agent : barbot
   User-Agent : bazbot
   Allow : /example/page.html
   Disallow : /example/disallowed.gif

   User-Agent: quxbot

   EOF
   <CODE ENDS>

3.2.  Longest Match

   The following example shows that in the case of two matching rules,
   the longest one MUST be used for matching.  In the following case,
   the rule /example/page/disallowed.gif MUST be used for the URI
   example.com/example/page/disallowed.gif.

   <CODE BEGINS>
   User-Agent : foobot
   Allow : /example/page/
   Disallow : /example/page/disallowed.gif
   <CODE ENDS>

4.  References

4.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in
              RFC 2119 Key Words", BCP 14, RFC 8174, May 2017.

4.2.  URIs

   [1] https://tools.ietf.org/html/rfc3986

   [2] https://tools.ietf.org/html/rfc8288

   [3] https://tools.ietf.org/html/rfc5234

   [4] https://www.sitemaps.org/index.html

   [5] https://tools.ietf.org/html/rfc3629

   [6] https://tools.ietf.org/html/rfc2046

   [7] https://tools.ietf.org/html/rfc1945

   [8] https://tools.ietf.org/html/rfc2616

Authors' Addresses

   Martijn Koster
   Stalworthy Manor Farm
   Suton Lane, NR18 9JG
   Wymondham, Norfolk
   United Kingdom
   Email: m.koster@greenhills.co.uk

   Gary Illyes
   Brandschenkestrasse 110
   8002, Zurich
   Switzerland
   Email: garyillyes@google.com

   Henner Zeller
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   USA
   Email: henner@google.com

   Lizzi Harvey
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   USA
   Email: lizzi@google.com