google_robotstxt_parser 0.0.3
- checksums.yaml +7 -0
- data/.gitignore +28 -0
- data/.gitmodules +3 -0
- data/CHANGELOG.md +5 -0
- data/CODE_OF_CONDUCT.md +46 -0
- data/Gemfile +6 -0
- data/Guardfile +16 -0
- data/LICENSE +22 -0
- data/README.md +57 -0
- data/Rakefile +6 -0
- data/ext/robotstxt/.DS_Store +0 -0
- data/ext/robotstxt/extconf.rb +83 -0
- data/ext/robotstxt/robotstxt/.gitignore +1 -0
- data/ext/robotstxt/robotstxt/BUILD +40 -0
- data/ext/robotstxt/robotstxt/CMakeLists.txt +174 -0
- data/ext/robotstxt/robotstxt/CMakeLists.txt.in +30 -0
- data/ext/robotstxt/robotstxt/CONTRIBUTING.md +30 -0
- data/ext/robotstxt/robotstxt/LICENSE +203 -0
- data/ext/robotstxt/robotstxt/README.md +134 -0
- data/ext/robotstxt/robotstxt/WORKSPACE +28 -0
- data/ext/robotstxt/robotstxt/protocol-draft/README.md +9 -0
- data/ext/robotstxt/robotstxt/protocol-draft/draft-koster-rep-00.txt +529 -0
- data/ext/robotstxt/robotstxt/robots.cc +706 -0
- data/ext/robotstxt/robotstxt/robots.h +241 -0
- data/ext/robotstxt/robotstxt/robots_main.cc +101 -0
- data/ext/robotstxt/robotstxt/robots_test.cc +990 -0
- data/ext/robotstxt/robotstxt.cc +32 -0
- data/google_robotstxt_parser.gemspec +45 -0
- data/lib/google_robotstxt_parser/version.rb +6 -0
- data/lib/google_robotstxt_parser.rb +4 -0
- data/spec/google_robotstxt_parser_spec.rb +33 -0
- data/spec/spec_helper.rb +19 -0
- metadata +146 -0
data/ext/robotstxt/robotstxt/protocol-draft/draft-koster-rep-00.txt (new file, 529 lines):

Network Working Group                                          M. Koster
Internet-Draft                                Stalworthy Computing, Ltd.
Intended status: Draft Standard                                G. Illyes
Expires: January 9, 2020                                       H. Zeller
                                                               L. Harvey
                                                                  Google
                                                           July 07, 2019


                       Robots Exclusion Protocol
                          draft-koster-rep-00

Abstract

   This document standardizes and extends the "Robots Exclusion
   Protocol" <http://www.robotstxt.org/> method originally defined by
   Martijn Koster in 1996 for service owners to control how content
   served by their services may be accessed, if at all, by automatic
   clients known as crawlers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This document may not be modified, and derivative works of it may
   not be created, except to format it for publication as an RFC or to
   translate it into languages other than English.

   This Internet-Draft will expire on January 9, 2020.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Specification
     2.1.  Protocol definition
     2.2.  Formal syntax
       2.2.1.  The user-agent line
       2.2.2.  The Allow and Disallow lines
       2.2.3.  Special characters
       2.2.4.  Other records
     2.3.  Access method
       2.3.1.  Access results
     2.4.  Caching
     2.5.  Limits
     2.6.  Security Considerations
     2.7.  IANA Considerations
   3.  Examples
     3.1.  Simple example
     3.2.  Longest Match
   4.  References
     4.1.  Normative References
     4.2.  URIs
   Authors' Addresses

1.  Introduction

   This document applies to services that provide resources that
   clients can access through URIs as defined in RFC3986 [1].  For
   example, in the context of HTTP, a browser is a client that displays
   the content of a web page.

   Crawlers are automated clients.  Search engines, for instance, have
   crawlers that recursively traverse links for indexing, as defined in
   RFC8288 [2].

   It may be inconvenient for service owners if crawlers visit the
   entirety of their URI space.  This document specifies the rules that
   crawlers MUST obey when accessing URIs.

   These rules are not a form of access authorization.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Specification

2.1.  Protocol definition

   The protocol language consists of rule(s) and group(s); a minimal
   data-model sketch follows this list:

   o  *Rule*: A line with a key-value pair that defines how a crawler
      may access URIs.  See section "The Allow and Disallow lines".

   o  *Group*: One or more user-agent lines that are followed by one or
      more rules.  The group is terminated by a user-agent line or end
      of file.  See "The user-agent line".  The last group may have no
      rules, which means it implicitly allows everything.
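   Below is a minimal data-model sketch of these two record types in
   Python.  The names Rule and Group are illustrative and not taken
   from any implementation.

   <CODE BEGINS>
   # Illustrative shapes for the two record types defined above.
   from dataclasses import dataclass, field
   from typing import List

   @dataclass
   class Rule:
       allow: bool        # True for "allow", False for "disallow"
       path_pattern: str  # e.g. "/example/page/"

   @dataclass
   class Group:
       # one or more product tokens from user-agent lines
       user_agents: List[str] = field(default_factory=list)
       # may be empty, which implicitly allows everything
       rules: List[Rule] = field(default_factory=list)
   <CODE ENDS>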
2.2.  Formal syntax

   Below is an Augmented Backus-Naur Form (ABNF) description, as
   described in RFC5234 [3].

   robotstxt = *(group / emptyline)
   group = startgroupline                ; We start with a user-agent
           *(startgroupline / emptyline) ; ... and possibly more
                                         ; user-agents
           *(rule / emptyline)           ; followed by rules relevant
                                         ; for UAs

   startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

   rule = *WS ("allow" / "disallow") *WS ":"
          *WS (path-pattern / empty-pattern) EOL

   ; parser implementors: add additional lines you need (for
   ; example Sitemaps), and be lenient when reading lines that don't
   ; conform.  Apply Postel's law.

   product-token = identifier / "*"
   path-pattern = "/" *(UTF8-char-noctl) ; valid URI path pattern
   empty-pattern = *WS

   identifier = 1*(%x2d / %x41-5a / %x5f / %x61-7a)
   comment = "#" *(UTF8-char-noctl / WS / "#")
   emptyline = EOL
   EOL = *WS [comment] NL ; end-of-line may have
                          ; optional trailing comment
   NL = %x0D / %x0A / %x0D.0A
   WS = %x20 / %x09

   ; UTF8 derived from RFC3629, but excluding control characters

   UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
   UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#'
   UTF8-2 = %xC2-DF UTF8-tail
   UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
            %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
   UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
            %xF4 %x80-8F 2( UTF8-tail )

   UTF8-tail = %x80-BF
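   To make the leniency requirement concrete, here is a minimal parser
   sketch in Python.  It follows the grammar above loosely and applies
   Postel's law: lines that do not conform are skipped rather than
   rejected.  The dictionary shapes are an assumption of this sketch,
   not part of the draft; only the 500 kibibyte limit is taken from
   Section 2.5.

   <CODE BEGINS>
   def parse_robotstxt(text: str):
       """Lenient robots.txt parser sketch; returns a list of groups."""
       groups, current = [], None
       for raw_line in text[:500 * 1024].splitlines():  # parsing limit
           line = raw_line.split("#", 1)[0].strip()     # strip comments
           if ":" not in line:
               continue                      # lenient: skip junk lines
           key, _, value = line.partition(":")
           key, value = key.strip().lower(), value.strip()
           if key == "user-agent":
               # a user-agent line after rules starts a new group
               if current is None or current["rules"]:
                   current = {"agents": [], "rules": []}
                   groups.append(current)
               current["agents"].append(value.lower())
           elif key in ("allow", "disallow") and current is not None:
               # rules before any user-agent line are ignored
               current["rules"].append((key == "allow", value))
           # other records (e.g. "sitemap") could be collected here
       return groups
   <CODE ENDS>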
2.2.1.  The user-agent line

   Crawlers set a product token to find relevant groups.  The product
   token MUST contain only "a-zA-Z_-" characters.  The product token
   SHOULD be part of the identification string that the crawler sends
   to the service (for example, in the case of HTTP, the product name
   SHOULD be in the user-agent header).  The identification string
   SHOULD describe the purpose of the crawler.  Here's an example of an
   HTTP header with a link pointing to a page describing the purpose of
   the ExampleBot crawler, which appears both in the HTTP header and as
   a product token:

   +-------------------------------------------------+-----------------+
   | HTTP header                                     | robots.txt      |
   |                                                 | user-agent line |
   +-------------------------------------------------+-----------------+
   | user-agent: Mozilla/5.0 (compatible;            | user-agent:     |
   | ExampleBot/0.1;                                 | ExampleBot      |
   | https://www.example.com/bot.html)               |                 |
   +-------------------------------------------------+-----------------+

   Crawlers MUST find the group that matches the product token exactly,
   and then obey the rules of the group.  If there is more than one
   group matching the user-agent, the matching groups' rules MUST be
   combined into one group.  The matching MUST be case-insensitive.  If
   no matching group exists, crawlers MUST obey the first group with a
   user-agent line with a "*" value, if present.  If no group satisfies
   either condition, or no groups are present at all, no rules apply.
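   A sketch of this selection logic, reusing the group shape from the
   parser sketch in Section 2.2 (an assumption of these sketches, not a
   fixed API):

   <CODE BEGINS>
   def select_rules(groups, product_token: str):
       """Merge the rules of all groups matching the product token."""
       token = product_token.lower()       # matching is case-insensitive
       matched = [g for g in groups if token in g["agents"]]
       if not matched:
           # fall back to the first "*" group, if present
           matched = [g for g in groups if "*" in g["agents"]][:1]
       rules = []
       for g in matched:                   # combine into one group
           rules.extend(g["rules"])
       return rules                        # empty: no rules apply
   <CODE ENDS>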
2.2.2.  The Allow and Disallow lines

   These lines indicate whether accessing a URI that matches the
   corresponding path is allowed or disallowed.

   To evaluate if access to a URI is allowed, a robot MUST match the
   paths in allow and disallow rules against the URI.  The matching
   SHOULD be case sensitive.  The most specific match found MUST be
   used.  The most specific match is the match that has the most
   octets.  If an allow rule and a disallow rule are equivalent, the
   allow SHOULD be used.  If no match is found amongst the rules in a
   group for a matching user-agent, or there are no rules in the group,
   the URI is allowed.  The /robots.txt URI is implicitly allowed.

   Octets in the URI and robots.txt paths outside the range of the US-
   ASCII coded character set, and those in the reserved range defined
   by RFC3986 [1], MUST be percent-encoded as defined by RFC3986 [1]
   prior to comparison.

   If a percent-encoded US-ASCII octet is encountered in the URI, it
   MUST be unencoded prior to comparison, unless it is a reserved
   character in the URI as defined by RFC3986 [1] or the character is
   outside the unreserved character range.  The match evaluates
   positively if and only if the end of the path from the rule is
   reached before a difference in octets is encountered.

   For example:

   +----------------------+----------------------+----------------------+
   | Path                 | Encoded Path         | Path to match        |
   +----------------------+----------------------+----------------------+
   | /foo/bar?baz=quz     | /foo/bar?baz=quz     | /foo/bar?baz=quz     |
   |                      |                      |                      |
   | /foo/bar?baz=        | /foo/bar?baz=http%3A | /foo/bar?baz=http%3A |
   | http://foo.bar       | %2F%2Ffoo.bar        | %2F%2Ffoo.bar        |
   |                      |                      |                      |
   | /foo/bar/U+E38384    | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   |
   |                      |                      |                      |
   | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   |
   |                      |                      |                      |
   | /foo/bar/%62%61%7A   | /foo/bar/%62%61%7A   | /foo/bar/baz         |
   +----------------------+----------------------+----------------------+

   The crawler SHOULD ignore "disallow" and "allow" rules that are not
   in any group (for example, any rule that precedes the first user-
   agent line).

   Implementers MAY bridge encoding mismatches if they detect that the
   robots.txt file is not UTF-8 encoded.  A sketch of this
   normalization step follows.
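   The following Python sketch implements one reading of this
   normalization: percent-escapes that decode to unreserved US-ASCII
   characters are decoded, other escapes are kept, and non-ASCII
   characters are percent-encoded.  Raw reserved characters are left
   untouched, since their treatment depends on where the octet appears.

   <CODE BEGINS>
   import re

   UNRESERVED = re.compile(r"[A-Za-z0-9._~-]")   # RFC 3986 unreserved
   ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")

   def normalize_path(path: str) -> str:
       out, i = [], 0
       while i < len(path):
           if path[i] == "%" and ESCAPE.fullmatch(path, i, i + 3):
               decoded = chr(int(path[i + 1:i + 3], 16))
               if UNRESERVED.fullmatch(decoded):
                   out.append(decoded)                # "%62" -> "b"
               else:
                   out.append(path[i:i + 3].upper())  # keep other escapes
               i += 3
           elif ord(path[i]) > 0x7E:                  # non-ASCII: encode
               out.append("".join("%%%02X" % b
                                  for b in path[i].encode("utf-8")))
               i += 1
           else:
               out.append(path[i])
               i += 1
       return "".join(out)

   # Rows from the table above:
   print(normalize_path("/foo/bar/%62%61%7A"))  # /foo/bar/baz
   print(normalize_path("/foo/bar/\u30c4"))     # /foo/bar/%E3%83%84
   <CODE ENDS>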
2.2.3.  Special characters

   Crawlers SHOULD allow the following special characters:

   +-----------+--------------------------------+----------------------+
   | Character | Description                    | Example              |
   +-----------+--------------------------------+----------------------+
   | "#"       | Designates an end-of-line      | "allow: / # comment  |
   |           | comment.                       | in line"             |
   |           |                                |                      |
   |           |                                | "# comment at the    |
   |           |                                | end"                 |
   |           |                                |                      |
   | "$"       | Designates the end of the      | "allow:              |
   |           | match pattern.  A URI MUST end | /this/path/exactly$" |
   |           | with a $.                      |                      |
   |           |                                |                      |
   | "*"       | Designates 0 or more instances | "allow:              |
   |           | of any character.              | /this/*/exactly"     |
   +-----------+--------------------------------+----------------------+

   If crawlers need to match these special characters verbatim in the
   URI, they SHOULD use "%" encoding.  For example:

   +------------------------+------------------------------------------+
   | Pattern                | URI                                      |
   +------------------------+------------------------------------------+
   | /path/file-            | https://www.example.com/path/file-       |
   | with-a-%2A.html        | with-a-*.html                            |
   |                        |                                          |
   | /path/foo-%24          | https://www.example.com/path/foo-$       |
   +------------------------+------------------------------------------+
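   Putting these special characters and the longest-match rule of
   Section 2.2.2 together, here is a matching sketch in Python.  The
   (allow, pattern) tuples follow the earlier parser sketch, which is
   an assumption of these examples rather than a defined API.

   <CODE BEGINS>
   import re

   def pattern_matches(pattern: str, path: str) -> bool:
       """Prefix match with "*" as wildcard and "$" as end anchor."""
       anchored = pattern.endswith("$")
       core = pattern[:-1] if anchored else pattern
       regex = ".*".join(re.escape(part) for part in core.split("*"))
       regex = "^" + regex + ("$" if anchored else "")
       return re.search(regex, path) is not None

   def is_allowed(rules, path: str) -> bool:
       if path == "/robots.txt":
           return True                      # implicitly allowed
       best_len, best_allow = -1, True      # no match: allowed
       for allow, pattern in rules:
           if pattern and pattern_matches(pattern, path):
               # most octets wins; on a tie, allow wins
               if (len(pattern) > best_len
                       or (len(pattern) == best_len and allow)):
                   best_len, best_allow = len(pattern), allow
       return best_allow
   <CODE ENDS>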
2.2.4.  Other records

   Clients MAY interpret other records that are not part of the
   robots.txt protocol.  For example, 'sitemap' [4].

2.3.  Access method

   The rules MUST be accessible in a file named "/robots.txt" (all
   lower case) in the top level path of the service.  The file MUST be
   UTF-8 encoded (as defined in RFC3629 [5]) and Internet Media Type
   "text/plain" (as defined in RFC2046 [6]).

   As per RFC3986 [1], the URI of the robots.txt is:

   "scheme:[//authority]/robots.txt"

   For example, in the context of HTTP or FTP, the URI is:

   http://www.example.com/robots.txt

   https://www.example.com/robots.txt

   ftp://ftp.example.com/robots.txt

2.3.1.  Access results

2.3.1.1.  Successful access

   If the crawler successfully downloads the robots.txt, the crawler
   MUST follow the parseable rules.

2.3.1.2.  Redirects

   The server may respond to a robots.txt fetch request with a
   redirect, such as HTTP 301 and HTTP 302.  The crawlers SHOULD follow
   at least five consecutive redirects, even across authorities (for
   example, hosts in the case of HTTP), as defined in RFC1945 [7].

   If a robots.txt file is reached within five consecutive redirects,
   the robots.txt file MUST be fetched, parsed, and its rules followed
   in the context of the initial authority.

   If there are more than five consecutive redirects, crawlers MAY
   assume that the robots.txt is unavailable.

2.3.1.3.  Unavailable status

   Unavailable means the crawler tries to fetch the robots.txt and the
   server responds with unavailable status codes.  For example, in the
   context of HTTP, unavailable status codes are in the 400-499 range.

   If a server status code indicates that the robots.txt file is
   unavailable to the client, then crawlers MAY access any resources on
   the server or MAY use a cached version of a robots.txt file for up
   to 24 hours.

2.3.1.4.  Unreachable status

   If the robots.txt is unreachable due to server or network errors,
   this means the robots.txt is undefined and the crawler MUST assume
   complete disallow.  For example, in the context of HTTP, an
   unreachable robots.txt has a response code in the 500-599 range.
   For other undefined status codes, the crawler MUST assume the
   robots.txt is unreachable.

   If the robots.txt is undefined for a reasonably long period of time
   (for example, 30 days), clients MAY assume the robots.txt is
   unavailable or continue to use a cached copy.
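   A fetch sketch in Python tying these outcomes together.  It assumes
   the third-party "requests" library; the UNAVAILABLE and UNREACHABLE
   sentinels are illustrative names, not part of the draft.

   <CODE BEGINS>
   import requests

   UNAVAILABLE = ""    # treat as no rules: everything allowed (2.3.1.3)
   UNREACHABLE = None  # treat as undefined: complete disallow (2.3.1.4)

   def fetch_robotstxt(authority: str, scheme: str = "https"):
       session = requests.Session()
       session.max_redirects = 5            # 2.3.1.2: five redirects
       try:
           resp = session.get(f"{scheme}://{authority}/robots.txt",
                              timeout=10)
       except requests.TooManyRedirects:
           return UNAVAILABLE               # MAY assume unavailable
       except requests.RequestException:
           return UNREACHABLE               # network error
       if 400 <= resp.status_code < 500:
           return UNAVAILABLE
       if resp.status_code >= 500:
           return UNREACHABLE               # server error
       return resp.text                     # 2.3.1.1: parse and follow
   <CODE ENDS>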
2.3.1.5.  Parsing errors

   Crawlers SHOULD try to parse each line of the robots.txt file.
   Crawlers MUST use the parseable rules.

2.4.  Caching

   Crawlers MAY cache the fetched robots.txt file's contents.  Crawlers
   MAY use standard cache control as defined in RFC2616 [8].  Crawlers
   SHOULD NOT use the cached version for more than 24 hours, unless the
   robots.txt is unreachable.

2.5.  Limits

   Crawlers MAY impose a parsing limit, which MUST be at least 500
   kibibytes (KiB).

2.6.  Security Considerations

   The Robots Exclusion Protocol MUST NOT be used as a form of security
   measure.  Listing URIs in the robots.txt file exposes them publicly
   and thus makes them discoverable.

2.7.  IANA Considerations

   This document has no actions for IANA.
3.  Examples

3.1.  Simple example

   The following example shows:

   o  *foobot*: A regular case.  A single user-agent token followed by
      rules.

   o  *barbot and bazbot*: A group that's relevant for more than one
      user-agent.

   o  *quxbot*: Empty group at end of file.

   <CODE BEGINS>
   User-Agent : foobot
   Disallow : /example/page.html
   Disallow : /example/disallowed.gif

   User-Agent : barbot
   User-Agent : bazbot
   Allow : /example/page.html
   Disallow : /example/disallowed.gif

   User-Agent : quxbot

   EOF
   <CODE ENDS>
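   As a cross-check, the following self-contained Python sketch
   evaluates the example file above with simple prefix matching (the
   example needs no "*" or "$" handling).  The helper names are
   illustrative.

   <CODE BEGINS>
   ROBOTS = """\
   User-Agent : foobot
   Disallow : /example/page.html
   Disallow : /example/disallowed.gif

   User-Agent : barbot
   User-Agent : bazbot
   Allow : /example/page.html
   Disallow : /example/disallowed.gif

   User-Agent : quxbot
   """

   def rules_for(text, token):
       groups, current = [], None
       for line in text.splitlines():
           key, _, value = line.partition(":")
           key, value = key.strip().lower(), value.strip()
           if key == "user-agent":
               if current is None or current["rules"]:
                   current = {"agents": [], "rules": []}
                   groups.append(current)
               current["agents"].append(value.lower())
           elif key in ("allow", "disallow") and current:
               current["rules"].append((key == "allow", value))
       matched = [g for g in groups if token.lower() in g["agents"]]
       return [r for g in matched for r in g["rules"]]

   def allowed(rules, path):
       hits = [(len(p), a) for a, p in rules if p and path.startswith(p)]
       return max(hits)[1] if hits else True   # longest match wins

   print(allowed(rules_for(ROBOTS, "foobot"), "/example/page.html"))  # False
   print(allowed(rules_for(ROBOTS, "barbot"), "/example/page.html"))  # True
   print(allowed(rules_for(ROBOTS, "quxbot"), "/anything"))           # True
   <CODE ENDS>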
3.2.  Longest Match

   The following example shows that in the case of two rules, the
   longest one MUST be used for matching.  In the following case, the
   rule /example/page/disallowed.gif MUST be used for the URI
   example.com/example/page/disallowed.gif.

   <CODE BEGINS>
   User-Agent : foobot
   Allow : /example/page/
   Disallow : /example/page/disallowed.gif
   <CODE ENDS>
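   A short self-contained check of this case: both patterns match the
   URI path, and the longer disallow pattern wins.

   <CODE BEGINS>
   rules = [(True, "/example/page/"),
            (False, "/example/page/disallowed.gif")]
   path = "/example/page/disallowed.gif"

   hits = [(len(p), allow) for allow, p in rules if path.startswith(p)]
   print(max(hits))   # (28, False): access is disallowed
   <CODE ENDS>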
4.  References

4.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in
              RFC 2119 Key Words", BCP 14, RFC 8174, May 2017.

4.2.  URIs

   [1] https://tools.ietf.org/html/rfc3986

   [2] https://tools.ietf.org/html/rfc8288

   [3] https://tools.ietf.org/html/rfc5234

   [4] https://www.sitemaps.org/index.html

   [5] https://tools.ietf.org/html/rfc3629

   [6] https://tools.ietf.org/html/rfc2046

   [7] https://tools.ietf.org/html/rfc1945

   [8] https://tools.ietf.org/html/rfc2616

Authors' Addresses

   Martijn Koster
   Stalworthy Manor Farm
   Suton Lane, NR18 9JG
   Wymondham, Norfolk
   United Kingdom

   Email: m.koster@greenhills.co.uk

   Gary Illyes
   Brandschenkestrasse 110
   8002, Zurich
   Switzerland

   Email: garyillyes@google.com

   Henner Zeller
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   USA

   Email: henner@google.com

   Lizzi Harvey
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   USA

   Email: lizzi@google.com