google_robotstxt_parser 0.0.3

Network Working Group                                          M. Koster
Internet-Draft                                Stalworthy Computing, Ltd.
Intended status: Draft Standard                                G. Illyes
Expires: January 9, 2020                                       H. Zeller
                                                               L. Harvey
                                                                  Google
                                                           July 07, 2019


                       Robots Exclusion Protocol
                          draft-koster-rep-00

Abstract

   This document standardizes and extends the "Robots Exclusion
   Protocol" <http://www.robotstxt.org/> method originally defined by
   Martijn Koster in 1996 for service owners to control how content
   served by their services may be accessed, if at all, by automatic
   clients known as crawlers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This document may not be modified, and derivative works of it may
   not be created, except to format it for publication as an RFC or to
   translate it into languages other than English.

   This Internet-Draft will expire on January 9, 2020.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Specification
     2.1.  Protocol definition
     2.2.  Formal syntax
       2.2.1.  The user-agent line
       2.2.2.  The Allow and Disallow lines
       2.2.3.  Special characters
       2.2.4.  Other records
     2.3.  Access method
       2.3.1.  Access results
     2.4.  Caching
     2.5.  Limits
     2.6.  Security Considerations
     2.7.  IANA Considerations
   3.  Examples
     3.1.  Simple example
     3.2.  Longest Match
   4.  References
     4.1.  Normative References
     4.2.  URIs
   Authors' Addresses

1.  Introduction

   This document applies to services that provide resources that
   clients can access through URIs as defined in RFC3986 [1].  For
   example, in the context of HTTP, a browser is a client that displays
   the content of a web page.

   Crawlers are automated clients.  Search engines, for instance, have
   crawlers that recursively traverse links for indexing, as defined in
   RFC8288 [2].

   It may be inconvenient for service owners if crawlers visit the
   entirety of their URI space.  This document specifies the rules that
   crawlers MUST obey when accessing URIs.

   These rules are not a form of access authorization.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Specification

2.1.  Protocol definition

   The protocol language consists of rule(s) and group(s):

   o  *Rule*: A line with a key-value pair that defines how a crawler
      may access URIs.  See section "The Allow and Disallow lines".

   o  *Group*: One or more user-agent lines that are followed by one or
      more rules.  The group is terminated by a user-agent line or end
      of file.  See "The user-agent line".  The last group may have no
      rules, which means it implicitly allows everything (see the
      sketch below).

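   The group and rule structure above maps naturally onto a small data
   model.  Below is a minimal, non-normative Python sketch of that
   structure; the class and field names are illustrative only, not
   part of this specification:

   <CODE BEGINS>
   from dataclasses import dataclass, field

   @dataclass
   class Rule:
       kind: str       # "allow" or "disallow"
       pattern: str    # the path pattern the rule applies to

   @dataclass
   class Group:
       # Product tokens from one or more consecutive user-agent lines.
       user_agents: list = field(default_factory=list)
       # Zero or more rules; an empty list implicitly allows everything.
       rules: list = field(default_factory=list)
   <CODE ENDS>
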
2.2.  Formal syntax

   Below is an Augmented Backus-Naur Form (ABNF) description, as
   described in RFC5234 [3].

   robotstxt = *(group / emptyline)
   group = startgroupline                ; We start with a user-agent
           *(startgroupline / emptyline) ; ... and possibly more
                                         ; user-agents
           *(rule / emptyline)           ; followed by rules relevant
                                         ; for UAs

   startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

   rule = *WS ("allow" / "disallow") *WS ":"
          *WS (path-pattern / empty-pattern) EOL

   ; parser implementors: add additional lines you need (for
   ; example Sitemaps), and be lenient when reading lines that don't
   ; conform.  Apply Postel's law.

   product-token = identifier / "*"
   path-pattern = "/" *(UTF8-char-noctl) ; valid URI path pattern
   empty-pattern = *WS

   identifier = 1*(%x2d / %x41-5a / %x5f / %x61-7a)
   comment = "#" *(UTF8-char-noctl / WS / "#")
   emptyline = EOL
   EOL = *WS [comment] NL                ; end-of-line may have an
                                         ; optional trailing comment
   NL = %x0D / %x0A / %x0D.0A
   WS = %x20 / %x09

   ; UTF8 derived from RFC3629, but excluding control characters

   UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
   UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#'
   UTF8-2 = %xC2-DF UTF8-tail
   UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
            %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
   UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
            %xF4 %x80-8F 2( UTF8-tail )

   UTF8-tail = %x80-BF

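   The ABNF above translates directly into a line scanner.  The
   following non-normative Python sketch recognizes the three line
   types and applies Postel's law by skipping lines that do not
   conform; it does not validate the full UTF-8 productions:

   <CODE BEGINS>
   import re

   # user-agent, allow and disallow lines; keys are case-insensitive
   # and surrounded by optional whitespace, values end at a comment.
   LINE_RE = re.compile(
       r"^\s*(?P<key>user-agent|allow|disallow)\s*:\s*(?P<value>[^#]*)",
       re.IGNORECASE,
   )

   def parse_lines(text):
       """Yield (key, value) pairs for recognizable robots.txt lines."""
       for raw in text.splitlines():
           m = LINE_RE.match(raw)
           if m:
               yield m.group("key").lower(), m.group("value").strip()
   <CODE ENDS>
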
2.2.1.  The user-agent line

   Crawlers set a product token to find relevant groups.  The product
   token MUST contain only "a-zA-Z_-" characters.  The product token
   SHOULD be part of the identification string that the crawler sends
   to the service (for example, in the case of HTTP, the product name
   SHOULD be in the user-agent header).  The identification string
   SHOULD describe the purpose of the crawler.  Here's an example of an
   HTTP header with a link pointing to a page describing the purpose of
   the ExampleBot crawler, which appears both in the HTTP header and as
   a product token:

   +-------------------------------------------------+-----------------+
   | HTTP header                                     | robots.txt      |
   |                                                 | user-agent line |
   +-------------------------------------------------+-----------------+
   | user-agent: Mozilla/5.0 (compatible;            | user-agent:     |
   | ExampleBot/0.1;                                 | ExampleBot      |
   | https://www.example.com/bot.html)               |                 |
   +-------------------------------------------------+-----------------+

   Crawlers MUST find the group that matches the product token exactly,
   and then obey the rules of the group.  If there is more than one
   group matching the user-agent, the matching groups' rules MUST be
   combined into one group.  The matching MUST be case-insensitive.  If
   no matching group exists, crawlers MUST obey the first group with a
   user-agent line with a "*" value, if present.  If no group satisfies
   either condition, or no groups are present at all, no rules apply.

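   A non-normative sketch of this group-selection logic, assuming
   groups are (user_agents, rules) pairs such as those a parser built
   from the Section 2.2 sketch might produce:

   <CODE BEGINS>
   def select_rules(groups, product_token):
       """Combine the rules of every group matching product_token."""
       token = product_token.lower()
       matching = [rules for agents, rules in groups
                   if token in (a.lower() for a in agents)]
       if matching:
           # Multiple matching groups are combined into one.
           return [rule for rules in matching for rule in rules]
       for agents, rules in groups:
           if "*" in agents:
               return list(rules)   # first wildcard group, if present
       return []                    # no rules apply
   <CODE ENDS>
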
2.2.2.  The Allow and Disallow lines

   These lines indicate whether accessing a URI that matches the
   corresponding path is allowed or disallowed.

   To evaluate whether access to a URI is allowed, a robot MUST match
   the paths in allow and disallow rules against the URI.  The matching
   SHOULD be case sensitive.  The most specific match found MUST be
   used.  The most specific match is the match that has the most
   octets.  If an allow rule and a disallow rule are equivalent, the
   allow SHOULD be used.  If no match is found amongst the rules in a
   group for a matching user-agent, or there are no rules in the group,
   the URI is allowed.  The /robots.txt URI is implicitly allowed.

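   A non-normative Python sketch of this evaluation, using plain
   prefix matching on already-normalized paths (the special characters
   "*" and "$" from Section 2.2.3 are ignored here):

   <CODE BEGINS>
   def is_allowed(rules, uri_path):
       """rules: (kind, pattern) pairs, kind "allow" or "disallow"."""
       if uri_path == "/robots.txt":
           return True                        # implicitly allowed
       best_len, best_allow = -1, True        # no match means allowed
       for kind, pattern in rules:
           if uri_path.startswith(pattern):
               octets = len(pattern.encode("utf-8"))
               allow = kind == "allow"
               # Most octets wins; equivalent rules favor allow.
               if octets > best_len or (octets == best_len and allow):
                   best_len, best_allow = octets, allow
       return best_allow
   <CODE ENDS>
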
   Octets in the URI and robots.txt paths outside the range of the US-
   ASCII coded character set, and those in the reserved range defined
   by RFC3986 [1], MUST be percent-encoded as defined by RFC3986 [1]
   prior to comparison.

   If a percent-encoded US-ASCII octet is encountered in the URI, it
   MUST be unencoded prior to comparison, unless it is a reserved
   character in the URI as defined by RFC3986 [1] or the character is
   outside the unreserved character range.  The match evaluates
   positively if and only if the end of the path from the rule is
   reached before a difference in octets is encountered.

   For example:

   +-------------------+-----------------------+-----------------------+
   | Path              | Encoded Path          | Path to match         |
   +-------------------+-----------------------+-----------------------+
   | /foo/bar?baz=quz  | /foo/bar?baz=quz      | /foo/bar?baz=quz      |
   |                   |                       |                       |
   | /foo/bar?baz=http | /foo/bar?baz=http%3A% | /foo/bar?baz=http%3A% |
   | ://foo.bar        | 2F%2Ffoo.bar          | 2F%2Ffoo.bar          |
   |                   |                       |                       |
   | /foo/bar/U+E38384 | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |
   |                   |                       |                       |
   | /foo/bar/%E3%83%8 | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |
   | 4                 |                       |                       |
   |                   |                       |                       |
   | /foo/bar/%62%61%7 | /foo/bar/%62%61%7A    | /foo/bar/baz          |
   | A                 |                       |                       |
   +-------------------+-----------------------+-----------------------+

   The crawler SHOULD ignore "disallow" and "allow" rules that are not
   in any group (for example, any rule that precedes the first user-
   agent line).

   Implementers MAY bridge encoding mismatches if they detect that the
   robots.txt file is not UTF-8 encoded.

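   A rough, non-normative sketch of this normalization.  It decodes
   percent-escapes of unreserved US-ASCII characters, leaves other
   escapes intact, and percent-encodes raw non-ASCII octets; encoding
   raw reserved characters (such as "://" inside a query, as in the
   table above) is omitted for brevity:

   <CODE BEGINS>
   import re

   UNRESERVED = re.compile(r"[A-Za-z0-9._~-]")
   PCT = re.compile(r"%([0-9A-Fa-f]{2})")

   def normalize_path(path):
       def maybe_decode(m):
           ch = chr(int(m.group(1), 16))
           # %62 -> "b", but %3A (reserved ":") stays encoded.
           return ch if UNRESERVED.match(ch) else m.group(0).upper()
       path = PCT.sub(maybe_decode, path)
       out = []
       for ch in path:
           if ord(ch) < 0x80:
               out.append(ch)
           else:   # raw non-ASCII: emit its UTF-8 octets as %XX
               out.extend("%%%02X" % b for b in ch.encode("utf-8"))
       return "".join(out)

   assert normalize_path("/foo/bar/%62%61%7A") == "/foo/bar/baz"
   <CODE ENDS>
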
2.2.3.  Special characters

   Crawlers SHOULD allow the following special characters:

   +-----------+--------------------------------+----------------------+
   | Character | Description                    | Example              |
   +-----------+--------------------------------+----------------------+
   | "#"       | Designates an end-of-line      | "allow: / # comment  |
   |           | comment.                       | in line"             |
   |           |                                |                      |
   |           |                                | "# comment at the    |
   |           |                                | end"                 |
   |           |                                |                      |
   | "$"       | Designates the end of the      | "allow:              |
   |           | match pattern.  A URI MUST end | /this/path/exactly$" |
   |           | with a $.                      |                      |
   |           |                                |                      |
   | "*"       | Designates 0 or more instances | "allow:              |
   |           | of any character.              | /this/*/exactly"     |
   +-----------+--------------------------------+----------------------+

   If crawlers match special characters verbatim in the URI, crawlers
   SHOULD use "%" encoding.  For example:

   +------------------------+------------------------------------------+
   | Pattern                | URI                                      |
   +------------------------+------------------------------------------+
   | /path/file-            | https://www.example.com/path/file-       |
   | with-a-%2A.html        | with-a-*.html                            |
   |                        |                                          |
   | /path/foo-%24          | https://www.example.com/path/foo-$       |
   +------------------------+------------------------------------------+

315
+ 2.2.4. Other records
316
+
317
+ Clients MAY interpret other records that are not part of the
318
+ robots.txt protocol. For example, 'sitemap' [4].
319
+
320
+ 2.3. Access method
321
+
322
+ The rules MUST be accessible in a file named "/robots.txt" (all lower
323
+ case) in the top level path of the service. The file MUST be UTF-8
324
+ encoded (as defined in RFC3629 [5]) and Internet Media Type "text/
325
+ plain" (as defined in RFC2046 [6]).
326
+
327
+ As per RFC3986 [1], the URI of the robots.txt is:
328
+
329
+ "scheme:[//authority]/robots.txt"
330
+
331
+ For example, in the context of HTTP or FTP, the URI is:
332
+
333
+ http://www.example.com/robots.txt
334
+
335
+
336
+
337
+ Koster, et al. Expires January 9, 2020 [Page 6]
338
+
339
+ Internet-Draft I-D July 2019
340
+
341
+
342
+ https://www.example.com/robots.txt
343
+
344
+ ftp://ftp.example.com/robots.txt
345
+
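   For illustration, deriving the robots.txt URI from an arbitrary
   resource URI (a non-normative helper):

   <CODE BEGINS>
   from urllib.parse import urlsplit, urlunsplit

   def robots_txt_uri(uri):
       # Keep scheme and authority, replace the rest of the URI.
       scheme, authority = urlsplit(uri)[:2]
       return urlunsplit((scheme, authority, "/robots.txt", "", ""))

   assert (robots_txt_uri("https://www.example.com/a/b?c=d")
           == "https://www.example.com/robots.txt")
   <CODE ENDS>
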
2.3.1.  Access results

2.3.1.1.  Successful access

   If the crawler successfully downloads the robots.txt, the crawler
   MUST follow the parseable rules.

2.3.1.2.  Redirects

   The server may respond to a robots.txt fetch request with a
   redirect, such as HTTP 301 and HTTP 302.  The crawlers SHOULD follow
   at least five consecutive redirects, even across authorities (for
   example, hosts in the case of HTTP), as defined in RFC1945 [7].

   If a robots.txt file is reached within five consecutive redirects,
   the robots.txt file MUST be fetched, parsed, and its rules followed
   in the context of the initial authority.

   If there are more than five consecutive redirects, crawlers MAY
   assume that the robots.txt is unavailable.

2.3.1.3.  Unavailable status

   Unavailable means the crawler tries to fetch the robots.txt, and
   the server responds with unavailable status codes.  For example, in
   the context of HTTP, unavailable status codes are in the 400-499
   range.

   If a server status code indicates that the robots.txt file is
   unavailable to the client, then crawlers MAY access any resources
   on the server or MAY use a cached version of a robots.txt file for
   up to 24 hours.

2.3.1.4.  Unreachable status

   If the robots.txt is unreachable due to server or network errors,
   this means the robots.txt is undefined and the crawler MUST assume
   complete disallow.  For example, in the context of HTTP, an
   unreachable robots.txt has a response code in the 500-599 range.
   For other undefined status codes, the crawler MUST assume the
   robots.txt is unreachable.

   If the robots.txt is undefined for a reasonably long period of time
   (for example, 30 days), clients MAY assume the robots.txt is
   unavailable or continue to use a cached copy.

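   The access results above map onto three fetch outcomes: a body to
   parse, "allow all" (unavailable), and "disallow all" (unreachable).
   A non-normative Python sketch; urllib follows redirects internally,
   with a default limit that satisfies the five required here:

   <CODE BEGINS>
   import urllib.error
   import urllib.request

   MAX_SIZE = 500 * 1024      # parsing limit of Section 2.5

   def fetch_robots_txt(url):
       """Return the file body, "" for allow all, None for disallow."""
       try:
           with urllib.request.urlopen(url) as resp:
               return resp.read(MAX_SIZE).decode("utf-8", "replace")
       except urllib.error.HTTPError as err:
           if 400 <= err.code <= 499:
               return ""   # unavailable: crawler MAY access anything
           return None     # 5xx or undefined: assume complete disallow
       except urllib.error.URLError:
           return None     # network error: unreachable
   <CODE ENDS>
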
2.3.1.5.  Parsing errors

   Crawlers SHOULD try to parse each line of the robots.txt file.
   Crawlers MUST use the parseable rules.

2.4.  Caching

   Crawlers MAY cache the fetched robots.txt file's contents.  Crawlers
   MAY use standard cache control as defined in RFC2616 [8].  Crawlers
   SHOULD NOT use the cached version for more than 24 hours, unless the
   robots.txt is unreachable.

2.5.  Limits

   Crawlers MAY impose a parsing limit, which MUST be at least 500
   kibibytes (KiB).

2.6.  Security Considerations

   The Robots Exclusion Protocol MUST NOT be used as a form of security
   measure.  Listing URIs in the robots.txt file exposes them publicly
   and thus makes them discoverable.

2.7.  IANA Considerations

   This document has no actions for IANA.

3.  Examples

3.1.  Simple example

   The following example shows:

   o  *foobot*: A regular case.  A single user-agent token followed by
      rules.

   o  *barbot and bazbot*: A group that's relevant for more than one
      user-agent.

   o  *quxbot*: Empty group at end of file.

   <CODE BEGINS>
   User-Agent : foobot
   Disallow : /example/page.html
   Disallow : /example/disallowed.gif

   User-Agent : barbot
   User-Agent : bazbot
   Allow : /example/page.html
   Disallow : /example/disallowed.gif

   User-Agent: quxbot

   EOF
   <CODE ENDS>

3.2.  Longest Match

   The following example shows that in the case of two matching rules,
   the longest one MUST be used.  In the following case, the rule
   /example/page/disallowed.gif MUST be used for the URI
   example.com/example/page/disallowed.gif.

   <CODE BEGINS>
   User-Agent : foobot
   Allow : /example/page/
   Disallow : /example/page/disallowed.gif
   <CODE ENDS>

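   Tying this to the matching sketch from Section 2.2.2, the rule with
   the most octets decides the outcome (non-normative):

   <CODE BEGINS>
   # Rules from the example above, as (kind, pattern) pairs.
   rules = [("allow", "/example/page/"),
            ("disallow", "/example/page/disallowed.gif")]

   def longest_match(rules, path):
       # Most octets wins; on equivalent lengths, allow is preferred.
       hits = [(len(p.encode("utf-8")), k == "allow")
               for k, p in rules if path.startswith(p)]
       return max(hits)[1] if hits else True

   assert longest_match(rules, "/example/page/index.html") is True
   assert longest_match(rules, "/example/page/disallowed.gif") is False
   <CODE ENDS>
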
4.  References

4.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in
              RFC 2119 Key Words", BCP 14, RFC 8174, May 2017.

4.2.  URIs

   [1] https://tools.ietf.org/html/rfc3986

   [2] https://tools.ietf.org/html/rfc8288

   [3] https://tools.ietf.org/html/rfc5234

   [4] https://www.sitemaps.org/index.html

   [5] https://tools.ietf.org/html/rfc3629

   [6] https://tools.ietf.org/html/rfc2046

   [7] https://tools.ietf.org/html/rfc1945

   [8] https://tools.ietf.org/html/rfc2616

Authors' Addresses

   Martijn Koster
   Stalworthy Manor Farm
   Suton Lane, NR18 9JG
   Wymondham, Norfolk
   United Kingdom
   Email: m.koster@greenhills.co.uk

   Gary Illyes
   Brandschenkestrasse 110
   8002, Zurich
   Switzerland
   Email: garyillyes@google.com

   Henner Zeller
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   USA
   Email: henner@google.com

   Lizzi Harvey
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   USA
   Email: lizzi@google.com