google_robotstxt_parser 0.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Network Working Group                                          M. Koster
Internet-Draft                                Stalworthy Computing, Ltd.
Intended status: Draft Standard                                G. Illyes
Expires: January 9, 2020                                       H. Zeller
                                                               L. Harvey
                                                                  Google
                                                           July 07, 2019


                       Robots Exclusion Protocol
                          draft-koster-rep-00

Abstract

   This document standardizes and extends the "Robots Exclusion
   Protocol" <http://www.robotstxt.org/> method originally defined by
   Martijn Koster in 1996 for service owners to control how content
   served by their services may be accessed, if at all, by automatic
   clients known as crawlers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This document may not be modified, and derivative works of it may not
   be created, except to format it for publication as an RFC or to
   translate it into languages other than English.

   This Internet-Draft will expire on January 9, 2020.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.
Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Specification
     2.1.  Protocol definition
     2.2.  Formal syntax
       2.2.1.  The user-agent line
       2.2.2.  The Allow and Disallow lines
       2.2.3.  Special characters
       2.2.4.  Other records
     2.3.  Access method
       2.3.1.  Access results
     2.4.  Caching
     2.5.  Limits
     2.6.  Security Considerations
     2.7.  IANA Considerations
   3.  Examples
     3.1.  Simple example
     3.2.  Longest Match
   4.  References
     4.1.  Normative References
     4.2.  URIs
   Authors' Addresses
1.  Introduction

   This document applies to services that provide resources that clients
   can access through URIs as defined in RFC3986 [1].  For example, in
   the context of HTTP, a browser is a client that displays the content
   of a web page.

   Crawlers are automated clients.  Search engines, for instance, have
   crawlers to recursively traverse links for indexing, as defined in
   RFC8288 [2].

   It may be inconvenient for service owners if crawlers visit the
   entirety of their URI space.  This document specifies the rules that
   crawlers MUST obey when accessing URIs.

   These rules are not a form of access authorization.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.
2.  Specification

2.1.  Protocol definition

   The protocol language consists of rule(s) and group(s):

   o  *Rule*: A line with a key-value pair that defines how a crawler
      may access URIs.  See section The Allow and Disallow lines.

   o  *Group*: One or more user-agent lines that are followed by one or
      more rules.  The group is terminated by a user-agent line or end
      of file.  See The user-agent line.  The last group may have no
      rules, which means it implicitly allows everything.
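
   As a non-normative illustration, these two notions map onto two
   small record types.  The sketch below is Python; the type names are
   ours, not part of the protocol.

   <CODE BEGINS>
   # Non-normative sketch of the protocol data model.
   from dataclasses import dataclass, field
   from typing import List

   @dataclass
   class Rule:
       directive: str     # "allow" or "disallow"
       path_pattern: str  # e.g. "/example/", or "" for an empty pattern

   @dataclass
   class Group:
       # One or more user-agent lines followed by zero or more rules.
       user_agents: List[str] = field(default_factory=list)
       rules: List[Rule] = field(default_factory=list)  # empty: allows
                                                        # everything
   <CODE ENDS>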

2.2.  Formal syntax

   Below is an Augmented Backus-Naur Form (ABNF) description, as
   described in RFC5234 [3].

   robotstxt = *(group / emptyline)
   group = startgroupline                ; We start with a user-agent
           *(startgroupline / emptyline) ; ... and possibly more
                                         ; user-agents
           *(rule / emptyline)           ; followed by rules relevant
                                         ; for UAs

   startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

   rule = *WS ("allow" / "disallow") *WS ":"
          *WS (path-pattern / empty-pattern) EOL

   ; parser implementors: add additional lines you need (for
   ; example Sitemaps), and be lenient when reading lines that don't
   ; conform.  Apply Postel's law.

   product-token = identifier / "*"
   path-pattern = "/" *(UTF8-char-noctl) ; valid URI path pattern
   empty-pattern = *WS

   identifier = 1*(%x2d / %x41-5a / %x5f / %x61-7a)
   comment = "#" *(UTF8-char-noctl / WS / "#")
   emptyline = EOL
   EOL = *WS [comment] NL ; end-of-line may have
                          ; optional trailing comment
   NL = %x0D / %x0A / %x0D.0A
   WS = %x20 / %x09

   ; UTF8 derived from RFC3629, but excluding control characters

   UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
   UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, '#'
   UTF8-2 = %xC2-DF UTF8-tail
   UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
            %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
   UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
            %xF4 %x80-8F 2( UTF8-tail )

   UTF8-tail = %x80-BF
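
   A parser can process the file one line at a time against this
   grammar.  The following non-normative sketch (Python; the function
   name parse_robotstxt_line is ours) tokenizes a single line
   leniently, in the spirit of the comment above: it strips trailing
   comments, ignores whitespace around the key and the ":", lower-cases
   the key, and skips lines it cannot interpret.

   <CODE BEGINS>
   # Non-normative sketch: lenient single-line tokenizer.
   from typing import Optional, Tuple

   def parse_robotstxt_line(line: str) -> Optional[Tuple[str, str]]:
       line = line.split("#", 1)[0]     # drop an end-of-line comment
       if ":" not in line:
           return None                  # not a key-value line; skip it
       key, _, value = line.partition(":")
       key = key.strip().lower()        # keys match case-insensitively
       value = value.strip()
       if key in ("user-agent", "allow", "disallow"):
           return (key, value)
       return None                      # unknown record; a lenient
                                        # parser may also surface these

   # Both lines below tokenize to ("disallow", "/tmp/"):
   assert parse_robotstxt_line("Disallow: /tmp/ # scratch") == \
       ("disallow", "/tmp/")
   assert parse_robotstxt_line("  disallow :/tmp/") == \
       ("disallow", "/tmp/")
   <CODE ENDS>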

2.2.1.  The user-agent line

   Crawlers set a product token to find relevant groups.  The product
   token MUST contain only "a-zA-Z_-" characters.  The product token
   SHOULD be part of the identification string that the crawler sends
   to the service (for example, in the case of HTTP, the product name
   SHOULD be in the user-agent header).  The identification string
   SHOULD describe the purpose of the crawler.  Here's an example of an
   HTTP header for the ExampleBot crawler, with a link pointing to a
   page describing its purpose; the product token appears both in the
   HTTP header and in the robots.txt user-agent line:

   +-------------------------------------------------+-----------------+
   | HTTP header                                     | robots.txt      |
   |                                                 | user-agent line |
   +-------------------------------------------------+-----------------+
   | user-agent: Mozilla/5.0 (compatible;            | user-agent:     |
   | ExampleBot/0.1;                                 | ExampleBot      |
   | https://www.example.com/bot.html)               |                 |
   +-------------------------------------------------+-----------------+

   Crawlers MUST find the group that matches the product token exactly,
   and then obey the rules of the group.  If there is more than one
   group matching the user-agent, the matching groups' rules MUST be
   combined into one group.  The matching MUST be case-insensitive.  If
   no matching group exists, crawlers MUST obey the first group with a
   user-agent line with a "*" value, if present.  If no group satisfies
   either condition, or no groups are present at all, no rules apply.
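
   These selection steps can be summarized in code.  The sketch below
   is non-normative Python; find_applicable_rules and the
   (user-agents, rules) pairs are illustrative representations, not
   part of the protocol.

   <CODE BEGINS>
   # Non-normative sketch: selecting the rules for a product token.
   # Each group is a pair (user_agents, rules).
   def find_applicable_rules(groups, product_token):
       token = product_token.lower()
       matched, rules = False, []
       for user_agents, group_rules in groups:
           if any(ua.lower() == token for ua in user_agents):
               matched = True
               rules.extend(group_rules)  # combine all matching groups
       if matched:
           return rules                   # may be empty: allow all
       for user_agents, group_rules in groups:
           if "*" in user_agents:
               return group_rules         # first "*" group, if present
       return []                          # no rules apply

   groups = [(["ExampleBot"], [("disallow", "/private/")]),
             (["*"], [("disallow", "/")])]
   # Matching is case-insensitive:
   assert find_applicable_rules(groups, "examplebot") == \
       [("disallow", "/private/")]
   <CODE ENDS>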

2.2.2.  The Allow and Disallow lines

   These lines indicate whether accessing a URI that matches the
   corresponding path is allowed or disallowed.

   To evaluate if access to a URI is allowed, a robot MUST match the
   paths in allow and disallow rules against the URI.  The matching
   SHOULD be case sensitive.  The most specific match found MUST be
   used.  The most specific match is the match that has the most octets.
   If an allow rule and a disallow rule are equivalent, the allow SHOULD
   be used.  If no match is found amongst the rules in a group for a
   matching user-agent, or there are no rules in the group, the URI is
   allowed.  The /robots.txt URI is implicitly allowed.
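
   Non-normatively, this evaluation order can be written as follows
   (Python; is_allowed is an illustrative name).  The sketch uses plain
   prefix matching; the special characters of section 2.2.3 and the
   percent-encoding normalization below are out of scope here.

   <CODE BEGINS>
   # Non-normative sketch: longest-match rule evaluation.
   def is_allowed(rules, path: str) -> bool:
       if path == "/robots.txt":
           return True                      # implicitly allowed
       best_len, best_allow = -1, True      # no match: allowed
       for directive, pattern in rules:
           if pattern and path.startswith(pattern):
               n = len(pattern)             # specificity = octet count
               allow = (directive == "allow")
               if n > best_len or (n == best_len and allow):
                   best_len, best_allow = n, allow  # ties favor allow
       return best_allow

   rules = [("disallow", "/"), ("allow", "/public/")]
   assert is_allowed(rules, "/public/doc.html")      # longer allow wins
   assert not is_allowed(rules, "/private/doc.html")
   <CODE ENDS>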

   Octets in the URI and robots.txt paths outside the range of the US-
   ASCII coded character set, and those in the reserved range defined by
   RFC3986 [1], MUST be percent-encoded as defined by RFC3986 [1] prior
   to comparison.

   If a percent-encoded US-ASCII octet is encountered in the URI, it
   MUST be decoded prior to comparison, unless it is a reserved
   character in the URI as defined by RFC3986 [1] or the character is
   outside the unreserved character range.  The match evaluates
   positively if and only if the end of the path from the rule is
   reached before a difference in octets is encountered.

   For example:

   +--------------------+----------------------+----------------------+
   | Path               | Encoded Path         | Path to match        |
   +--------------------+----------------------+----------------------+
   | /foo/bar?baz=quz   | /foo/bar?baz=quz     | /foo/bar?baz=quz     |
   |                    |                      |                      |
   | /foo/bar?baz=      | /foo/bar?baz=        | /foo/bar?baz=        |
   | http://foo.bar     | http%3A%2F%2Ffoo.bar | http%3A%2F%2Ffoo.bar |
   |                    |                      |                      |
   | /foo/bar/U+E38384  | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   |
   |                    |                      |                      |
   | /foo/bar/%E3%83%84 | /foo/bar/%E3%83%84   | /foo/bar/%E3%83%84   |
   |                    |                      |                      |
   | /foo/bar/%62%61%7A | /foo/bar/%62%61%7A   | /foo/bar/baz         |
   +--------------------+----------------------+----------------------+
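
   The following non-normative Python sketch implements the two
   encoding steps for the cases shown in the table.  The function name
   normalize is ours, and raw reserved characters (compare the first
   table row, where "?" and "=" stay as-is) are left untouched here.

   <CODE BEGINS>
   # Non-normative sketch: normalizing a path before comparison.
   import re

   UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                 "abcdefghijklmnopqrstuvwxyz0123456789-._~")

   def normalize(path: str) -> str:
       # 1. Percent-encode raw octets outside US-ASCII (UTF-8 text).
       out = []
       for ch in path:
           if ord(ch) > 0x7F:
               out.append("".join("%%%02X" % b
                                  for b in ch.encode("utf-8")))
           else:
               out.append(ch)
       path = "".join(out)
       # 2. Decode %XX escapes of unreserved US-ASCII characters only;
       #    reserved and non-ASCII escapes stay encoded.
       def maybe_decode(m):
           c = chr(int(m.group(1), 16))
           return c if c in UNRESERVED else m.group(0).upper()
       return re.sub(r"%([0-9A-Fa-f]{2})", maybe_decode, path)

   assert normalize("/foo/bar/\u30c4") == "/foo/bar/%E3%83%84"
   assert normalize("/foo/bar/%62%61%7A") == "/foo/bar/baz"
   <CODE ENDS>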

   The crawler SHOULD ignore "disallow" and "allow" rules that are not
   in any group (for example, any rule that precedes the first user-
   agent line).

   Implementers MAY bridge encoding mismatches if they detect that the
   robots.txt file is not UTF-8 encoded.

2.2.3.  Special characters

   Crawlers SHOULD allow the following special characters:

   +-----------+--------------------------------+----------------------+
   | Character | Description                    | Example              |
   +-----------+--------------------------------+----------------------+
   | "#"       | Designates an end of line      | "allow: / # comment  |
   |           | comment.                       | in line"             |
   |           |                                |                      |
   |           |                                | "# comment at the    |
   |           |                                | end"                 |
   |           |                                |                      |
   | "$"       | Designates the end of the      | "allow:              |
   |           | match pattern.  The URI MUST   | /this/path/exactly$" |
   |           | end where the pattern ends     |                      |
   |           | for the rule to match.         |                      |
   |           |                                |                      |
   | "*"       | Designates 0 or more instances | "allow:              |
   |           | of any character.              | /this/*/exactly"     |
   +-----------+--------------------------------+----------------------+

   To match the special characters "$" or "*" verbatim in the URI,
   crawlers SHOULD use "%" encoding in the pattern.  For example:

   +------------------------+------------------------------------------+
   | Pattern                | URI                                      |
   +------------------------+------------------------------------------+
   | /path/file-with-a-     | https://www.example.com/path/file-       |
   | %2A.html               | with-a-*.html                            |
   |                        |                                          |
   | /path/foo-%24          | https://www.example.com/path/foo-$       |
   +------------------------+------------------------------------------+
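
   Non-normative sketch of this matching (Python; pattern_matches is an
   illustrative name).  "*" is translated to a regular-expression
   wildcard and "$" anchors the match at the end of the URI:

   <CODE BEGINS>
   # Non-normative sketch: matching with "*" and "$".
   import re

   def pattern_matches(pattern: str, path: str) -> bool:
       anchored = pattern.endswith("$")   # "$": URI must end here
       if anchored:
           pattern = pattern[:-1]
       # "*" matches zero or more characters; the rest is literal.
       parts = [re.escape(p) for p in pattern.split("*")]
       regex = "^" + ".*".join(parts) + ("$" if anchored else "")
       return re.match(regex, path) is not None

   assert pattern_matches("/this/*/exactly", "/this/and/that/exactly")
   assert pattern_matches("/this/path/exactly$", "/this/path/exactly")
   assert not pattern_matches("/this/path/exactly$",
                              "/this/path/exactly.html")
   <CODE ENDS>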

2.2.4.  Other records

   Clients MAY interpret other records that are not part of the
   robots.txt protocol.  For example, 'sitemap' [4].

2.3.  Access method

   The rules MUST be accessible in a file named "/robots.txt" (all lower
   case) in the top level path of the service.  The file MUST be UTF-8
   encoded (as defined in RFC3629 [5]) and Internet Media Type "text/
   plain" (as defined in RFC2046 [6]).

   As per RFC3986 [1], the URI of the robots.txt is:

   "scheme:[//authority]/robots.txt"

   For example, in the context of HTTP or FTP, the URI is:

   http://www.example.com/robots.txt

   https://www.example.com/robots.txt

   ftp://ftp.example.com/robots.txt
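
   Non-normatively, deriving the robots.txt URI from any resource URI
   keeps the scheme and authority and replaces the rest (Python sketch;
   robots_txt_uri is our name):

   <CODE BEGINS>
   # Non-normative sketch: deriving the robots.txt URI.
   from urllib.parse import urlsplit, urlunsplit

   def robots_txt_uri(resource_uri: str) -> str:
       scheme, authority, _, _, _ = urlsplit(resource_uri)
       # Same scheme and authority, fixed path, no query or fragment.
       return urlunsplit((scheme, authority, "/robots.txt", "", ""))

   assert (robots_txt_uri("https://www.example.com/a/b?q=1")
           == "https://www.example.com/robots.txt")
   <CODE ENDS>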

2.3.1.  Access results

2.3.1.1.  Successful access

   If the crawler successfully downloads the robots.txt, the crawler
   MUST follow the parseable rules.

2.3.1.2.  Redirects

   The server may respond to a robots.txt fetch request with a redirect,
   such as HTTP 301 or HTTP 302.  Crawlers SHOULD follow at least five
   consecutive redirects, even across authorities (for example, hosts in
   the case of HTTP), as defined in RFC1945 [7].

   If a robots.txt file is reached within five consecutive redirects,
   the robots.txt file MUST be fetched, parsed, and its rules followed
   in the context of the initial authority.

   If there are more than five consecutive redirects, crawlers MAY
   assume that the robots.txt is unavailable.
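
   A non-normative sketch of the redirect handling (Python): fetch is a
   stand-in for whatever HTTP client the crawler uses, assumed here to
   return a (status, location, body) triple; the name and signature are
   ours.

   <CODE BEGINS>
   # Non-normative sketch: fetching robots.txt with a redirect cap.
   MAX_REDIRECTS = 5

   def fetch_robots_txt(uri: str, fetch):
       # fetch(uri) is assumed to return (status, location, body).
       for _ in range(MAX_REDIRECTS + 1):
           status, location, body = fetch(uri)
           if status in (301, 302) and location:
               uri = location     # redirects may cross authorities
               continue
           return status, body    # rules apply to the INITIAL authority
       return None, None          # more than five redirects: treat
                                  # robots.txt as unavailable
   <CODE ENDS>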

2.3.1.3.  Unavailable status

   Unavailable means the crawler tries to fetch the robots.txt and the
   server responds with an unavailable status code.  For example, in the
   context of HTTP, unavailable status codes are in the 400-499 range.

   If a server status code indicates that the robots.txt file is
   unavailable to the client, then crawlers MAY access any resources on
   the server or MAY use a cached version of a robots.txt file for up to
   24 hours.

2.3.1.4.  Unreachable status

   If the robots.txt is unreachable due to server or network errors,
   this means the robots.txt is undefined and the crawler MUST assume
   complete disallow.  For example, in the context of HTTP, an
   unreachable robots.txt has a response code in the 500-599 range.  For
   other undefined status codes, the crawler MUST assume the robots.txt
   is unreachable.

   If the robots.txt is undefined for a reasonably long period of time
   (for example, 30 days), clients MAY assume the robots.txt is
   unavailable or continue to use a cached copy.
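
   The status handling of sections 2.3.1.3 and 2.3.1.4 amounts to a
   small decision table, sketched non-normatively below (Python; the
   policy strings are illustrative, and redirects are assumed to be
   resolved before this point per section 2.3.1.2):

   <CODE BEGINS>
   # Non-normative sketch: status code to crawl policy.
   from typing import Optional

   def access_policy(status: Optional[int]) -> str:
       if status is None:            # network error: undefined,
           return "disallow-all"     # assume complete disallow
       if 200 <= status < 300:
           return "use-rules"        # parse and follow the rules
       if 400 <= status < 500:
           return "allow-all"        # unavailable: access MAY be
                                     # unrestricted (section 2.3.1.3)
       return "disallow-all"         # 5xx and other undefined codes:
                                     # unreachable (section 2.3.1.4)
   <CODE ENDS>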

2.3.1.5.  Parsing errors

   Crawlers SHOULD try to parse each line of the robots.txt file.
   Crawlers MUST use the parseable rules.

2.4.  Caching

   Crawlers MAY cache the fetched robots.txt file's contents.  Crawlers
   MAY use standard cache control as defined in RFC2616 [8].  Crawlers
   SHOULD NOT use the cached version for more than 24 hours, unless the
   robots.txt is unreachable.

2.5.  Limits

   Crawlers MAY impose a parsing limit that MUST be at least 500
   kibibytes (KiB).
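
   Non-normatively, the two numeric requirements above can be pinned
   down as constants (Python sketch; the names are ours):

   <CODE BEGINS>
   # Non-normative sketch: caching and parsing limits as constants.
   MAX_CACHE_AGE_SECONDS = 24 * 60 * 60  # SHOULD NOT serve a cached
                                         # copy older than this, unless
                                         # robots.txt is unreachable
   MIN_PARSE_LIMIT_BYTES = 500 * 1024    # a parsing limit MUST be at
                                         # least 500 kibibytes

   def bytes_to_parse(body: bytes) -> bytes:
       # A crawler imposing the minimum limit parses only this prefix.
       return body[:MIN_PARSE_LIMIT_BYTES]
   <CODE ENDS>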

2.6.  Security Considerations

   The Robots Exclusion Protocol MUST NOT be used as a form of security
   measure.  Listing URIs in the robots.txt file exposes them publicly,
   thus making them discoverable.

2.7.  IANA Considerations

   This document has no actions for IANA.

3.  Examples

3.1.  Simple example

   The following example shows:

   o  *foobot*: A regular case.  A single user-agent token followed by
      rules.

   o  *barbot and bazbot*: A group that's relevant for more than one
      user-agent.

   o  *quxbot*: An empty group at the end of the file.

   <CODE BEGINS>
   User-Agent : foobot
   Disallow : /example/page.html
   Disallow : /example/disallowed.gif

   User-Agent : barbot
   User-Agent : bazbot
   Allow : /example/page.html
   Disallow : /example/disallowed.gif

   User-Agent: quxbot

   EOF
   <CODE ENDS>

3.2.  Longest Match

   The following example shows that in the case of two matching rules,
   the longest one MUST be used for matching.  In the following case,
   the rule /example/page/disallowed.gif MUST be used for the URI
   example.com/example/page/disallowed.gif.

   <CODE BEGINS>
   User-Agent : foobot
   Allow : /example/page/
   Disallow : /example/page/disallowed.gif
   <CODE ENDS>

4.  References

4.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in
              RFC 2119 Key Words", BCP 14, RFC 8174, May 2017.

4.2.  URIs

   [1] https://tools.ietf.org/html/rfc3986

   [2] https://tools.ietf.org/html/rfc8288

   [3] https://tools.ietf.org/html/rfc5234

   [4] https://www.sitemaps.org/index.html

   [5] https://tools.ietf.org/html/rfc3629

   [6] https://tools.ietf.org/html/rfc2046

   [7] https://tools.ietf.org/html/rfc1945

   [8] https://tools.ietf.org/html/rfc2616

Authors' Addresses

   Martijn Koster
   Stalworthy Manor Farm
   Suton Lane, NR18 9JG
   Wymondham, Norfolk
   United Kingdom
   Email: m.koster@greenhills.co.uk

   Gary Illyes
   Brandschenkestrasse 110
   8002, Zurich
   Switzerland
   Email: garyillyes@google.com

   Henner Zeller
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   USA
   Email: henner@google.com

   Lizzi Harvey
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   USA
   Email: lizzi@google.com