addressable 2.2.1 → 2.2.2

data/CHANGELOG CHANGED
@@ -1,3 +1,6 @@
+=== Addressable 2.2.2
+- fixed issue with percent escaping of '+' character in query strings
+
 === Addressable 2.2.1
 - added support for application/x-www-form-urlencoded.
 
@@ -1430,7 +1430,7 @@ module Addressable
           value = true if value.nil?
           key = self.class.unencode_component(key)
           if value != true
-            value = self.class.unencode_component(value).gsub(/\+/, " ")
+            value = self.class.unencode_component(value.gsub(/\+/, " "))
           end
           if options[:notation] == :flat
             if accumulator[key]
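
The reordering in this hunk matters because percent-decoding can itself produce a literal '+'. The sketch below illustrates the difference under stated assumptions: `unencode` is a hypothetical stand-in for percent-decoding, not Addressable's own `unencode_component`, and `old_decode` / `new_decode` are illustrative names for the 2.2.1 and 2.2.2 orderings.

```ruby
# Hypothetical stand-in for percent-decoding (not Addressable's implementation).
def unencode(str)
  str.gsub(/%([0-9A-Fa-f]{2})/) { Regexp.last_match(1).hex.chr }
end

# 2.2.1 ordering: decode first, then turn every '+' into a space.
# A '+' that was escaped as %2b is decoded and then wrongly replaced.
def old_decode(value)
  unencode(value).gsub(/\+/, " ")
end

# 2.2.2 ordering: turn literal '+' into a space first, then decode.
# An escaped %2b survives as a real '+'.
def new_decode(value)
  unencode(value.gsub(/\+/, " "))
end

p old_decode("a%2bb")  # => "a b"  (escaped plus is lost)
p new_decode("a%2bb")  # => "a+b"
p new_decode("a+b")    # => "a b"
```

Both orderings agree on an unescaped '+'; they differ only when the value contains `%2b` (or `%2B`), which is the case the new specs cover.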
@@ -28,7 +28,7 @@ if !defined?(Addressable::VERSION)
     module VERSION #:nodoc:
       MAJOR = 2
       MINOR = 2
-      TINY = 1
+      TINY = 2
 
       STRING = [MAJOR, MINOR, TINY].join('.')
     end
@@ -2543,6 +2543,36 @@ describe Addressable::URI, "when parsed from " +
   end
 end
 
+describe Addressable::URI, "when parsed from " +
+    "'http://example.com/?q=a+b'" do
+  before do
+    @uri = Addressable::URI.parse("http://example.com/?q=a+b")
+  end
+
+  it "should have a query of 'q=a+b'" do
+    @uri.query.should == "q=a+b"
+  end
+
+  it "should have query_values of {'q' => 'a b'}" do
+    @uri.query_values.should == {'q' => 'a b'}
+  end
+end
+
+describe Addressable::URI, "when parsed from " +
+    "'http://example.com/?q=a%2bb'" do
+  before do
+    @uri = Addressable::URI.parse("http://example.com/?q=a%2bb")
+  end
+
+  it "should have a query of 'q=a%2bb'" do
+    @uri.query.should == "q=a%2bb"
+  end
+
+  it "should have query_values of {'q' => 'a+b'}" do
+    @uri.query_values.should == {'q' => 'a+b'}
+  end
+end
+
 describe Addressable::URI, "when parsed from " +
     "'http://example.com/?q='" do
   before do
@@ -1,6 +1,6 @@
 namespace :gem do
   desc 'Package and upload to RubyForge'
-  task :release => ["gem:package"] do |t|
+  task :release => ["gem:package", "gem:gemspec"] do |t|
     require 'rubyforge'
 
     v = ENV['VERSION'] or abort 'Must supply VERSION=x.y.z'
metadata CHANGED
@@ -1,13 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: addressable
 version: !ruby/object:Gem::Version
-  hash: 5
+  hash: 3
   prerelease: false
   segments:
   - 2
   - 2
-  - 1
-  version: 2.2.1
+  - 2
+  version: 2.2.2
 platform: ruby
 authors:
 - Bob Aman
@@ -15,7 +15,7 @@ autorequire:
 bindir: bin
 cert_chain: []
 
-date: 2010-08-19 00:00:00 -07:00
+date: 2010-10-12 00:00:00 -07:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
@@ -102,7 +102,6 @@ files:
 - spec/addressable/idna_spec.rb
 - spec/addressable/template_spec.rb
 - spec/addressable/uri_spec.rb
-- spec/data/rfc3986.txt
 - tasks/clobber.rake
 - tasks/gem.rake
 - tasks/git.rake
@@ -1,3419 +0,0 @@
1
-
2
-
3
-
4
-
5
-
6
-
7
- Network Working Group T. Berners-Lee
8
- Request for Comments: 3986 W3C/MIT
9
- STD: 66 R. Fielding
10
- Updates: 1738 Day Software
11
- Obsoletes: 2732, 2396, 1808 L. Masinter
12
- Category: Standards Track Adobe Systems
13
- January 2005
14
-
15
-
16
- Uniform Resource Identifier (URI): Generic Syntax
17
-
18
- Status of This Memo
19
-
20
- This document specifies an Internet standards track protocol for the
21
- Internet community, and requests discussion and suggestions for
22
- improvements. Please refer to the current edition of the "Internet
23
- Official Protocol Standards" (STD 1) for the standardization state
24
- and status of this protocol. Distribution of this memo is unlimited.
25
-
26
- Copyright Notice
27
-
28
- Copyright (C) The Internet Society (2005).
29
-
30
- Abstract
31
-
32
- A Uniform Resource Identifier (URI) is a compact sequence of
33
- characters that identifies an abstract or physical resource. This
34
- specification defines the generic URI syntax and a process for
35
- resolving URI references that might be in relative form, along with
36
- guidelines and security considerations for the use of URIs on the
37
- Internet. The URI syntax defines a grammar that is a superset of all
38
- valid URIs, allowing an implementation to parse the common components
39
- of a URI reference without knowing the scheme-specific requirements
40
- of every possible identifier. This specification does not define a
41
- generative grammar for URIs; that task is performed by the individual
42
- specifications of each URI scheme.
43
-
44
-
45
-
46
-
47
-
48
-
49
-
50
-
51
-
52
-
53
-
54
-
55
-
56
-
57
-
58
- Berners-Lee, et al. Standards Track [Page 1]
59
-
60
- RFC 3986 URI Generic Syntax January 2005
61
-
62
-
63
- Table of Contents
64
-
65
- 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
66
- 1.1. Overview of URIs . . . . . . . . . . . . . . . . . . . . 4
67
- 1.1.1. Generic Syntax . . . . . . . . . . . . . . . . . 6
68
- 1.1.2. Examples . . . . . . . . . . . . . . . . . . . . 7
69
- 1.1.3. URI, URL, and URN . . . . . . . . . . . . . . . 7
70
- 1.2. Design Considerations . . . . . . . . . . . . . . . . . 8
71
- 1.2.1. Transcription . . . . . . . . . . . . . . . . . 8
72
- 1.2.2. Separating Identification from Interaction . . . 9
73
- 1.2.3. Hierarchical Identifiers . . . . . . . . . . . . 10
74
- 1.3. Syntax Notation . . . . . . . . . . . . . . . . . . . . 11
75
- 2. Characters . . . . . . . . . . . . . . . . . . . . . . . . . . 11
76
- 2.1. Percent-Encoding . . . . . . . . . . . . . . . . . . . . 12
77
- 2.2. Reserved Characters . . . . . . . . . . . . . . . . . . 12
78
- 2.3. Unreserved Characters . . . . . . . . . . . . . . . . . 13
79
- 2.4. When to Encode or Decode . . . . . . . . . . . . . . . . 14
80
- 2.5. Identifying Data . . . . . . . . . . . . . . . . . . . . 14
81
- 3. Syntax Components . . . . . . . . . . . . . . . . . . . . . . 16
82
- 3.1. Scheme . . . . . . . . . . . . . . . . . . . . . . . . . 17
83
- 3.2. Authority . . . . . . . . . . . . . . . . . . . . . . . 17
84
- 3.2.1. User Information . . . . . . . . . . . . . . . . 18
85
- 3.2.2. Host . . . . . . . . . . . . . . . . . . . . . . 18
86
- 3.2.3. Port . . . . . . . . . . . . . . . . . . . . . . 22
87
- 3.3. Path . . . . . . . . . . . . . . . . . . . . . . . . . . 22
88
- 3.4. Query . . . . . . . . . . . . . . . . . . . . . . . . . 23
89
- 3.5. Fragment . . . . . . . . . . . . . . . . . . . . . . . . 24
90
- 4. Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
91
- 4.1. URI Reference . . . . . . . . . . . . . . . . . . . . . 25
92
- 4.2. Relative Reference . . . . . . . . . . . . . . . . . . . 26
93
- 4.3. Absolute URI . . . . . . . . . . . . . . . . . . . . . . 27
94
- 4.4. Same-Document Reference . . . . . . . . . . . . . . . . 27
95
- 4.5. Suffix Reference . . . . . . . . . . . . . . . . . . . . 27
96
- 5. Reference Resolution . . . . . . . . . . . . . . . . . . . . . 28
97
- 5.1. Establishing a Base URI . . . . . . . . . . . . . . . . 28
98
- 5.1.1. Base URI Embedded in Content . . . . . . . . . . 29
99
- 5.1.2. Base URI from the Encapsulating Entity . . . . . 29
100
- 5.1.3. Base URI from the Retrieval URI . . . . . . . . 30
101
- 5.1.4. Default Base URI . . . . . . . . . . . . . . . . 30
102
- 5.2. Relative Resolution . . . . . . . . . . . . . . . . . . 30
103
- 5.2.1. Pre-parse the Base URI . . . . . . . . . . . . . 31
104
- 5.2.2. Transform References . . . . . . . . . . . . . . 31
105
- 5.2.3. Merge Paths . . . . . . . . . . . . . . . . . . 32
106
- 5.2.4. Remove Dot Segments . . . . . . . . . . . . . . 33
107
- 5.3. Component Recomposition . . . . . . . . . . . . . . . . 35
108
- 5.4. Reference Resolution Examples . . . . . . . . . . . . . 35
109
- 5.4.1. Normal Examples . . . . . . . . . . . . . . . . 36
110
- 5.4.2. Abnormal Examples . . . . . . . . . . . . . . . 36
111
-
112
-
113
-
114
- Berners-Lee, et al. Standards Track [Page 2]
115
-
116
- RFC 3986 URI Generic Syntax January 2005
117
-
118
-
119
- 6. Normalization and Comparison . . . . . . . . . . . . . . . . . 38
120
- 6.1. Equivalence . . . . . . . . . . . . . . . . . . . . . . 38
121
- 6.2. Comparison Ladder . . . . . . . . . . . . . . . . . . . 39
122
- 6.2.1. Simple String Comparison . . . . . . . . . . . . 39
123
- 6.2.2. Syntax-Based Normalization . . . . . . . . . . . 40
124
- 6.2.3. Scheme-Based Normalization . . . . . . . . . . . 41
125
- 6.2.4. Protocol-Based Normalization . . . . . . . . . . 42
126
- 7. Security Considerations . . . . . . . . . . . . . . . . . . . 43
127
- 7.1. Reliability and Consistency . . . . . . . . . . . . . . 43
128
- 7.2. Malicious Construction . . . . . . . . . . . . . . . . . 43
129
- 7.3. Back-End Transcoding . . . . . . . . . . . . . . . . . . 44
130
- 7.4. Rare IP Address Formats . . . . . . . . . . . . . . . . 45
131
- 7.5. Sensitive Information . . . . . . . . . . . . . . . . . 45
132
- 7.6. Semantic Attacks . . . . . . . . . . . . . . . . . . . . 45
133
- 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 46
134
- 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 46
135
- 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 46
136
- 10.1. Normative References . . . . . . . . . . . . . . . . . . 46
137
- 10.2. Informative References . . . . . . . . . . . . . . . . . 47
138
- A. Collected ABNF for URI . . . . . . . . . . . . . . . . . . . . 49
139
- B. Parsing a URI Reference with a Regular Expression . . . . . . 50
140
- C. Delimiting a URI in Context . . . . . . . . . . . . . . . . . 51
141
- D. Changes from RFC 2396 . . . . . . . . . . . . . . . . . . . . 53
142
- D.1. Additions . . . . . . . . . . . . . . . . . . . . . . . 53
143
- D.2. Modifications . . . . . . . . . . . . . . . . . . . . . 53
144
- Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
145
- Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 60
146
- Full Copyright Statement . . . . . . . . . . . . . . . . . . . . . 61
147
-
148
-
149
-
150
-
151
-
152
-
153
-
154
-
155
-
156
-
157
-
158
-
159
-
160
-
161
-
162
-
163
-
164
-
165
-
166
-
167
-
168
-
169
-
170
- Berners-Lee, et al. Standards Track [Page 3]
171
-
172
- RFC 3986 URI Generic Syntax January 2005
173
-
174
-
175
- 1. Introduction
176
-
177
- A Uniform Resource Identifier (URI) provides a simple and extensible
178
- means for identifying a resource. This specification of URI syntax
179
- and semantics is derived from concepts introduced by the World Wide
180
- Web global information initiative, whose use of these identifiers
181
- dates from 1990 and is described in "Universal Resource Identifiers
182
- in WWW" [RFC1630]. The syntax is designed to meet the
183
- recommendations laid out in "Functional Recommendations for Internet
184
- Resource Locators" [RFC1736] and "Functional Requirements for Uniform
185
- Resource Names" [RFC1737].
186
-
187
- This document obsoletes [RFC2396], which merged "Uniform Resource
188
- Locators" [RFC1738] and "Relative Uniform Resource Locators"
189
- [RFC1808] in order to define a single, generic syntax for all URIs.
190
- It obsoletes [RFC2732], which introduced syntax for an IPv6 address.
191
- It excludes portions of RFC 1738 that defined the specific syntax of
192
- individual URI schemes; those portions will be updated as separate
193
- documents. The process for registration of new URI schemes is
194
- defined separately by [BCP35]. Advice for designers of new URI
195
- schemes can be found in [RFC2718]. All significant changes from RFC
196
- 2396 are noted in Appendix D.
197
-
198
- This specification uses the terms "character" and "coded character
199
- set" in accordance with the definitions provided in [BCP19], and
200
- "character encoding" in place of what [BCP19] refers to as a
201
- "charset".
202
-
203
- 1.1. Overview of URIs
204
-
205
- URIs are characterized as follows:
206
-
207
- Uniform
208
-
209
- Uniformity provides several benefits. It allows different types
210
- of resource identifiers to be used in the same context, even when
211
- the mechanisms used to access those resources may differ. It
212
- allows uniform semantic interpretation of common syntactic
213
- conventions across different types of resource identifiers. It
214
- allows introduction of new types of resource identifiers without
215
- interfering with the way that existing identifiers are used. It
216
- allows the identifiers to be reused in many different contexts,
217
- thus permitting new applications or protocols to leverage a pre-
218
- existing, large, and widely used set of resource identifiers.
219
-
220
-
221
-
222
-
223
-
224
-
225
-
226
- Berners-Lee, et al. Standards Track [Page 4]
227
-
228
- RFC 3986 URI Generic Syntax January 2005
229
-
230
-
231
- Resource
232
-
233
- This specification does not limit the scope of what might be a
234
- resource; rather, the term "resource" is used in a general sense
235
- for whatever might be identified by a URI. Familiar examples
236
- include an electronic document, an image, a source of information
237
- with a consistent purpose (e.g., "today's weather report for Los
238
- Angeles"), a service (e.g., an HTTP-to-SMS gateway), and a
239
- collection of other resources. A resource is not necessarily
240
- accessible via the Internet; e.g., human beings, corporations, and
241
- bound books in a library can also be resources. Likewise,
242
- abstract concepts can be resources, such as the operators and
243
- operands of a mathematical equation, the types of a relationship
244
- (e.g., "parent" or "employee"), or numeric values (e.g., zero,
245
- one, and infinity).
246
-
247
- Identifier
248
-
249
- An identifier embodies the information required to distinguish
250
- what is being identified from all other things within its scope of
251
- identification. Our use of the terms "identify" and "identifying"
252
- refer to this purpose of distinguishing one resource from all
253
- other resources, regardless of how that purpose is accomplished
254
- (e.g., by name, address, or context). These terms should not be
255
- mistaken as an assumption that an identifier defines or embodies
256
- the identity of what is referenced, though that may be the case
257
- for some identifiers. Nor should it be assumed that a system
258
- using URIs will access the resource identified: in many cases,
259
- URIs are used to denote resources without any intention that they
260
- be accessed. Likewise, the "one" resource identified might not be
261
- singular in nature (e.g., a resource might be a named set or a
262
- mapping that varies over time).
263
-
264
- A URI is an identifier consisting of a sequence of characters
265
- matching the syntax rule named <URI> in Section 3. It enables
266
- uniform identification of resources via a separately defined
267
- extensible set of naming schemes (Section 3.1). How that
268
- identification is accomplished, assigned, or enabled is delegated to
269
- each scheme specification.
270
-
271
- This specification does not place any limits on the nature of a
272
- resource, the reasons why an application might seek to refer to a
273
- resource, or the kinds of systems that might use URIs for the sake of
274
- identifying resources. This specification does not require that a
275
- URI persists in identifying the same resource over time, though that
276
- is a common goal of all URI schemes. Nevertheless, nothing in this
277
-
278
-
279
-
280
-
281
-
282
- Berners-Lee, et al. Standards Track [Page 5]
283
-
284
- RFC 3986 URI Generic Syntax January 2005
285
-
286
-
287
- specification prevents an application from limiting itself to
288
- particular types of resources, or to a subset of URIs that maintains
289
- characteristics desired by that application.
290
-
291
- URIs have a global scope and are interpreted consistently regardless
292
- of context, though the result of that interpretation may be in
293
- relation to the end-user's context. For example, "http://localhost/"
294
- has the same interpretation for every user of that reference, even
295
- though the network interface corresponding to "localhost" may be
296
- different for each end-user: interpretation is independent of access.
297
- However, an action made on the basis of that reference will take
298
- place in relation to the end-user's context, which implies that an
299
- action intended to refer to a globally unique thing must use a URI
300
- that distinguishes that resource from all other things. URIs that
301
- identify in relation to the end-user's local context should only be
302
- used when the context itself is a defining aspect of the resource,
303
- such as when an on-line help manual refers to a file on the end-
304
- user's file system (e.g., "file:///etc/hosts").
305
-
-1.1.1.  Generic Syntax
-
-   Each URI begins with a scheme name, as defined in Section 3.1, that
-   refers to a specification for assigning identifiers within that
-   scheme.  As such, the URI syntax is a federated and extensible naming
-   system wherein each scheme's specification may further restrict the
-   syntax and semantics of identifiers using that scheme.
-
-   This specification defines those elements of the URI syntax that are
-   required of all URI schemes or are common to many URI schemes.  It
-   thus defines the syntax and semantics needed to implement a scheme-
-   independent parsing mechanism for URI references, by which the
-   scheme-dependent handling of a URI can be postponed until the
-   scheme-dependent semantics are needed.  Likewise, protocols and data
-   formats that make use of URI references can refer to this
-   specification as a definition for the range of syntax allowed for all
-   URIs, including those schemes that have yet to be defined.  This
-   decouples the evolution of identification schemes from the evolution
-   of protocols, data formats, and implementations that make use of
-   URIs.
-
-   A parser of the generic URI syntax can parse any URI reference into
-   its major components.  Once the scheme is determined, further
-   scheme-specific parsing can be performed on the components.  In other
-   words, the URI generic syntax is a superset of the syntax of all URI
-   schemes.
332
-
333
-
334
-
335
-
336
-
337
-
338
- Berners-Lee, et al. Standards Track [Page 6]
339
-
340
- RFC 3986 URI Generic Syntax January 2005
341
-
342
-
343
- 1.1.2. Examples
344
-
345
- The following example URIs illustrate several URI schemes and
346
- variations in their common syntax components:
347
-
348
- ftp://ftp.is.co.za/rfc/rfc1808.txt
349
-
350
- http://www.ietf.org/rfc/rfc2396.txt
351
-
352
- ldap://[2001:db8::7]/c=GB?objectClass?one
353
-
354
- mailto:John.Doe@example.com
355
-
356
- news:comp.infosystems.www.servers.unix
357
-
358
- tel:+1-816-555-1212
359
-
360
- telnet://192.0.2.16:80/
361
-
362
- urn:oasis:names:specification:docbook:dtd:xml:4.1.2
363
-
364
-
365
- 1.1.3. URI, URL, and URN
366
-
367
- A URI can be further classified as a locator, a name, or both. The
368
- term "Uniform Resource Locator" (URL) refers to the subset of URIs
369
- that, in addition to identifying a resource, provide a means of
370
- locating the resource by describing its primary access mechanism
371
- (e.g., its network "location"). The term "Uniform Resource Name"
372
- (URN) has been used historically to refer to both URIs under the
373
- "urn" scheme [RFC2141], which are required to remain globally unique
374
- and persistent even when the resource ceases to exist or becomes
375
- unavailable, and to any other URI with the properties of a name.
376
-
377
- An individual scheme does not have to be classified as being just one
378
- of "name" or "locator". Instances of URIs from any given scheme may
379
- have the characteristics of names or locators or both, often
380
- depending on the persistence and care in the assignment of
381
- identifiers by the naming authority, rather than on any quality of
382
- the scheme. Future specifications and related documentation should
383
- use the general term "URI" rather than the more restrictive terms
384
- "URL" and "URN" [RFC3305].
385
-
386
-
387
-
388
-
389
-
390
-
391
-
392
-
393
-
394
- Berners-Lee, et al. Standards Track [Page 7]
395
-
396
- RFC 3986 URI Generic Syntax January 2005
397
-
398
-
399
- 1.2. Design Considerations
400
-
401
- 1.2.1. Transcription
402
-
403
- The URI syntax has been designed with global transcription as one of
404
- its main considerations. A URI is a sequence of characters from a
405
- very limited set: the letters of the basic Latin alphabet, digits,
406
- and a few special characters. A URI may be represented in a variety
407
- of ways; e.g., ink on paper, pixels on a screen, or a sequence of
408
- character encoding octets. The interpretation of a URI depends only
409
- on the characters used and not on how those characters are
410
- represented in a network protocol.
411
-
412
- The goal of transcription can be described by a simple scenario.
413
- Imagine two colleagues, Sam and Kim, sitting in a pub at an
414
- international conference and exchanging research ideas. Sam asks Kim
415
- for a location to get more information, so Kim writes the URI for the
416
- research site on a napkin. Upon returning home, Sam takes out the
417
- napkin and types the URI into a computer, which then retrieves the
418
- information to which Kim referred.
419
-
420
- There are several design considerations revealed by the scenario:
421
-
422
- o A URI is a sequence of characters that is not always represented
423
- as a sequence of octets.
424
-
425
- o A URI might be transcribed from a non-network source and thus
426
- should consist of characters that are most likely able to be
427
- entered into a computer, within the constraints imposed by
428
- keyboards (and related input devices) across languages and
429
- locales.
430
-
431
- o A URI often has to be remembered by people, and it is easier for
432
- people to remember a URI when it consists of meaningful or
433
- familiar components.
434
-
435
- These design considerations are not always in alignment. For
436
- example, it is often the case that the most meaningful name for a URI
437
- component would require characters that cannot be typed into some
438
- systems. The ability to transcribe a resource identifier from one
439
- medium to another has been considered more important than having a
440
- URI consist of the most meaningful of components.
441
-
442
- In local or regional contexts and with improving technology, users
443
- might benefit from being able to use a wider range of characters;
444
- such use is not defined by this specification. Percent-encoded
445
- octets (Section 2.1) may be used within a URI to represent characters
446
- outside the range of the US-ASCII coded character set if this
447
-
448
-
449
-
450
- Berners-Lee, et al. Standards Track [Page 8]
451
-
452
- RFC 3986 URI Generic Syntax January 2005
453
-
454
-
455
- representation is allowed by the scheme or by the protocol element in
456
- which the URI is referenced. Such a definition should specify the
457
- character encoding used to map those characters to octets prior to
458
- being percent-encoded for the URI.
459
-
460
- 1.2.2. Separating Identification from Interaction
461
-
462
- A common misunderstanding of URIs is that they are only used to refer
463
- to accessible resources. The URI itself only provides
464
- identification; access to the resource is neither guaranteed nor
465
- implied by the presence of a URI. Instead, any operation associated
466
- with a URI reference is defined by the protocol element, data format
467
- attribute, or natural language text in which it appears.
468
-
469
- Given a URI, a system may attempt to perform a variety of operations
470
- on the resource, as might be characterized by words such as "access",
471
- "update", "replace", or "find attributes". Such operations are
472
- defined by the protocols that make use of URIs, not by this
473
- specification. However, we do use a few general terms for describing
474
- common operations on URIs. URI "resolution" is the process of
475
- determining an access mechanism and the appropriate parameters
476
- necessary to dereference a URI; this resolution may require several
477
- iterations. To use that access mechanism to perform an action on the
478
- URI's resource is to "dereference" the URI.
479
-
480
- When URIs are used within information retrieval systems to identify
481
- sources of information, the most common form of URI dereference is
482
- "retrieval": making use of a URI in order to retrieve a
483
- representation of its associated resource. A "representation" is a
484
- sequence of octets, along with representation metadata describing
485
- those octets, that constitutes a record of the state of the resource
486
- at the time when the representation is generated. Retrieval is
487
- achieved by a process that might include using the URI as a cache key
488
- to check for a locally cached representation, resolution of the URI
489
- to determine an appropriate access mechanism (if any), and
490
- dereference of the URI for the sake of applying a retrieval
491
- operation. Depending on the protocols used to perform the retrieval,
492
- additional information might be supplied about the resource (resource
493
- metadata) and its relation to other resources.
494
-
495
- URI references in information retrieval systems are designed to be
496
- late-binding: the result of an access is generally determined when it
497
- is accessed and may vary over time or due to other aspects of the
498
- interaction. These references are created in order to be used in the
499
- future: what is being identified is not some specific result that was
500
- obtained in the past, but rather some characteristic that is expected
501
- to be true for future results. In such cases, the resource referred
502
- to by the URI is actually a sameness of characteristics as observed
503
-
504
-
505
-
506
- Berners-Lee, et al. Standards Track [Page 9]
507
-
508
- RFC 3986 URI Generic Syntax January 2005
509
-
510
-
511
- over time, perhaps elucidated by additional comments or assertions
512
- made by the resource provider.
513
-
514
- Although many URI schemes are named after protocols, this does not
515
- imply that use of these URIs will result in access to the resource
516
- via the named protocol. URIs are often used simply for the sake of
517
- identification. Even when a URI is used to retrieve a representation
518
- of a resource, that access might be through gateways, proxies,
519
- caches, and name resolution services that are independent of the
520
- protocol associated with the scheme name. The resolution of some
521
- URIs may require the use of more than one protocol (e.g., both DNS
522
- and HTTP are typically used to access an "http" URI's origin server
523
- when a representation isn't found in a local cache).
524
-
525
- 1.2.3. Hierarchical Identifiers
526
-
527
- The URI syntax is organized hierarchically, with components listed in
528
- order of decreasing significance from left to right. For some URI
529
- schemes, the visible hierarchy is limited to the scheme itself:
530
- everything after the scheme component delimiter (":") is considered
531
- opaque to URI processing. Other URI schemes make the hierarchy
532
- explicit and visible to generic parsing algorithms.
533
-
534
- The generic syntax uses the slash ("/"), question mark ("?"), and
535
- number sign ("#") characters to delimit components that are
536
- significant to the generic parser's hierarchical interpretation of an
537
- identifier. In addition to aiding the readability of such
538
- identifiers through the consistent use of familiar syntax, this
539
- uniform representation of hierarchy across naming schemes allows
540
- scheme-independent references to be made relative to that hierarchy.
541
-
542
- It is often the case that a group or "tree" of documents has been
543
- constructed to serve a common purpose, wherein the vast majority of
544
- URI references in these documents point to resources within the tree
545
- rather than outside it. Similarly, documents located at a particular
546
- site are much more likely to refer to other resources at that site
547
- than to resources at remote sites. Relative referencing of URIs
548
- allows document trees to be partially independent of their location
549
- and access scheme. For instance, it is possible for a single set of
550
- hypertext documents to be simultaneously accessible and traversable
551
- via each of the "file", "http", and "ftp" schemes if the documents
552
- refer to each other with relative references. Furthermore, such
553
- document trees can be moved, as a whole, without changing any of the
554
- relative references.
555
-
556
- A relative reference (Section 4.2) refers to a resource by describing
557
- the difference within a hierarchical name space between the reference
558
- context and the target URI. The reference resolution algorithm,
559
-
560
-
561
-
562
- Berners-Lee, et al. Standards Track [Page 10]
563
-
564
- RFC 3986 URI Generic Syntax January 2005
565
-
566
-
567
- presented in Section 5, defines how such a reference is transformed
568
- to the target URI. As relative references can only be used within
569
- the context of a hierarchical URI, designers of new URI schemes
570
- should use a syntax consistent with the generic syntax's hierarchical
571
- components unless there are compelling reasons to forbid relative
572
- referencing within that scheme.
573
-
574
- NOTE: Previous specifications used the terms "partial URI" and
575
- "relative URI" to denote a relative reference to a URI. As some
576
- readers misunderstood those terms to mean that relative URIs are a
577
- subset of URIs rather than a method of referencing URIs, this
578
- specification simply refers to them as relative references.
579
-
580
- All URI references are parsed by generic syntax parsers when used.
581
- However, because hierarchical processing has no effect on an absolute
582
- URI used in a reference unless it contains one or more dot-segments
583
- (complete path segments of "." or "..", as described in Section 3.3),
584
- URI scheme specifications can define opaque identifiers by
585
- disallowing use of slash characters, question mark characters, and
586
- the URIs "scheme:." and "scheme:..".
587
-
588
- 1.3. Syntax Notation
589
-
590
- This specification uses the Augmented Backus-Naur Form (ABNF)
591
- notation of [RFC2234], including the following core ABNF syntax rules
592
- defined by that specification: ALPHA (letters), CR (carriage return),
593
- DIGIT (decimal digits), DQUOTE (double quote), HEXDIG (hexadecimal
594
- digits), LF (line feed), and SP (space). The complete URI syntax is
595
- collected in Appendix A.
596
-
597
- 2. Characters
598
-
599
- The URI syntax provides a method of encoding data, presumably for the
600
- sake of identifying a resource, as a sequence of characters. The URI
601
- characters are, in turn, frequently encoded as octets for transport
602
- or presentation. This specification does not mandate any particular
603
- character encoding for mapping between URI characters and the octets
604
- used to store or transmit those characters. When a URI appears in a
605
- protocol element, the character encoding is defined by that protocol;
606
- without such a definition, a URI is assumed to be in the same
607
- character encoding as the surrounding text.
608
-
609
- The ABNF notation defines its terminal values to be non-negative
610
- integers (codepoints) based on the US-ASCII coded character set
611
- [ASCII]. Because a URI is a sequence of characters, we must invert
612
- that relation in order to understand the URI syntax. Therefore, the
613
- integer values used by the ABNF must be mapped back to their
624
- corresponding characters via US-ASCII in order to complete the syntax
625
- rules.
626
-
627
- A URI is composed from a limited set of characters consisting of
628
- digits, letters, and a few graphic symbols. A reserved subset of
629
- those characters may be used to delimit syntax components within a
630
- URI while the remaining characters, including both the unreserved set
631
- and those reserved characters not acting as delimiters, define each
632
- component's identifying data.
633
-
634
- 2.1. Percent-Encoding
635
-
636
- A percent-encoding mechanism is used to represent a data octet in a
637
- component when that octet's corresponding character is outside the
638
- allowed set or is being used as a delimiter of, or within, the
639
- component. A percent-encoded octet is encoded as a character
640
- triplet, consisting of the percent character "%" followed by the two
641
- hexadecimal digits representing that octet's numeric value. For
642
- example, "%20" is the percent-encoding for the binary octet
643
- "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space
644
- character (SP). Section 2.4 describes when percent-encoding and
645
- decoding is applied.
646
-
647
- pct-encoded = "%" HEXDIG HEXDIG
648
-
649
- The uppercase hexadecimal digits 'A' through 'F' are equivalent to
650
- the lowercase digits 'a' through 'f', respectively. If two URIs
651
- differ only in the case of hexadecimal digits used in percent-encoded
652
- octets, they are equivalent. For consistency, URI producers and
653
- normalizers should use uppercase hexadecimal digits for all percent-
654
- encodings.
655
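As an illustration of the case-equivalence rule above, the following Ruby sketch uppercases the hex digits of every pct-encoded triplet, so that two URIs differing only in hex case compare equal. The method name `normalize_pct_case` is an assumption made for this example and is not part of Addressable.

```ruby
# Hypothetical helper (not part of any library): uppercase the two hex digits
# of each "%" HEXDIG HEXDIG triplet, leaving every other character untouched.
def normalize_pct_case(uri_string)
  uri_string.gsub(/%\h\h/) { |triplet| triplet.upcase }
end

a = normalize_pct_case("http://example.com/?q=a%2bb")
b = normalize_pct_case("http://example.com/?q=a%2Bb")
puts a == b # the two spellings compare equal after normalization
```

Only well-formed triplets are touched; a stray "%" that is not followed by two hex digits is left as-is for the caller to handle.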
-
656
- 2.2. Reserved Characters
657
-
658
- URIs include components and subcomponents that are delimited by
659
- characters in the "reserved" set. These characters are called
660
- "reserved" because they may (or may not) be defined as delimiters by
661
- the generic syntax, by each scheme-specific syntax, or by the
662
- implementation-specific syntax of a URI's dereferencing algorithm.
663
- If data for a URI component would conflict with a reserved
664
- character's purpose as a delimiter, then the conflicting data must be
665
- percent-encoded before the URI is formed.
666
-
667
- reserved = gen-delims / sub-delims
680
-
681
- gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
682
-
683
- sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
684
- / "*" / "+" / "," / ";" / "="
685
-
686
- The purpose of reserved characters is to provide a set of delimiting
687
- characters that are distinguishable from other data within a URI.
688
- URIs that differ in the replacement of a reserved character with its
689
- corresponding percent-encoded octet are not equivalent. Percent-
690
- encoding a reserved character, or decoding a percent-encoded octet
691
- that corresponds to a reserved character, will change how the URI is
692
- interpreted by most applications. Thus, characters in the reserved
693
- set are protected from normalization and are therefore safe to be
694
- used by scheme-specific and producer-specific algorithms for
695
- delimiting data subcomponents within a URI.
696
-
697
- A subset of the reserved characters (gen-delims) is used as
698
- delimiters of the generic URI components described in Section 3. A
699
- component's ABNF syntax rule will not use the reserved or gen-delims
700
- rule names directly; instead, each syntax rule lists the characters
701
- allowed within that component (i.e., not delimiting it), and any of
702
- those characters that are also in the reserved set are "reserved" for
703
- use as subcomponent delimiters within the component. Only the most
704
- common subcomponents are defined by this specification; other
705
- subcomponents may be defined by a URI scheme's specification, or by
706
- the implementation-specific syntax of a URI's dereferencing
707
- algorithm, provided that such subcomponents are delimited by
708
- characters in the reserved set allowed within that component.
709
-
710
- URI producing applications should percent-encode data octets that
711
- correspond to characters in the reserved set unless these characters
712
- are specifically allowed by the URI scheme to represent data in that
713
- component. If a reserved character is found in a URI component and
714
- no delimiting role is known for that character, then it must be
715
- interpreted as representing the data octet corresponding to that
716
- character's encoding in US-ASCII.
717
-
718
- 2.3. Unreserved Characters
719
-
720
- Characters that are allowed in a URI but do not have a reserved
721
- purpose are called unreserved. These include uppercase and lowercase
722
- letters, decimal digits, hyphen, period, underscore, and tilde.
723
-
724
- unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
725
-
726
- URIs that differ in the replacement of an unreserved character with
736
- its corresponding percent-encoded US-ASCII octet are equivalent: they
737
- identify the same resource. However, URI comparison implementations
738
- do not always perform normalization prior to comparison (see Section
739
- 6). For consistency, percent-encoded octets in the ranges of ALPHA
740
- (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
741
- underscore (%5F), or tilde (%7E) should not be created by URI
742
- producers and, when found in a URI, should be decoded to their
743
- corresponding unreserved characters by URI normalizers.
744
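A normalizer that follows this recommendation could be sketched in Ruby as below. It decodes only triplets whose octet falls in the unreserved set and leaves every other triplet alone. The constant and method names are hypothetical, chosen for this example.

```ruby
# Hypothetical sketch: decode percent-encoded octets only when they
# correspond to an unreserved character (ALPHA / DIGIT / "-" / "." / "_" / "~").
UNRESERVED_CHAR = /\A[A-Za-z0-9\-._~]\z/

def decode_unreserved(uri_string)
  uri_string.gsub(/%(\h\h)/) do |triplet|
    char = Regexp.last_match(1).hex.chr
    # Reserved and non-ASCII octets stay percent-encoded.
    char.match?(UNRESERVED_CHAR) ? char : triplet
  end
end

puts decode_unreserved("/%7Euser/a%2fb") # "%7E" becomes "~", "%2f" is kept
```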
-
745
- 2.4. When to Encode or Decode
746
-
747
- Under normal circumstances, the only time when octets within a URI
748
- are percent-encoded is during the process of producing the URI from
749
- its component parts. This is when an implementation determines which
750
- of the reserved characters are to be used as subcomponent delimiters
751
- and which can be safely used as data. Once produced, a URI is always
752
- in its percent-encoded form.
753
-
754
- When a URI is dereferenced, the components and subcomponents
755
- significant to the scheme-specific dereferencing process (if any)
756
- must be parsed and separated before the percent-encoded octets within
757
- those components can be safely decoded, as otherwise the data may be
758
- mistaken for component delimiters. The only exception is for
759
- percent-encoded octets corresponding to characters in the unreserved
760
- set, which can be decoded at any time. For example, the octet
761
- corresponding to the tilde ("~") character is often encoded as "%7E"
762
- by older URI processing implementations; the "%7E" can be replaced by
763
- "~" without changing its interpretation.
764
-
765
- Because the percent ("%") character serves as the indicator for
766
- percent-encoded octets, it must be percent-encoded as "%25" for that
767
- octet to be used as data within a URI. Implementations must not
768
- percent-encode or decode the same string more than once, as decoding
769
- an already decoded string might lead to misinterpreting a percent
770
- data octet as the beginning of a percent-encoding, or vice versa in
771
- the case of percent-encoding an already percent-encoded string.
772
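The hazard of decoding twice can be shown with a small Ruby sketch. The helper `pct_decode_once` is hypothetical and handles ASCII octets only; it exists purely to make the failure visible.

```ruby
# Hypothetical, ASCII-only decoder used to demonstrate double decoding.
def pct_decode_once(str)
  str.gsub(/%(\h\h)/) { Regexp.last_match(1).hex.chr }
end

encoded = "%2541"               # the data "%41", with its "%" encoded as "%25"
once    = pct_decode_once(encoded)
twice   = pct_decode_once(once)

puts once  # the original data, a literal percent sign followed by "41"
puts twice # a second pass misreads that data as an encoding of "A"
```

The second pass silently changes the data, which is why a component should be decoded exactly once, after it has been separated from its delimiters.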
-
773
- 2.5. Identifying Data
774
-
775
- URI characters provide identifying data for each of the URI
776
- components, serving as an external interface for identification
777
- between systems. Although the presence and nature of the URI
778
- production interface is hidden from clients that use its URIs (and is
779
- thus beyond the scope of the interoperability requirements defined by
780
- this specification), it is a frequent source of confusion and errors
781
- in the interpretation of URI character issues. Implementers have to
782
- be aware that there are multiple character encodings involved in the
783
- production and transmission of URIs: local name and data encoding,
792
- public interface encoding, URI character encoding, data format
793
- encoding, and protocol encoding.
794
-
795
- Local names, such as file system names, are stored with a local
796
- character encoding. URI producing applications (e.g., origin
797
- servers) will typically use the local encoding as the basis for
798
- producing meaningful names. The URI producer will transform the
799
- local encoding to one that is suitable for a public interface and
800
- then transform the public interface encoding into the restricted set
801
- of URI characters (reserved, unreserved, and percent-encodings).
802
- Those characters are, in turn, encoded as octets to be used as a
803
- reference within a data format (e.g., a document charset), and such
804
- data formats are often subsequently encoded for transmission over
805
- Internet protocols.
806
-
807
- For most systems, an unreserved character appearing within a URI
808
- component is interpreted as representing the data octet corresponding
809
- to that character's encoding in US-ASCII. Consumers of URIs assume
810
- that the letter "X" corresponds to the octet "01011000", and even
811
- when that assumption is incorrect, there is no harm in making it. A
812
- system that internally provides identifiers in the form of a
813
- different character encoding, such as EBCDIC, will generally perform
814
- character translation of textual identifiers to UTF-8 [STD63] (or
815
- some other superset of the US-ASCII character encoding) at an
816
- internal interface, thereby providing more meaningful identifiers
817
- than those resulting from simply percent-encoding the original
818
- octets.
819
-
820
- For example, consider an information service that provides data,
821
- stored locally using an EBCDIC-based file system, to clients on the
822
- Internet through an HTTP server. When an author creates a file with
823
- the name "Laguna Beach" on that file system, the "http" URI
824
- corresponding to that resource is expected to contain the meaningful
825
- string "Laguna%20Beach". If, however, that server produces URIs by
826
- using an overly simplistic raw octet mapping, then the result would
827
- be a URI containing "%D3%81%87%A4%95%81@%C2%85%81%83%88". An
828
- internal transcoding interface fixes this problem by transcoding the
829
- local name to a superset of US-ASCII prior to producing the URI.
830
- Naturally, proper interpretation of an incoming URI on such an
831
- interface requires that percent-encoded octets be decoded (e.g.,
832
- "%20" to SP) before the reverse transcoding is applied to obtain the
833
- local name.
834
-
835
- In some cases, the internal interface between a URI component and the
836
- identifying data that it has been crafted to represent is much less
837
- direct than a character encoding translation. For example, portions
838
- of a URI might reflect a query on non-ASCII data, or numeric
839
- coordinates on a map. Likewise, a URI scheme may define components
848
- with additional encoding requirements that are applied prior to
849
- forming the component and producing the URI.
850
-
851
- When a new URI scheme defines a component that represents textual
852
- data consisting of characters from the Universal Character Set [UCS],
853
- the data should first be encoded as octets according to the UTF-8
854
- character encoding [STD63]; then only those octets that do not
855
- correspond to characters in the unreserved set should be percent-
856
- encoded. For example, the character A would be represented as "A",
857
- the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
858
- as "%C3%80", and the character KATAKANA LETTER A would be represented
859
- as "%E3%82%A2".
860
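The two-step procedure described above (encode as UTF-8, then percent-encode every octet outside the unreserved set) might look like this in Ruby. The method name `encode_utf8_component` is an assumption for illustration, not an existing API.

```ruby
# Hypothetical sketch: UTF-8 encode the text, then percent-encode each
# octet that does not correspond to an unreserved character.
def encode_utf8_component(text)
  text.encode("UTF-8").bytes.map do |octet|
    char = octet.chr
    if char.match?(/\A[A-Za-z0-9\-._~]\z/)
      char
    else
      format("%%%02X", octet) # uppercase hex digits
    end
  end.join
end

puts encode_utf8_component("A")      # LATIN CAPITAL LETTER A
puts encode_utf8_component("\u00C0") # LATIN CAPITAL LETTER A WITH GRAVE
puts encode_utf8_component("\u30A2") # KATAKANA LETTER A
```

The three calls reproduce the examples in the paragraph above: "A", "%C3%80", and "%E3%82%A2".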
-
861
- 3. Syntax Components
862
-
863
- The generic URI syntax consists of a hierarchical sequence of
864
- components referred to as the scheme, authority, path, query, and
865
- fragment.
866
-
867
- URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
868
-
869
- hier-part = "//" authority path-abempty
870
- / path-absolute
871
- / path-rootless
872
- / path-empty
873
-
874
- The scheme and path components are required, though the path may be
875
- empty (no characters). When authority is present, the path must
876
- either be empty or begin with a slash ("/") character. When
877
- authority is not present, the path cannot begin with two slash
878
- characters ("//"). These restrictions result in five different ABNF
879
- rules for a path (Section 3.3), only one of which will match any
880
- given URI reference.
881
-
882
- The following are two example URIs and their component parts:
883
-
884
- foo://example.com:8042/over/there?name=ferret#nose
885
- \_/ \______________/\_________/ \_________/ \__/
886
- | | | | |
887
- scheme authority path query fragment
888
- | _____________________|__
889
- / \ / \
890
- urn:example:animal:ferret:nose
891
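The breakdown in the diagram above can be reproduced with a non-validating split. The Ruby sketch below uses the component-splitting regular expression given in Appendix B of the specification; the constant and method names are hypothetical.

```ruby
# Non-validating split into the five generic components.
# The pattern follows the regular expression from Appendix B of the
# specification; URI_PARTS and split_uri are names invented for this sketch.
URI_PARTS = %r{\A(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?\z}m

def split_uri(uri_string)
  m = URI_PARTS.match(uri_string)
  {
    scheme:    m[2],
    authority: m[4],
    path:      m[5],
    query:     m[7],
    fragment:  m[9]
  }
end

p split_uri("foo://example.com:8042/over/there?name=ferret#nose")
p split_uri("urn:example:animal:ferret:nose")
```

A component that is absent comes back as `nil`, whereas the path is always a string, possibly empty.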
-
892
- 3.1. Scheme
904
-
905
- Each URI begins with a scheme name that refers to a specification for
906
- assigning identifiers within that scheme. As such, the URI syntax is
907
- a federated and extensible naming system wherein each scheme's
908
- specification may further restrict the syntax and semantics of
909
- identifiers using that scheme.
910
-
911
- Scheme names consist of a sequence of characters beginning with a
912
- letter and followed by any combination of letters, digits, plus
913
- ("+"), period ("."), or hyphen ("-"). Although schemes are case-
914
- insensitive, the canonical form is lowercase and documents that
915
- specify schemes must do so with lowercase letters. An implementation
916
- should accept uppercase letters as equivalent to lowercase in scheme
917
- names (e.g., allow "HTTP" as well as "http") for the sake of
918
- robustness but should only produce lowercase scheme names for
919
- consistency.
920
-
921
- scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
922
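A direct translation of the scheme rule into Ruby, accepting either case on input and producing lowercase on output, could look like the following. `SCHEME_PATTERN` and `normalize_scheme` are hypothetical names for this sketch.

```ruby
# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_PATTERN = /\A[A-Za-z][A-Za-z0-9+\-.]*\z/

# Hypothetical helper: validate a scheme name and return its canonical
# (lowercase) form; raise if the name does not match the grammar.
def normalize_scheme(scheme)
  unless SCHEME_PATTERN.match?(scheme)
    raise ArgumentError, "invalid scheme: #{scheme.inspect}"
  end
  scheme.downcase
end

puts normalize_scheme("HTTP")    # accepted, produced in lowercase
puts normalize_scheme("svn+ssh") # "+" is allowed after the first letter
```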
-
923
- Individual schemes are not specified by this document. The process
924
- for registration of new URI schemes is defined separately by [BCP35].
925
- The scheme registry maintains the mapping between scheme names and
926
- their specifications. Advice for designers of new URI schemes can be
927
- found in [RFC2718]. URI scheme specifications must define their own
928
- syntax so that all strings matching their scheme-specific syntax will
929
- also match the <absolute-URI> grammar, as described in Section 4.3.
930
-
931
- When presented with a URI that violates one or more scheme-specific
932
- restrictions, the scheme-specific resolution process should flag the
933
- reference as an error rather than ignore the unused parts; doing so
934
- reduces the number of equivalent URIs and helps detect abuses of the
935
- generic syntax, which might indicate that the URI has been
936
- constructed to mislead the user (Section 7.6).
937
-
938
- 3.2. Authority
939
-
940
- Many URI schemes include a hierarchical element for a naming
941
- authority so that governance of the name space defined by the
942
- remainder of the URI is delegated to that authority (which may, in
943
- turn, delegate it further). The generic syntax provides a common
944
- means for distinguishing an authority based on a registered name or
945
- server address, along with optional port and user information.
946
-
947
- The authority component is preceded by a double slash ("//") and is
948
- terminated by the next slash ("/"), question mark ("?"), or number
949
- sign ("#") character, or by the end of the URI.
950
-
951
- authority = [ userinfo "@" ] host [ ":" port ]
960
-
961
- URI producers and normalizers should omit the ":" delimiter that
962
- separates host from port if the port component is empty. Some
963
- schemes do not allow the userinfo and/or port subcomponents.
964
-
965
- If a URI contains an authority component, then the path component
966
- must either be empty or begin with a slash ("/") character. Non-
967
- validating parsers (those that merely separate a URI reference into
968
- its major components) will often ignore the subcomponent structure of
969
- authority, treating it as an opaque string from the double-slash to
970
- the first terminating delimiter, until such time as the URI is
971
- dereferenced.
972
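When the subcomponent structure is eventually needed, an authority string can be separated along the lines of the rule above. The Ruby sketch below is non-validating and assumes its input is already a well-formed authority; `split_authority` is a hypothetical name.

```ruby
# Hypothetical, non-validating split of
#   authority = [ userinfo "@" ] host [ ":" port ]
def split_authority(authority)
  userinfo, at_sign, host_port = authority.rpartition("@")
  userinfo = nil if at_sign.empty? # no "@" means no userinfo at all

  # A port is whatever run of digits follows the final ":"; a bracketed
  # IP literal such as "[::1]" ends in "]" and so is not split here.
  if (m = /\A(.*):([0-9]*)\z/m.match(host_port))
    host, port = m[1], m[2]
  else
    host, port = host_port, nil
  end

  { userinfo: userinfo, host: host, port: port }
end

p split_authority("user@example.com:8042")
p split_authority("[::1]:80")
```

An empty port (a trailing ":") is returned as an empty string, so a producer can tell it apart from a missing port and omit the delimiter when re-serializing.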
-
973
- 3.2.1. User Information
974
-
975
- The userinfo subcomponent may consist of a user name and, optionally,
976
- scheme-specific information about how to gain authorization to access
977
- the resource. The user information, if present, is followed by a
978
- commercial at-sign ("@") that delimits it from the host.
979
-
980
- userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
981
-
982
- Use of the format "user:password" in the userinfo field is
983
- deprecated. Applications should not render as clear text any data
984
- after the first colon (":") character found within a userinfo
985
- subcomponent unless the data after the colon is the empty string
986
- (indicating no password). Applications may choose to ignore or
987
- reject such data when it is received as part of a reference and
988
- should reject the storage of such data in unencrypted form. The
989
- passing of authentication information in clear text has proven to be
990
- a security risk in almost every case where it has been used.
991
-
992
- Applications that render a URI for the sake of user feedback, such as
993
- in graphical hypertext browsing, should render userinfo in a way that
994
- is distinguished from the rest of a URI, when feasible. Such
995
- rendering will assist the user in cases where the userinfo has been
996
- misleadingly crafted to look like a trusted domain name
997
- (Section 7.6).
998
-
999
- 3.2.2. Host
1000
-
1001
- The host subcomponent of authority is identified by an IP literal
1002
- encapsulated within square brackets, an IPv4 address in dotted-
1003
- decimal form, or a registered name. The host subcomponent is case-
1004
- insensitive. The presence of a host subcomponent within a URI does
1005
- not imply that the scheme requires access to the given host on the
1006
- Internet. In many cases, the host syntax is used only for the sake
1007
- of reusing the existing registration process created and deployed for
1016
- DNS, thus obtaining a globally unique name without the cost of
1017
- deploying another registry. However, such use comes with its own
1018
- costs: domain name ownership may change over time for reasons not
1019
- anticipated by the URI producer. In other cases, the data within the
1020
- host component identifies a registered name that has nothing to do
1021
- with an Internet host. We use the name "host" for the ABNF rule
1022
- because that is its most common purpose, not its only purpose.
1023
-
1024
- host = IP-literal / IPv4address / reg-name
1025
-
1026
- The syntax rule for host is ambiguous because it does not completely
1027
- distinguish between an IPv4address and a reg-name. In order to
1028
- disambiguate the syntax, we apply the "first-match-wins" algorithm:
1029
- If host matches the rule for IPv4address, then it should be
1030
- considered an IPv4 address literal and not a reg-name. Although host
1031
- is case-insensitive, producers and normalizers should use lowercase
1032
- for registered names and hexadecimal addresses for the sake of
1033
- uniformity, while only using uppercase letters for percent-encodings.
1034
-
1035
- A host identified by an Internet Protocol literal address, version 6
1036
- [RFC3513] or later, is distinguished by enclosing the IP literal
1037
- within square brackets ("[" and "]"). This is the only place where
1038
- square bracket characters are allowed in the URI syntax. In
1039
- anticipation of future, as-yet-undefined IP literal address formats,
1040
- an implementation may use an optional version flag to indicate such a
1041
- format explicitly rather than rely on heuristic determination.
1042
-
1043
- IP-literal = "[" ( IPv6address / IPvFuture ) "]"
1044
-
1045
- IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
1046
-
1047
- The version flag does not indicate the IP version; rather, it
1048
- indicates future versions of the literal format. As such,
1049
- implementations must not provide the version flag for the existing
1050
- IPv4 and IPv6 literal address forms described below. If a URI
1051
- containing an IP-literal that starts with "v" (case-insensitive),
1052
- indicating that the version flag is present, is dereferenced by an
1053
- application that does not know the meaning of that version flag, then
1054
- the application should return an appropriate error for "address
1055
- mechanism not supported".
1056
-
1057
- A host identified by an IPv6 literal address is represented inside
1058
- the square brackets without a preceding version flag. The ABNF
1059
- provided here is a translation of the text definition of an IPv6
1060
- literal address provided in [RFC3513]. This syntax does not support
1061
- IPv6 scoped addressing zone identifiers.
1062
-
1063
- A 128-bit IPv6 address is divided into eight 16-bit pieces. Each
1072
- piece is represented numerically in case-insensitive hexadecimal,
1073
- using one to four hexadecimal digits (leading zeroes are permitted).
1074
- The eight encoded pieces are given most-significant first, separated
1075
- by colon characters. Optionally, the least-significant two pieces
1076
- may instead be represented in IPv4 address textual format. A
1077
- sequence of one or more consecutive zero-valued 16-bit pieces within
1078
- the address may be elided, omitting all their digits and leaving
1079
- exactly two consecutive colons in their place to mark the elision.
1080
-
1081
- IPv6address = 6( h16 ":" ) ls32
1082
- / "::" 5( h16 ":" ) ls32
1083
- / [ h16 ] "::" 4( h16 ":" ) ls32
1084
- / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
1085
- / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
1086
- / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
1087
- / [ *4( h16 ":" ) h16 ] "::" ls32
1088
- / [ *5( h16 ":" ) h16 ] "::" h16
1089
- / [ *6( h16 ":" ) h16 ] "::"
1090
-
1091
- ls32 = ( h16 ":" h16 ) / IPv4address
1092
- ; least-significant 32 bits of address
1093
-
1094
- h16 = 1*4HEXDIG
1095
- ; 16 bits of address represented in hexadecimal
1096
-
1097
- A host identified by an IPv4 literal address is represented in
1098
- dotted-decimal notation (a sequence of four decimal numbers in the
1099
- range 0 to 255, separated by "."), as described in [RFC1123] by
1100
- reference to [RFC0952]. Note that other forms of dotted notation may
1101
- be interpreted on some platforms, as described in Section 7.4, but
1102
- only the dotted-decimal form of four octets is allowed by this
1103
- grammar.
1104
-
1105
- IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
1106
-
1107
- dec-octet = DIGIT ; 0-9
1108
- / %x31-39 DIGIT ; 10-99
1109
- / "1" 2DIGIT ; 100-199
1110
- / "2" %x30-34 DIGIT ; 200-249
1111
- / "25" %x30-35 ; 250-255
1112
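The IPv4address and dec-octet rules translate almost mechanically into a regular expression. The Ruby sketch below is one such translation; the constant and method names are hypothetical. It also makes the "first-match-wins" check easy: a host that matches is treated as an IPv4 literal, not as a reg-name.

```ruby
# dec-octet alternatives, listed from longest to shortest form:
#   250-255 / 200-249 / 100-199 / 10-99 / 0-9
DEC_OCTET    = /25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9]/
IPV4_ADDRESS = /\A(?:#{DEC_OCTET})(?:\.(?:#{DEC_OCTET})){3}\z/

# Hypothetical helper: true when the host matches the IPv4address rule.
def ipv4_literal?(host)
  IPV4_ADDRESS.match?(host)
end

puts ipv4_literal?("192.168.0.1") # four octets in range
puts ipv4_literal?("256.1.1.1")   # 256 is out of range
puts ipv4_literal?("01.2.3.4")    # leading zero is not a dec-octet
```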
-
1113
- A host identified by a registered name is a sequence of characters
1114
- usually intended for lookup within a locally defined host or service
1115
- name registry, though the URI's scheme-specific semantics may require
1116
- that a specific registry (or fixed name table) be used instead. The
1117
- most common name registry mechanism is the Domain Name System (DNS).
1118
- A registered name intended for lookup in the DNS uses the syntax
1119
- defined in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123].
1128
- Such a name consists of a sequence of domain labels separated by ".",
1129
- each domain label starting and ending with an alphanumeric character
1130
- and possibly also containing "-" characters. The rightmost domain
1131
- label of a fully qualified domain name in DNS may be followed by a
1132
- single "." and should be if it is necessary to distinguish between
1133
- the complete domain name and some local domain.
1134
-
1135
- reg-name = *( unreserved / pct-encoded / sub-delims )
1136
-
1137
- If the URI scheme defines a default for host, then that default
1138
- applies when the host subcomponent is undefined or when the
1139
- registered name is empty (zero length). For example, the "file" URI
1140
- scheme is defined so that no authority, an empty host, and
1141
- "localhost" all mean the end-user's machine, whereas the "http"
1142
- scheme considers a missing authority or empty host invalid.
1143
-
1144
- This specification does not mandate a particular registered name
1145
- lookup technology and therefore does not restrict the syntax of reg-
1146
- name beyond what is necessary for interoperability. Instead, it
1147
- delegates the issue of registered name syntax conformance to the
1148
- operating system of each application performing URI resolution, and
1149
- that operating system decides what it will allow for the purpose of
1150
- host identification. A URI resolution implementation might use DNS,
1151
- host tables, yellow pages, NetInfo, WINS, or any other system for
1152
- lookup of registered names. However, a globally scoped naming
1153
- system, such as DNS fully qualified domain names, is necessary for
1154
- URIs intended to have global scope. URI producers should use names
1155
- that conform to the DNS syntax, even when use of DNS is not
1156
- immediately apparent, and should limit these names to no more than
1157
- 255 characters in length.
1158
-
1159
- The reg-name syntax allows percent-encoded octets in order to
1160
- represent non-ASCII registered names in a uniform way that is
1161
- independent of the underlying name resolution technology. Non-ASCII
1162
- characters must first be encoded according to UTF-8 [STD63], and then
1163
- each octet of the corresponding UTF-8 sequence must be percent-
1164
- encoded to be represented as URI characters. URI producing
1165
- applications must not use percent-encoding in host unless it is used
1166
- to represent a UTF-8 character sequence. When a non-ASCII registered
1167
- name represents an internationalized domain name intended for
1168
- resolution via the DNS, the name must be transformed to the IDNA
1169
- encoding [RFC3490] prior to name lookup. URI producers should
1170
- provide these registered names in the IDNA encoding, rather than a
1171
- percent-encoding, if they wish to maximize interoperability with
1172
- legacy URI resolvers.
1173
-
1174
- 3.2.3. Port
1184
-
1185
- The port subcomponent of authority is designated by an optional port
1186
- number in decimal following the host and delimited from it by a
1187
- single colon (":") character.
1188
-
1189
- port = *DIGIT
1190
-
1191
- A scheme may define a default port. For example, the "http" scheme
1192
- defines a default port of "80", corresponding to its reserved TCP
1193
- port number. The type of port designated by the port number (e.g.,
1194
- TCP, UDP, SCTP) is defined by the URI scheme. URI producers and
1195
- normalizers should omit the port component and its ":" delimiter if
1196
- port is empty or if its value would be the same as that of the
1197
- scheme's default.
1198
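The port normalization rule can be sketched in Ruby as follows. The `DEFAULT_PORTS` table and the method name are assumptions for this example; the table lists only the "http" default mentioned above, and a real implementation would need one entry per supported scheme.

```ruby
# Hypothetical table of scheme defaults; only "http" is taken from the text.
DEFAULT_PORTS = { "http" => 80 }.freeze

# Hypothetical helper: serialize host and port, omitting the port and its
# ":" delimiter when the port is empty or equals the scheme's default.
def host_with_normalized_port(scheme, host, port)
  port = port.to_s
  if port.empty? || port.to_i == DEFAULT_PORTS[scheme.downcase]
    host
  else
    "#{host}:#{port}"
  end
end

puts host_with_normalized_port("http", "example.com", "80")   # default dropped
puts host_with_normalized_port("http", "example.com", "8042") # port kept
```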
-
1199
- 3.3. Path
1200
-
1201
- The path component contains data, usually organized in hierarchical
1202
- form, that, along with data in the non-hierarchical query component
1203
- (Section 3.4), serves to identify a resource within the scope of the
1204
- URI's scheme and naming authority (if any). The path is terminated
1205
- by the first question mark ("?") or number sign ("#") character, or
1206
- by the end of the URI.
1207
-
1208
- If a URI contains an authority component, then the path component
1209
- must either be empty or begin with a slash ("/") character. If a URI
1210
- does not contain an authority component, then the path cannot begin
1211
- with two slash characters ("//"). In addition, a URI reference
1212
- (Section 4.1) may be a relative-path reference, in which case the
1213
- first path segment cannot contain a colon (":") character. The ABNF
1214
- requires five separate rules to disambiguate these cases, only one of
1215
- which will match the path substring within a given URI reference. We
1216
- use the generic term "path component" to describe the URI substring
1217
- matched by the parser to one of these rules.
1218
-
1219
- path = path-abempty ; begins with "/" or is empty
1220
- / path-absolute ; begins with "/" but not "//"
1221
- / path-noscheme ; begins with a non-colon segment
1222
- / path-rootless ; begins with a segment
1223
- / path-empty ; zero characters
1224
-
1225
- path-abempty = *( "/" segment )
1226
- path-absolute = "/" [ segment-nz *( "/" segment ) ]
1227
- path-noscheme = segment-nz-nc *( "/" segment )
1228
- path-rootless = segment-nz *( "/" segment )
1229
- path-empty = 0<pchar>
1230
-
1231
- segment = *pchar
1240
- segment-nz = 1*pchar
1241
- segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
1242
- ; non-zero-length segment without any colon ":"
1243
-
1244
- pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
1245
-
1246
- A path consists of a sequence of path segments separated by a slash
1247
- ("/") character. A path is always defined for a URI, though the
1248
- defined path may be empty (zero length). Use of the slash character
1249
- to indicate hierarchy is only required when a URI will be used as the
1250
- context for relative references. For example, the URI
1251
- <mailto:fred@example.com> has a path of "fred@example.com", whereas
1252
- the URI <foo://info.example.com?fred> has an empty path.
1253
-
1254
- The path segments "." and "..", also known as dot-segments, are
1255
- defined for relative reference within the path name hierarchy. They
1256
- are intended for use at the beginning of a relative-path reference
1257
- (Section 4.2) to indicate relative position within the hierarchical
1258
- tree of names. This is similar to their role within some operating
1259
- systems' file directory structures to indicate the current directory
1260
- and parent directory, respectively. However, unlike in a file
1261
- system, these dot-segments are only interpreted within the URI path
1262
- hierarchy and are removed as part of the resolution process (Section
1263
- 5.2).
1264
-
1265
- Aside from dot-segments in hierarchical paths, a path segment is
1266
- considered opaque by the generic syntax. URI producing applications
1267
- often use the reserved characters allowed in a segment to delimit
1268
- scheme-specific or dereference-handler-specific subcomponents. For
1269
- example, the semicolon (";") and equals ("=") reserved characters are
1270
- often used to delimit parameters and parameter values applicable to
1271
- that segment. The comma (",") reserved character is often used for
1272
- similar purposes. For example, one URI producer might use a segment
1273
- such as "name;v=1.1" to indicate a reference to version 1.1 of
1274
- "name", whereas another might use a segment such as "name,1.1" to
1275
- indicate the same. Parameter types may be defined by scheme-specific
1276
- semantics, but in most cases the syntax of a parameter is specific to
1277
- the implementation of the URI's dereferencing algorithm.
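As a rough illustration of the semicolon convention described above, the following Ruby sketch splits a segment such as "name;v=1.1" into a name and its parameters. The helper name `split_segment_params` is hypothetical, and the generic syntax does not mandate this reading of a segment; a particular scheme or application may interpret the same characters differently.

```ruby
# Hypothetical helper: splits one path segment on ";" and each parameter
# on its first "=". This mirrors a common convention only; the generic
# syntax itself treats the segment as opaque.
def split_segment_params(segment)
  name, *params = segment.split(";")
  [name.to_s, params.map { |param| param.split("=", 2) }]
end

name, params = split_segment_params("name;v=1.1")
# name   => "name"
# params => [["v", "1.1"]]
```

A producer that uses the comma form ("name,1.1") would need a different splitter, which is the point of the paragraph above: the meaning of these delimiters is not fixed by the generic syntax.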
1278
-
1279
- 3.4. Query
1280
-
1281
- The query component contains non-hierarchical data that, along with
1282
- data in the path component (Section 3.3), serves to identify a
1283
- resource within the scope of the URI's scheme and naming authority
1284
- (if any). The query component is indicated by the first question
1285
- mark ("?") character and terminated by a number sign ("#") character
1286
- or by the end of the URI.
1287
-
1295
- query = *( pchar / "/" / "?" )
1296
-
1297
- The characters slash ("/") and question mark ("?") may represent data
1298
- within the query component. Beware that some older, erroneous
1299
- implementations may not handle such data correctly when it is used as
1300
- the base URI for relative references (Section 5.1), apparently
1301
- because they fail to distinguish query data from path data when
1302
- looking for hierarchical separators. However, as query components
1303
- are often used to carry identifying information in the form of
1304
- "key=value" pairs and one frequently used value is a reference to
1305
- another URI, it is sometimes better for usability to avoid percent-
1306
- encoding those characters.
1307
-
1308
- 3.5. Fragment
1309
-
1310
- The fragment identifier component of a URI allows indirect
1311
- identification of a secondary resource by reference to a primary
1312
- resource and additional identifying information. The identified
1313
- secondary resource may be some portion or subset of the primary
1314
- resource, some view on representations of the primary resource, or
1315
- some other resource defined or described by those representations. A
1316
- fragment identifier component is indicated by the presence of a
1317
- number sign ("#") character and terminated by the end of the URI.
1318
-
1319
- fragment = *( pchar / "/" / "?" )
1320
-
1321
- The semantics of a fragment identifier are defined by the set of
1322
- representations that might result from a retrieval action on the
1323
- primary resource. The fragment's format and resolution is therefore
1324
- dependent on the media type [RFC2046] of a potentially retrieved
1325
- representation, even though such a retrieval is only performed if the
1326
- URI is dereferenced. If no such representation exists, then the
1327
- semantics of the fragment are considered unknown and are effectively
1328
- unconstrained. Fragment identifier semantics are independent of the
1329
- URI scheme and thus cannot be redefined by scheme specifications.
1330
-
1331
- Individual media types may define their own restrictions on or
1332
- structures within the fragment identifier syntax for specifying
1333
- different types of subsets, views, or external references that are
1334
- identifiable as secondary resources by that media type. If the
1335
- primary resource has multiple representations, as is often the case
1336
- for resources whose representation is selected based on attributes of
1337
- the retrieval request (a.k.a., content negotiation), then whatever is
1338
- identified by the fragment should be consistent across all of those
1339
- representations. Each representation should either define the
1340
- fragment so that it corresponds to the same secondary resource,
1341
- regardless of how it is represented, or should leave the fragment
1342
- undefined (i.e., not found).
1343
-
1351
- As with any URI, use of a fragment identifier component does not
1352
- imply that a retrieval action will take place. A URI with a fragment
1353
- identifier may be used to refer to the secondary resource without any
1354
- implication that the primary resource is accessible or will ever be
1355
- accessed.
1356
-
1357
- Fragment identifiers have a special role in information retrieval
1358
- systems as the primary form of client-side indirect referencing,
1359
- allowing an author to specifically identify aspects of an existing
1360
- resource that are only indirectly provided by the resource owner. As
1361
- such, the fragment identifier is not used in the scheme-specific
1362
- processing of a URI; instead, the fragment identifier is separated
1363
- from the rest of the URI prior to a dereference, and thus the
1364
- identifying information within the fragment itself is dereferenced
1365
- solely by the user agent, regardless of the URI scheme. Although
1366
- this separate handling is often perceived to be a loss of
1367
- information, particularly for accurate redirection of references as
1368
- resources move over time, it also serves to prevent information
1369
- providers from denying reference authors the right to refer to
1370
- information within a resource selectively. Indirect referencing also
1371
- provides additional flexibility and extensibility to systems that use
1372
- URIs, as new media types are easier to define and deploy than new
1373
- schemes of identification.
1374
-
1375
- The characters slash ("/") and question mark ("?") are allowed to
1376
- represent data within the fragment identifier. Beware that some
1377
- older, erroneous implementations may not handle this data correctly
1378
- when it is used as the base URI for relative references (Section
1379
- 5.1).
1380
-
1381
- 4. Usage
1382
-
1383
- When applications make reference to a URI, they do not always use the
1384
- full form of reference defined by the "URI" syntax rule. To save
1385
- space and take advantage of hierarchical locality, many Internet
1386
- protocol elements and media type formats allow an abbreviation of a
1387
- URI, whereas others restrict the syntax to a particular form of URI.
1388
- We define the most common forms of reference syntax in this
1389
- specification because they impact and depend upon the design of the
1390
- generic syntax, requiring a uniform parsing algorithm in order to be
1391
- interpreted consistently.
1392
-
1393
- 4.1. URI Reference
1394
-
1395
- URI-reference is used to denote the most common usage of a resource
1396
- identifier.
1397
-
1398
- URI-reference = URI / relative-ref
1399
-
1407
- A URI-reference is either a URI or a relative reference. If the
1408
- URI-reference's prefix does not match the syntax of a scheme followed
1409
- by its colon separator, then the URI-reference is a relative
1410
- reference.
1411
-
1412
- A URI-reference is typically parsed first into the five URI
1413
- components, in order to determine what components are present and
1414
- whether the reference is relative. Then, each component is parsed
1415
- for its subparts and their validation. The ABNF of URI-reference,
1416
- along with the "first-match-wins" disambiguation rule, is sufficient
1417
- to define a validating parser for the generic syntax. Readers
1418
- familiar with regular expressions should see Appendix B for an
1419
- example of a non-validating URI-reference parser that will take any
1420
- given string and extract the URI components.
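The non-validating approach mentioned above can be sketched in Ruby with the regular expression given in Appendix B of the specification. The constant and method names below are illustrative choices, not part of any library. Components whose delimiter is absent come back as nil (undefined), while the path is always a string, possibly empty.

```ruby
# Regular expression from RFC 3986, Appendix B. It accepts any string,
# so it extracts components without validating them.
URI_REFERENCE_PATTERN =
  %r{\A(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}

# Illustrative helper: returns the five components of a URI reference.
# nil means the component is undefined; the path is never nil.
def split_uri_reference(reference)
  m = URI_REFERENCE_PATTERN.match(reference)
  { scheme: m[2], authority: m[4], path: m[5], query: m[7], fragment: m[9] }
end

parts = split_uri_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related")
# parts[:scheme]    => "http"
# parts[:authority] => "www.ics.uci.edu"
# parts[:path]      => "/pub/ietf/uri/"
# parts[:query]     => nil
# parts[:fragment]  => "Related"
```

A reference is relative exactly when `parts[:scheme]` is nil, which matches the prefix test described above.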
1421
-
1422
- 4.2. Relative Reference
1423
-
1424
- A relative reference takes advantage of the hierarchical syntax
1425
- (Section 1.2.3) to express a URI reference relative to the name space
1426
- of another hierarchical URI.
1427
-
1428
- relative-ref = relative-part [ "?" query ] [ "#" fragment ]
1429
-
1430
- relative-part = "//" authority path-abempty
1431
- / path-absolute
1432
- / path-noscheme
1433
- / path-empty
1434
-
1435
- The URI referred to by a relative reference, also known as the target
1436
- URI, is obtained by applying the reference resolution algorithm of
1437
- Section 5.
1438
-
1439
- A relative reference that begins with two slash characters is termed
1440
- a network-path reference; such references are rarely used. A
1441
- relative reference that begins with a single slash character is
1442
- termed an absolute-path reference. A relative reference that does
1443
- not begin with a slash character is termed a relative-path reference.
1444
-
1445
- A path segment that contains a colon character (e.g., "this:that")
1446
- cannot be used as the first segment of a relative-path reference, as
1447
- it would be mistaken for a scheme name. Such a segment must be
1448
- preceded by a dot-segment (e.g., "./this:that") to make a relative-
1449
- path reference.
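The three kinds of relative reference, and the colon rule for the first segment, can be sketched in Ruby as follows. Both method names are hypothetical, and the classifier assumes its input has already been identified as a relative reference (that is, it has no scheme).

```ruby
# Hypothetical classifier for a string already known to be a relative
# reference (no scheme prefix).
def relative_reference_kind(reference)
  if reference.start_with?("//")
    :network_path
  elsif reference.start_with?("/")
    :absolute_path
  else
    :relative_path
  end
end

# Hypothetical helper: prefixes "./" when the first segment of a
# relative path contains a colon, so that the segment is not mistaken
# for a scheme name.
def protect_first_segment(path)
  first_segment = path.split("/", 2).first.to_s
  first_segment.include?(":") ? "./#{path}" : path
end

relative_reference_kind("//example.com/x")  # => :network_path
protect_first_segment("this:that")          # => "./this:that"
```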
1450
-
1463
- 4.3. Absolute URI
1464
-
1465
- Some protocol elements allow only the absolute form of a URI without
1466
- a fragment identifier. For example, defining a base URI for later
1467
- use by relative references calls for an absolute-URI syntax rule that
1468
- does not allow a fragment.
1469
-
1470
- absolute-URI = scheme ":" hier-part [ "?" query ]
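A loose check for the absolute form can be sketched in Ruby. The method name `absolute_uri?` is hypothetical, and the check is deliberately non-validating: it only confirms that the string starts with something shaped like a scheme (a letter followed by letters, digits, "+", "-", or ".", then a colon) and that no fragment is present. It does not validate the hier-part or the query.

```ruby
# Hypothetical, non-validating check: a scheme-shaped prefix followed
# by ":" and no "#" anywhere in the string.
def absolute_uri?(string)
  string.match?(/\A[A-Za-z][A-Za-z0-9+.\-]*:/) && !string.include?("#")
end

absolute_uri?("http://a/b/c/d;p?q")   # => true
absolute_uri?("http://a/b/c/d;p#s")   # => false
```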
1471
-
1472
- URI scheme specifications must define their own syntax so that all
1473
- strings matching their scheme-specific syntax will also match the
1474
- <absolute-URI> grammar. Scheme specifications will not define
1475
- fragment identifier syntax or usage, regardless of its applicability
1476
- to resources identifiable via that scheme, as fragment identification
1477
- is orthogonal to scheme definition. However, scheme specifications
1478
- are encouraged to include a wide range of examples, including
1479
- examples that show use of the scheme's URIs with fragment identifiers
1480
- when such usage is appropriate.
1481
-
1482
- 4.4. Same-Document Reference
1483
-
1484
- When a URI reference refers to a URI that is, aside from its fragment
1485
- component (if any), identical to the base URI (Section 5.1), that
1486
- reference is called a "same-document" reference. The most frequent
1487
- examples of same-document references are relative references that are
1488
- empty or include only the number sign ("#") separator followed by a
1489
- fragment identifier.
1490
-
1491
- When a same-document reference is dereferenced for a retrieval
1492
- action, the target of that reference is defined to be within the same
1493
- entity (representation, document, or message) as the reference;
1494
- therefore, a dereference should not result in a new retrieval action.
1495
-
1496
- Normalization of the base and target URIs prior to their comparison,
1497
- as described in Sections 6.2.2 and 6.2.3, is allowed but rarely
1498
- performed in practice. Normalization may increase the set of same-
1499
- document references, which may be of benefit to some caching
1500
- applications. As such, reference authors should not assume that a
1501
- slightly different, though equivalent, reference URI will (or will
1502
- not) be interpreted as a same-document reference by any given
1503
- application.
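A minimal Ruby sketch of the same-document test follows. The method name is hypothetical, and it assumes the reference has already been resolved to its target URI. It compares the two strings character by character after dropping any fragment and performs no normalization, which reflects the common practice noted above rather than a requirement.

```ruby
# Hypothetical helper: true when the target URI and the base URI are
# identical apart from any fragment. No normalization is applied.
def same_document?(target_uri, base_uri)
  without_fragment = ->(uri) { uri.sub(/#.*\z/m, "") }
  without_fragment.call(target_uri) == without_fragment.call(base_uri)
end

same_document?("http://example.com/doc?x=1#sec2", "http://example.com/doc?x=1")
# => true
```

Because no normalization is done, "HTTP://example.com/doc" and "http://example.com/doc" would not be treated as the same document by this sketch, even though they are equivalent.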
1504
-
1505
- 4.5. Suffix Reference
1506
-
1507
- The URI syntax is designed for unambiguous reference to resources and
1508
- extensibility via the URI scheme. However, as URI identification and
1509
- usage have become commonplace, traditional media (television, radio,
1510
- newspapers, billboards, etc.) have increasingly used a suffix of the
1519
- URI as a reference, consisting of only the authority and path
1520
- portions of the URI, such as
1521
-
1522
- www.w3.org/Addressing/
1523
-
1524
- or simply a DNS registered name on its own. Such references are
1525
- primarily intended for human interpretation rather than for machines,
1526
- with the assumption that context-based heuristics are sufficient to
1527
- complete the URI (e.g., most registered names beginning with "www"
1528
- are likely to have a URI prefix of "http://"). Although there is no
1529
- standard set of heuristics for disambiguating a URI suffix, many
1530
- client implementations allow them to be entered by the user and
1531
- heuristically resolved.
1532
-
1533
- Although this practice of using suffix references is common, it
1534
- should be avoided whenever possible and should never be used in
1535
- situations where long-term references are expected. The heuristics
1536
- noted above will change over time, particularly when a new URI scheme
1537
- becomes popular, and are often incorrect when used out of context.
1538
- Furthermore, they can lead to security issues along the lines of
1539
- those described in [RFC1535].
1540
-
1541
- As a URI suffix has the same syntax as a relative-path reference, a
1542
- suffix reference cannot be used in contexts where a relative
1543
- reference is expected. As a result, suffix references are limited to
1544
- places where there is no defined base URI, such as dialog boxes and
1545
- off-line advertisements.
1546
-
1547
- 5. Reference Resolution
1548
-
1549
- This section defines the process of resolving a URI reference within
1550
- a context that allows relative references so that the result is a
1551
- string matching the <URI> syntax rule of Section 3.
1552
-
1553
- 5.1. Establishing a Base URI
1554
-
1555
- The term "relative" implies that a "base URI" exists against which
1556
- the relative reference is applied. Aside from fragment-only
1557
- references (Section 4.4), relative references are only usable when a
1558
- base URI is known. A base URI must be established by the parser
1559
- prior to parsing URI references that might be relative. A base URI
1560
- must conform to the <absolute-URI> syntax rule (Section 4.3). If the
1561
- base URI is obtained from a URI reference, then that reference must
1562
- be converted to absolute form and stripped of any fragment component
1563
- prior to its use as a base URI.
1564
-
1575
- The base URI of a reference can be established in one of four ways,
1576
- discussed below in order of precedence. The order of precedence can
1577
- be thought of in terms of layers, where the innermost defined base
1578
- URI has the highest precedence. This can be visualized graphically
1579
- as follows:
1580
-
1581
- .----------------------------------------------------------.
1581
- |  .----------------------------------------------------.  |
1582
- |  |  .----------------------------------------------.  |  |
1583
- |  |  |  .----------------------------------------.  |  |  |
1584
- |  |  |  |  .----------------------------------.  |  |  |  |
1585
- |  |  |  |  |       <relative-reference>       |  |  |  |  |
1586
- |  |  |  |  `----------------------------------'  |  |  |  |
1587
- |  |  |  | (5.1.1) Base URI embedded in content   |  |  |  |
1588
- |  |  |  `----------------------------------------'  |  |  |
1589
- |  |  | (5.1.2) Base URI of the encapsulating entity |  |  |
1590
- |  |  |         (message, representation, or none)   |  |  |
1591
- |  |  `----------------------------------------------'  |  |
1592
- |  | (5.1.3) URI used to retrieve the entity            |  |
1593
- |  `----------------------------------------------------'  |
1594
- | (5.1.4) Default Base URI (application-dependent)         |
1595
- `----------------------------------------------------------'
1597
-
1598
- 5.1.1. Base URI Embedded in Content
1599
-
1600
- Within certain media types, a base URI for relative references can be
1601
- embedded within the content itself so that it can be readily obtained
1602
- by a parser. This can be useful for descriptive documents, such as
1603
- tables of contents, which may be transmitted to others through
1604
- protocols other than their usual retrieval context (e.g., email or
1605
- USENET news).
1606
-
1607
- It is beyond the scope of this specification to specify how, for each
1608
- media type, a base URI can be embedded. The appropriate syntax, when
1609
- available, is described by the data format specification associated
1610
- with each media type.
1611
-
1612
- 5.1.2. Base URI from the Encapsulating Entity
1613
-
1614
- If no base URI is embedded, the base URI is defined by the
1615
- representation's retrieval context. For a document that is enclosed
1616
- within another entity, such as a message or archive, the retrieval
1617
- context is that entity. Thus, the default base URI of a
1618
- representation is the base URI of the entity in which the
1619
- representation is encapsulated.
1620
-
1631
- A mechanism for embedding a base URI within MIME container types
1632
- (e.g., the message and multipart types) is defined by MHTML
1633
- [RFC2557]. Protocols that do not use the MIME message header syntax,
1634
- but that do allow some form of tagged metadata to be included within
1635
- messages, may define their own syntax for defining a base URI as part
1636
- of a message.
1637
-
1638
- 5.1.3. Base URI from the Retrieval URI
1639
-
1640
- If no base URI is embedded and the representation is not encapsulated
1641
- within some other entity, then, if a URI was used to retrieve the
1642
- representation, that URI shall be considered the base URI. Note that
1643
- if the retrieval was the result of a redirected request, the last URI
1644
- used (i.e., the URI that resulted in the actual retrieval of the
1645
- representation) is the base URI.
1646
-
1647
- 5.1.4. Default Base URI
1648
-
1649
- If none of the conditions described above apply, then the base URI is
1650
- defined by the context of the application. As this definition is
1651
- necessarily application-dependent, failing to define a base URI by
1652
- using one of the other methods may result in the same content being
1653
- interpreted differently by different types of applications.
1654
-
1655
- A sender of a representation containing relative references is
1656
- responsible for ensuring that a base URI for those references can be
1657
- established. Aside from fragment-only references, relative
1658
- references can only be used reliably in situations where the base URI
1659
- is well defined.
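The order of precedence across Sections 5.1.1 to 5.1.4 can be sketched in Ruby as a simple "first defined wins" lookup. The method name and its keyword arguments are hypothetical; how each candidate is actually obtained depends on the media type, the enclosing entity, and the application. The sketch also strips any fragment, since a base URI must not carry one.

```ruby
# Hypothetical helper: picks the innermost defined base URI, in the
# order 5.1.1 (embedded), 5.1.2 (encapsulating entity), 5.1.3
# (retrieval URI), 5.1.4 (application default). Returns nil when no
# base URI is available.
def establish_base_uri(embedded: nil, encapsulating: nil, retrieval: nil, default: nil)
  base = [embedded, encapsulating, retrieval, default].compact.first
  base && base.sub(/#.*\z/m, "")
end

establish_base_uri(retrieval: "http://example.com/doc#top",
                   default: "http://fallback.example/")
# => "http://example.com/doc"
```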
1660
-
1661
- 5.2. Relative Resolution
1662
-
1663
- This section describes an algorithm for converting a URI reference
1664
- that might be relative to a given base URI into the parsed components
1665
- of the reference's target. The components can then be recomposed, as
1666
- described in Section 5.3, to form the target URI. This algorithm
1667
- provides definitive results that can be used to test the output of
1668
- other implementations. Applications may implement relative reference
1669
- resolution by using some other algorithm, provided that the results
1670
- match what would be given by this one.
1671
-
1687
- 5.2.1. Pre-parse the Base URI
1688
-
1689
- The base URI (Base) is established according to the procedure of
1690
- Section 5.1 and parsed into the five main components described in
1691
- Section 3. Note that only the scheme component is required to be
1692
- present in a base URI; the other components may be empty or
1693
- undefined. A component is undefined if its associated delimiter does
1694
- not appear in the URI reference; the path component is never
1695
- undefined, though it may be empty.
1696
-
1697
- Normalization of the base URI, as described in Sections 6.2.2 and
1698
- 6.2.3, is optional. A URI reference must be transformed to its
1699
- target URI before it can be normalized.
1700
-
1701
- 5.2.2. Transform References
1702
-
1703
- For each URI reference (R), the following pseudocode describes an
1704
- algorithm for transforming R into its target URI (T):
1705
-
1706
- -- The URI reference is parsed into the five URI components
1707
- --
1708
- (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R);
1709
-
1710
- -- A non-strict parser may ignore a scheme in the reference
1711
- -- if it is identical to the base URI's scheme.
1712
- --
1713
- if ((not strict) and (R.scheme == Base.scheme)) then
1714
- undefine(R.scheme);
1715
- endif;
1716
-
1743
- if defined(R.scheme) then
1744
- T.scheme = R.scheme;
1745
- T.authority = R.authority;
1746
- T.path = remove_dot_segments(R.path);
1747
- T.query = R.query;
1748
- else
1749
- if defined(R.authority) then
1750
- T.authority = R.authority;
1751
- T.path = remove_dot_segments(R.path);
1752
- T.query = R.query;
1753
- else
1754
- if (R.path == "") then
1755
- T.path = Base.path;
1756
- if defined(R.query) then
1757
- T.query = R.query;
1758
- else
1759
- T.query = Base.query;
1760
- endif;
1761
- else
1762
- if (R.path starts-with "/") then
1763
- T.path = remove_dot_segments(R.path);
1764
- else
1765
- T.path = merge(Base.path, R.path);
1766
- T.path = remove_dot_segments(T.path);
1767
- endif;
1768
- T.query = R.query;
1769
- endif;
1770
- T.authority = Base.authority;
1771
- endif;
1772
- T.scheme = Base.scheme;
1773
- endif;
1774
-
1775
- T.fragment = R.fragment;
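The pseudocode above can be sketched in Ruby as follows. This is one possible rendering, not a reference implementation: the names `Ref`, `parse_ref`, `merge_paths`, `recompose`, and `resolve` are illustrative, parsing uses the non-validating regular expression from Appendix B, and the merge, dot-segment removal, and recomposition helpers follow Sections 5.2.3, 5.2.4, and 5.3. A nil component stands for "undefined".

```ruby
# Components of a parsed URI reference; nil means "undefined".
Ref = Struct.new(:scheme, :authority, :path, :query, :fragment)

APPENDIX_B = %r{\A(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}

def parse_ref(string)
  m = APPENDIX_B.match(string)
  Ref.new(m[2], m[4], m[5], m[7], m[9])
end

# Section 5.2.3: merge a relative-path reference with the base path.
def merge_paths(base, ref_path)
  return "/#{ref_path}" if base.authority && base.path.empty?
  last_slash = base.path.rindex("/")
  (last_slash ? base.path[0..last_slash] : "") + ref_path
end

# Section 5.2.4: the two-buffer method, steps 2A to 2E.
def remove_dot_segments(path)
  input = path.dup
  output = +""
  until input.empty?
    if input.sub!(%r{\A\.\.?/}, "")            # 2A: drop leading "../" or "./"
    elsif input.sub!(%r{\A/\.(/|\z)}, "/")     # 2B: "/./" or "/." becomes "/"
    elsif input.sub!(%r{\A/\.\.(/|\z)}, "/")   # 2C: "/../" or "/.." becomes "/"
      output.sub!(%r{/?[^/]*\z}, "")           #     and the last output segment goes
    elsif input == "." || input == ".."        # 2D
      input.clear
    else                                       # 2E: move one segment across
      output << input.slice!(%r{\A/?[^/]*})
    end
  end
  output
end

# Section 5.3: undefined (nil) components contribute nothing.
def recompose(uri)
  result = +""
  result << uri.scheme << ":" if uri.scheme
  result << "//" << uri.authority if uri.authority
  result << uri.path
  result << "?" << uri.query if uri.query
  result << "#" << uri.fragment if uri.fragment
  result
end

# Section 5.2.2: transform reference R into target T against Base.
def resolve(base_string, reference, strict: true)
  base = parse_ref(base_string)
  r = parse_ref(reference)
  r.scheme = nil if !strict && r.scheme == base.scheme
  t = Ref.new
  if r.scheme
    t.scheme, t.authority, t.query = r.scheme, r.authority, r.query
    t.path = remove_dot_segments(r.path)
  else
    if r.authority
      t.authority, t.query = r.authority, r.query
      t.path = remove_dot_segments(r.path)
    else
      if r.path.empty?
        t.path = base.path
        t.query = r.query.nil? ? base.query : r.query
      else
        merged = r.path.start_with?("/") ? r.path : merge_paths(base, r.path)
        t.path = remove_dot_segments(merged)
        t.query = r.query
      end
      t.authority = base.authority
    end
    t.scheme = base.scheme
  end
  t.fragment = r.fragment
  recompose(t)
end

resolve("http://a/b/c/d;p?q", "../g")   # => "http://a/b/g"
resolve("http://a/b/c/d;p?q", "?y")     # => "http://a/b/c/d;p?y"
```

Passing `strict: false` enables the backward-compatible behavior in which a reference scheme equal to the base scheme is ignored.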
1776
-
1777
- 5.2.3. Merge Paths
1778
-
1779
- The pseudocode above refers to a "merge" routine for merging a
1780
- relative-path reference with the path of the base URI. This is
1781
- accomplished as follows:
1782
-
1783
- o If the base URI has a defined authority component and an empty
1784
- path, then return a string consisting of "/" concatenated with the
1785
- reference's path; otherwise,
1786
-
1799
- o return a string consisting of the reference's path component
1800
- appended to all but the last segment of the base URI's path (i.e.,
1801
- excluding any characters after the right-most "/" in the base URI
1802
- path, or excluding the entire base URI path if it does not contain
1803
- any "/" characters).
1804
-
1805
- 5.2.4. Remove Dot Segments
1806
-
1807
- The pseudocode also refers to a "remove_dot_segments" routine for
1808
- interpreting and removing the special "." and ".." complete path
1809
- segments from a referenced path. This is done after the path is
1810
- extracted from a reference, whether or not the path was relative, in
1811
- order to remove any invalid or extraneous dot-segments prior to
1812
- forming the target URI. Although there are many ways to accomplish
1813
- this removal process, we describe a simple method using two string
1814
- buffers.
1815
-
1816
- 1. The input buffer is initialized with the now-appended path
1817
- components and the output buffer is initialized to the empty
1818
- string.
1819
-
1820
- 2. While the input buffer is not empty, loop as follows:
1821
-
1822
- A. If the input buffer begins with a prefix of "../" or "./",
1823
- then remove that prefix from the input buffer; otherwise,
1824
-
1825
- B. if the input buffer begins with a prefix of "/./" or "/.",
1826
- where "." is a complete path segment, then replace that
1827
- prefix with "/" in the input buffer; otherwise,
1828
-
1829
- C. if the input buffer begins with a prefix of "/../" or "/..",
1830
- where ".." is a complete path segment, then replace that
1831
- prefix with "/" in the input buffer and remove the last
1832
- segment and its preceding "/" (if any) from the output
1833
- buffer; otherwise,
1834
-
1835
- D. if the input buffer consists only of "." or "..", then remove
1836
- that from the input buffer; otherwise,
1837
-
1838
- E. move the first path segment in the input buffer to the end of
1839
- the output buffer, including the initial "/" character (if
1840
- any) and any subsequent characters up to, but not including,
1841
- the next "/" character or the end of the input buffer.
1842
-
1843
- 3. Finally, the output buffer is returned as the result of
1844
- remove_dot_segments.
1845
-
1855
- Note that dot-segments are intended for use in URI references to
1856
- express an identifier relative to the hierarchy of names in the base
1857
- URI. The remove_dot_segments algorithm respects that hierarchy by
1858
- removing extra dot-segments rather than treat them as an error or
1859
- leaving them to be misinterpreted by dereference implementations.
1860
-
1861
- The following illustrates how the above steps are applied for two
1862
- examples of merged paths, showing the state of the two buffers after
1863
- each step.
1864
-
1865
- STEP   OUTPUT BUFFER         INPUT BUFFER
1866
-
1867
- 1 :                          /a/b/c/./../../g
1868
- 2E:    /a                    /b/c/./../../g
1869
- 2E:    /a/b                  /c/./../../g
1870
- 2E:    /a/b/c                /./../../g
1871
- 2B:    /a/b/c                /../../g
1872
- 2C:    /a/b                  /../g
1873
- 2C:    /a                    /g
1874
- 2E:    /a/g
1875
-
1876
- STEP   OUTPUT BUFFER         INPUT BUFFER
1877
-
1878
- 1 :                          mid/content=5/../6
1879
- 2E:    mid                   /content=5/../6
1880
- 2E:    mid/content=5         /../6
1881
- 2C:    mid                   /6
1882
- 2E:    mid/6
1883
-
1884
- Some applications may find it more efficient to implement the
1885
- remove_dot_segments algorithm by using two segment stacks rather than
1886
- strings.
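The two traces illustrated above can be reproduced with a small Ruby sketch of the string-buffer method that records the state of both buffers after each step. The method name `trace_remove_dot_segments` is hypothetical; the step labels follow the numbering used in Section 5.2.4.

```ruby
# Hypothetical tracer: returns [step, output buffer, input buffer] rows,
# one per step of the two-buffer algorithm.
def trace_remove_dot_segments(path)
  input = path.dup
  output = +""
  rows = [["1", "", input.dup]]
  until input.empty?
    step =
      if input.sub!(%r{\A\.\.?/}, "")
        "2A"
      elsif input.sub!(%r{\A/\.(/|\z)}, "/")
        "2B"
      elsif input.sub!(%r{\A/\.\.(/|\z)}, "/")
        output.sub!(%r{/?[^/]*\z}, "")   # drop last segment and its "/"
        "2C"
      elsif input == "." || input == ".."
        input.clear
        "2D"
      else
        output << input.slice!(%r{\A/?[^/]*})
        "2E"
      end
    rows << [step, output.dup, input.dup]
  end
  rows
end

trace_remove_dot_segments("mid/content=5/../6").each do |step, out, inp|
  puts format("%-3s %-22s %s", step, out, inp)
end
```

The final row's output buffer ("mid/6" for this input) is the value that remove_dot_segments would return.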
1887
-
1888
- Note: Beware that some older, erroneous implementations will fail
1889
- to separate a reference's query component from its path component
1890
- prior to merging the base and reference paths, resulting in an
1891
- interoperability failure if the query component contains the
1892
- strings "/../" or "/./".
1893
-
1911
- 5.3. Component Recomposition
1912
-
1913
- Parsed URI components can be recomposed to obtain the corresponding
1914
- URI reference string. Using pseudocode, this would be:
1915
-
1916
- result = ""
1917
-
1918
- if defined(scheme) then
1919
- append scheme to result;
1920
- append ":" to result;
1921
- endif;
1922
-
1923
- if defined(authority) then
1924
- append "//" to result;
1925
- append authority to result;
1926
- endif;
1927
-
1928
- append path to result;
1929
-
1930
- if defined(query) then
1931
- append "?" to result;
1932
- append query to result;
1933
- endif;
1934
-
1935
- if defined(fragment) then
1936
- append "#" to result;
1937
- append fragment to result;
1938
- endif;
1939
-
1940
- return result;
1941
-
1942
- Note that we are careful to preserve the distinction between a
1943
- component that is undefined, meaning that its separator was not
1944
- present in the reference, and a component that is empty, meaning that
1945
- the separator was present and was immediately followed by the next
1946
- component separator or the end of the reference.
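The distinction between an undefined and an empty component can be made concrete with a Ruby sketch of the recomposition pseudocode. The method name and keyword arguments are illustrative; nil stands for "undefined" and the empty string for "empty".

```ruby
# Illustrative recomposition: nil components are skipped, empty ones
# still emit their separator.
def recompose(scheme: nil, authority: nil, path: "", query: nil, fragment: nil)
  result = +""
  result << scheme << ":" if scheme
  result << "//" << authority if authority
  result << path
  result << "?" << query if query
  result << "#" << fragment if fragment
  result
end

recompose(scheme: "http", authority: "example.com", path: "/")
# => "http://example.com/"
recompose(scheme: "http", authority: "example.com", path: "/", query: "")
# => "http://example.com/?"
recompose(scheme: "file", authority: "", path: "/etc/hosts")
# => "file:///etc/hosts"
```

Testing for nil rather than for emptiness is what keeps the trailing "?" and the empty authority in the second and third calls.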
1947
-
1948
- 5.4. Reference Resolution Examples
1949
-
1950
- Within a representation with a well defined base URI of
1951
-
1952
- http://a/b/c/d;p?q
1953
-
1954
- a relative reference is transformed to its target URI as follows.
1955
-
1967
- 5.4.1. Normal Examples
1968
-
1969
- "g:h" = "g:h"
1970
- "g" = "http://a/b/c/g"
1971
- "./g" = "http://a/b/c/g"
1972
- "g/" = "http://a/b/c/g/"
1973
- "/g" = "http://a/g"
1974
- "//g" = "http://g"
1975
- "?y" = "http://a/b/c/d;p?y"
1976
- "g?y" = "http://a/b/c/g?y"
1977
- "#s" = "http://a/b/c/d;p?q#s"
1978
- "g#s" = "http://a/b/c/g#s"
1979
- "g?y#s" = "http://a/b/c/g?y#s"
1980
- ";x" = "http://a/b/c/;x"
1981
- "g;x" = "http://a/b/c/g;x"
1982
- "g;x?y#s" = "http://a/b/c/g;x?y#s"
1983
- "" = "http://a/b/c/d;p?q"
1984
- "." = "http://a/b/c/"
1985
- "./" = "http://a/b/c/"
1986
- ".." = "http://a/b/"
1987
- "../" = "http://a/b/"
1988
- "../g" = "http://a/b/g"
1989
- "../.." = "http://a/"
1990
- "../../" = "http://a/"
1991
- "../../g" = "http://a/g"
1992
-
1993
- 5.4.2. Abnormal Examples
1994
-
1995
- Although the following abnormal examples are unlikely to occur in
1996
- normal practice, all URI parsers should be capable of resolving them
1997
- consistently. Each example uses the same base as that above.
1998
-
1999
- Parsers must be careful in handling cases where there are more ".."
2000
- segments in a relative-path reference than there are hierarchical
2001
- levels in the base URI's path. Note that the ".." syntax cannot be
2002
- used to change the authority component of a URI.
2003
-
2004
- "../../../g" = "http://a/g"
2005
- "../../../../g" = "http://a/g"
2006
-
2023
- Similarly, parsers must remove the dot-segments "." and ".." when
2024
- they are complete components of a path, but not when they are only
2025
- part of a segment.
2026
-
2027
- "/./g" = "http://a/g"
2028
- "/../g" = "http://a/g"
2029
- "g." = "http://a/b/c/g."
2030
- ".g" = "http://a/b/c/.g"
2031
- "g.." = "http://a/b/c/g.."
2032
- "..g" = "http://a/b/c/..g"
2033
-
2034
- Less likely are cases where the relative reference uses unnecessary
2035
- or nonsensical forms of the "." and ".." complete path segments.
2036
-
2037
- "./../g" = "http://a/b/g"
2038
- "./g/." = "http://a/b/c/g/"
2039
- "g/./h" = "http://a/b/c/g/h"
2040
- "g/../h" = "http://a/b/c/h"
2041
- "g;x=1/./y" = "http://a/b/c/g;x=1/y"
2042
- "g;x=1/../y" = "http://a/b/c/y"
2043
-
2044
- Some applications fail to separate the reference's query and/or
2045
- fragment components from the path component before merging it with
2046
- the base path and removing dot-segments. This error is rarely
2047
- noticed, as typical usage of a fragment never includes the hierarchy
2048
- ("/") character and the query component is not normally used within
2049
- relative references.
2050
-
2051
- "g?y/./x" = "http://a/b/c/g?y/./x"
2052
- "g?y/../x" = "http://a/b/c/g?y/../x"
2053
- "g#s/./x" = "http://a/b/c/g#s/./x"
2054
- "g#s/../x" = "http://a/b/c/g#s/../x"
2055
-
2056
- Some parsers allow the scheme name to be present in a relative
2057
- reference if it is the same as the base URI scheme. This is
2058
- considered to be a loophole in prior specifications of partial URI
2059
- [RFC1630]. Its use should be avoided but is allowed for backward
2060
- compatibility.
2061
-
2062
- "http:g" = "http:g" ; for strict parsers
2063
- / "http://a/b/c/g" ; for backward compatibility
2064
-
2079
- 6. Normalization and Comparison
2080
-
2081
- One of the most common operations on URIs is simple comparison:
2082
- determining whether two URIs are equivalent without using the URIs to
2083
- access their respective resource(s). A comparison is performed every
2084
- time a response cache is accessed, a browser checks its history to
2085
- color a link, or an XML parser processes tags within a namespace.
2086
- Extensive normalization prior to comparison of URIs is often used by
2087
- spiders and indexing engines to prune a search space or to reduce
2088
- duplication of request actions and response storage.
2089
-
2090
- URI comparison is performed for some particular purpose. Protocols
2091
- or implementations that compare URIs for different purposes will
2092
- often be subject to differing design trade-offs in regards to how
2093
- much effort should be spent in reducing aliased identifiers. This
2094
- section describes various methods that may be used to compare URIs,
2095
- the trade-offs between them, and the types of applications that might
2096
- use them.
2097
-
2098
- 6.1. Equivalence
2099
-
2100
- Because URIs exist to identify resources, presumably they should be
2101
- considered equivalent when they identify the same resource. However,
2102
- this definition of equivalence is not of much practical use, as there
2103
- is no way for an implementation to compare two resources unless it
2104
- has full knowledge or control of them. For this reason,
2105
- determination of equivalence or difference of URIs is based on string
2106
- comparison, perhaps augmented by reference to additional rules
2107
- provided by URI scheme definitions. We use the terms "different" and
2108
- "equivalent" to describe the possible outcomes of such comparisons,
2109
- but there are many application-dependent versions of equivalence.
2110
-
2111
- Even though it is possible to determine that two URIs are equivalent,
2112
- URI comparison is not sufficient to determine whether two URIs
2113
- identify different resources. For example, an owner of two different
2114
- domain names could decide to serve the same resource from both,
2115
- resulting in two different URIs. Therefore, comparison methods are
2116
- designed to minimize false negatives while strictly avoiding false
2117
- positives.
2118
-
2119
- In testing for equivalence, applications should not directly compare
2120
- relative references; the references should be converted to their
2121
- respective target URIs before comparison. When URIs are compared to
2122
- select (or avoid) a network action, such as retrieval of a
2123
- representation, fragment components (if any) should be excluded from
2124
- the comparison.
2125
-
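As a small illustration of the last point, the sketch below compares two URIs with their fragments removed. The method names are hypothetical, both inputs are assumed to be target URIs already (not relative references), and the comparison itself is plain string equality with no further normalization.

```ruby
# Drop the fragment: everything from the first "#" onward.
def without_fragment(uri)
  uri.split("#", 2).first.to_s
end

# True when two URIs would select the same network action under
# simple string comparison, ignoring any fragment.
def same_network_target?(a, b)
  without_fragment(a) == without_fragment(b)
end

puts same_network_target?("http://example.com/doc#intro",
                          "http://example.com/doc#summary")
```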
2135
- 6.2. Comparison Ladder
2136
-
2137
- A variety of methods are used in practice to test URI equivalence.
2138
- These methods fall into a range, distinguished by the amount of
2139
- processing required and the degree to which the probability of false
2140
- negatives is reduced. As noted above, false negatives cannot be
2141
- eliminated. In practice, their probability can be reduced, but this
2142
- reduction requires more processing and is not cost-effective for all
2143
- applications.
2144
-
2145
- If this range of comparison practices is considered as a ladder, the
2146
- following discussion will climb the ladder, starting with practices
2147
- that are cheap but have a relatively higher chance of producing false
2148
- negatives, and proceeding to those that have higher computational
2149
- cost and lower risk of false negatives.
2150
-
2151
- 6.2.1. Simple String Comparison
2152
-
2153
- If two URIs, when considered as character strings, are identical,
2154
- then it is safe to conclude that they are equivalent. This type of
2155
- equivalence test has very low computational cost and is in wide use
2156
- in a variety of applications, particularly in the domain of parsing.
2157
-
2158
- Testing strings for equivalence requires some basic precautions.
2159
- This procedure is often referred to as "bit-for-bit" or
2160
- "byte-for-byte" comparison, which is potentially misleading. Testing
2161
- strings for equality is normally based on pair comparison of the
2162
- characters that make up the strings, starting from the first and
2163
- proceeding until both strings are exhausted and all characters are
2164
- found to be equal, until a pair of characters compares unequal, or
2165
- until one of the strings is exhausted before the other.
2166
-
2167
- This character comparison requires that each pair of characters be
2168
- put in comparable form. For example, should one URI be stored in a
2169
- byte array in EBCDIC encoding and the second in a Java String object
2170
- (UTF-16), bit-for-bit comparisons applied naively will produce
2171
- errors. It is better to speak of equality on a character-for-
2172
- character basis rather than on a byte-for-byte or bit-for-bit basis.
2173
- In practical terms, character-by-character comparisons should be done
2174
- codepoint-by-codepoint after conversion to a common character
2175
- encoding.
2176
-
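The difference between byte-for-byte and character-for-character comparison can be shown with a short sketch. UTF-16LE stands in for the second encoding here, and `same_characters?` is an illustrative name; `String#encode` raises an error if the input cannot be converted, which this sketch does not handle.

```ruby
# Compare two strings codepoint-by-codepoint after converting both
# to a common character encoding (UTF-8 in this sketch).
def same_characters?(a, b)
  a.encode(Encoding::UTF_8).codepoints == b.encode(Encoding::UTF_8).codepoints
end

utf8  = "http://example.com/"
utf16 = utf8.encode(Encoding::UTF_16LE)

puts utf8.bytes == utf16.bytes       # => false (different byte sequences)
puts same_characters?(utf8, utf16)   # => true  (same characters)
```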
2177
- False negatives are caused by the production and use of URI aliases.
2178
- Unnecessary aliases can be reduced, regardless of the comparison
2179
- method, by consistently providing URI references in an already-
2180
- normalized form (i.e., a form identical to what would be produced
2181
- after normalization is applied, as described below).
2182
-
2191
- Protocols and data formats often limit some URI comparisons to simple
2192
- string comparison, based on the theory that people and
2193
- implementations will, in their own best interest, be consistent in
2194
- providing URI references, or at least consistent enough to negate any
2195
- efficiency that might be obtained from further normalization.
2196
-
2197
- 6.2.2. Syntax-Based Normalization
2198
-
2199
- Implementations may use logic based on the definitions provided by
2200
- this specification to reduce the probability of false negatives.
2201
- This processing is moderately higher in cost than character-for-
2202
- character string comparison. For example, an application using this
2203
- approach could reasonably consider the following two URIs equivalent:
2204
-
2205
- example://a/b/c/%7Bfoo%7D
2206
- eXAMPLE://a/./b/../b/%63/%7bfoo%7d
2207
-
2208
- Web user agents, such as browsers, typically apply this type of URI
2209
- normalization when determining whether a cached response is
2210
- available. Syntax-based normalization includes such techniques as
2211
- case normalization, percent-encoding normalization, and removal of
2212
- dot-segments.
2213
-
2214
- 6.2.2.1. Case Normalization
2215
-
2216
- For all URIs, the hexadecimal digits within a percent-encoding
2217
- triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
2218
- should be normalized to use uppercase letters for the digits A-F.
2219
-
2220
- When a URI uses components of the generic syntax, the component
2221
- syntax equivalence rules always apply; namely, that the scheme and
2222
- host are case-insensitive and therefore should be normalized to
2223
- lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is
2224
- equivalent to <http://www.example.com/>. The other generic syntax
2225
- components are assumed to be case-sensitive unless specifically
2226
- defined otherwise by the scheme (see Section 6.2.3).
2227
-
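A minimal sketch of case normalization follows. It lowercases the scheme and, when an authority is present, the host; it then uppercases the hexadecimal digits of every percent-encoding triplet. The name `normalize_case` is illustrative, and the regular expression is a simplification of the generic syntax, not a validating parser.

```ruby
def normalize_case(uri)
  # Lowercase the scheme and the host. Any userinfo (up to "@") is kept
  # as-is, and a URI without "//" has no host to lowercase.
  lowered = uri.sub(%r{\A([^:/?#]+):(?:(//(?:[^/?#@]*@)?)([^/?#]*))?}) do
    "#{$1.downcase}:#{$2}#{$3.to_s.downcase}"
  end
  # Uppercase the hex digits of each percent-encoding triplet.
  lowered.gsub(/%[0-9a-fA-F]{2}/) { |triplet| triplet.upcase }
end

puts normalize_case("HTTP://www.EXAMPLE.com/")
puts normalize_case("eXAMPLE://a/b/%7bfoo%7d")
```

The host is lowercased before the triplets are uppercased so that a percent-encoded octet inside a registered name still ends up with uppercase hex digits.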
2228
- 6.2.2.2. Percent-Encoding Normalization
2229
-
2230
- The percent-encoding mechanism (Section 2.1) is a frequent source of
2231
- variance among otherwise identical URIs. In addition to the case
2232
- normalization issue noted above, some URI producers percent-encode
2233
- octets that do not require percent-encoding, resulting in URIs that
2234
- are equivalent to their non-encoded counterparts. These URIs should
2235
- be normalized by decoding any percent-encoded octet that corresponds
2236
- to an unreserved character, as described in Section 2.3.
2237
-
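The rule can be sketched as follows: each percent-encoded octet is decoded only when it corresponds to an unreserved character; every other triplet is kept and, as a convenience, given uppercase hex digits. The names `UNRESERVED` and `decode_unreserved` are illustrative.

```ruby
# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = /\A[A-Za-z0-9\-._~]\z/

def decode_unreserved(uri)
  uri.gsub(/%([0-9a-fA-F]{2})/) do |triplet|
    code = $1.hex
    char = code.chr
    # Only ASCII octets can be unreserved characters.
    if code < 128 && char =~ UNRESERVED
      char
    else
      triplet.upcase
    end
  end
end

puts decode_unreserved("example://a/b/%63/%7bfoo%7d")
```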
2247
- 6.2.2.3. Path Segment Normalization
2248
-
2249
- The complete path segments "." and ".." are intended only for use
2250
- within relative references (Section 4.1) and are removed as part of
2251
- the reference resolution process (Section 5.2). However, some
2252
- deployed implementations incorrectly assume that reference resolution
2253
- is not necessary when the reference is already a URI and thus fail to
2254
- remove dot-segments when they occur in non-relative paths. URI
2255
- normalizers should remove dot-segments by applying the
2256
- remove_dot_segments algorithm to the path, as described in
2257
- Section 5.2.4.
2258
-
2259
- 6.2.3. Scheme-Based Normalization
2260
-
2261
- The syntax and semantics of URIs vary from scheme to scheme, as
2262
- described by the defining specification for each scheme.
2263
- Implementations may use scheme-specific rules, at further processing
2264
- cost, to reduce the probability of false negatives. For example,
2265
- because the "http" scheme makes use of an authority component, has a
2266
- default port of "80", and defines an empty path to be equivalent to
2267
- "/", the following four URIs are equivalent:
2268
-
2269
- http://example.com
2270
- http://example.com/
2271
- http://example.com:/
2272
- http://example.com:80/
2273
-
2274
- In general, a URI that uses the generic syntax for authority with an
2275
- empty path should be normalized to a path of "/". Likewise, an
2276
- explicit ":port", for which the port is empty or the default for the
2277
- scheme, is equivalent to one where the port and its ":" delimiter are
2278
- elided and thus should be removed by scheme-based normalization. For
2279
- example, the second URI above is the normal form for the "http"
2280
- scheme.
2281
-
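The two rules above (empty path becomes "/", empty or default port is removed) can be sketched for URIs that carry an authority. The table `DEFAULT_PORTS` and the method `scheme_normalize` are illustrative; only "http" is listed, and a real implementation would take defaults from each scheme's specification.

```ruby
# Assumption: a hand-written table of default ports, keyed by scheme.
DEFAULT_PORTS = { "http" => "80" }

def scheme_normalize(uri)
  m = %r{\A([^:/?#]+)://([^/?#]*)([^?#]*)(.*)\z}m.match(uri)
  return uri unless m   # no authority: leave the URI untouched
  scheme, authority, path, rest = m.captures
  default = DEFAULT_PORTS[scheme.downcase]
  # Remove ":port" when the port is empty or the scheme default.
  authority = authority.sub(/:(\d*)\z/) do
    ($1.empty? || $1 == default) ? "" : ":#{$1}"
  end
  path = "/" if path.empty?
  "#{scheme}://#{authority}#{path}#{rest}"
end

%w[http://example.com http://example.com/
   http://example.com:/ http://example.com:80/].each do |uri|
  puts scheme_normalize(uri)
end
```

The query and fragment delimiters are carried through unchanged, so "http://example.com/?" keeps its trailing "?" and is not folded into the normal form.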
2282
- Another case where normalization varies by scheme is in the handling
2283
- of an empty authority component or empty host subcomponent. For many
2284
- scheme specifications, an empty authority or host is considered an
2285
- error; for others, it is considered equivalent to "localhost" or the
2286
- end-user's host. When a scheme defines a default for authority and a
2287
- URI reference to that default is desired, the reference should be
2288
- normalized to an empty authority for the sake of uniformity, brevity,
2289
- and internationalization. If, however, either the userinfo or port
2290
- subcomponents are non-empty, then the host should be given explicitly
2291
- even if it matches the default.
2292
-
2293
- Normalization should not remove delimiters when their associated
2294
- component is empty unless licensed to do so by the scheme
2303
- specification. For example, the URI "http://example.com/?" cannot be
2304
- assumed to be equivalent to any of the examples above. Likewise, the
2305
- presence or absence of delimiters within a userinfo subcomponent is
2306
- usually significant to its interpretation. The fragment component is
2307
- not subject to any scheme-based normalization; thus, two URIs that
2308
- differ only by the suffix "#" are considered different regardless of
2309
- the scheme.
2310
-
2311
- Some schemes define additional subcomponents that consist of case-
2312
- insensitive data, giving an implicit license to normalizers to
2313
- convert this data to a common case (e.g., all lowercase). For
2314
- example, URI schemes that define a subcomponent of path to contain an
2315
- Internet hostname, such as the "mailto" URI scheme, cause that
2316
- subcomponent to be case-insensitive and thus subject to case
2317
- normalization (e.g., "mailto:Joe@Example.COM" is equivalent to
2318
- "mailto:Joe@example.com", even though the generic syntax considers
2319
- the path component to be case-sensitive).
2320
-
2321
- Other scheme-specific normalizations are possible.
2322
-
2323
- 6.2.4. Protocol-Based Normalization
2324
-
2325
- Substantial effort to reduce the incidence of false negatives is
2326
- often cost-effective for web spiders. Therefore, they implement even
2327
- more aggressive techniques in URI comparison. For example, if they
2328
- observe that a URI such as
2329
-
2330
- http://example.com/data
2331
-
2332
- redirects to a URI differing only in the trailing slash
2333
-
2334
- http://example.com/data/
2335
-
2336
- they will likely regard the two as equivalent in the future. This
2337
- kind of technique is only appropriate when equivalence is clearly
2338
- indicated by both the result of accessing the resources and the
2339
- common conventions of their scheme's dereference algorithm (in this
2340
- case, use of redirection by HTTP origin servers to avoid problems
2341
- with relative references).
2342
-
2359
- 7. Security Considerations
2360
-
2361
- A URI does not in itself pose a security threat. However, as URIs
2362
- are often used to provide a compact set of instructions for access to
2363
- network resources, care must be taken to properly interpret the data
2364
- within a URI, to prevent that data from causing unintended access,
2365
- and to avoid including data that should not be revealed in plain
2366
- text.
2367
-
2368
- 7.1. Reliability and Consistency
2369
-
2370
- There is no guarantee that once a URI has been used to retrieve
2371
- information, the same information will be retrievable by that URI in
2372
- the future. Nor is there any guarantee that the information
2373
- retrievable via that URI in the future will be observably similar to
2374
- that retrieved in the past. The URI syntax does not constrain how a
2375
- given scheme or authority apportions its namespace or maintains it
2376
- over time. Such guarantees can only be obtained from the person(s)
2377
- controlling that namespace and the resource in question. A specific
2378
- URI scheme may define additional semantics, such as name persistence,
2379
- if those semantics are required of all naming authorities for that
2380
- scheme.
2381
-
2382
- 7.2. Malicious Construction
2383
-
2384
- It is sometimes possible to construct a URI so that an attempt to
2385
- perform a seemingly harmless, idempotent operation, such as the
2386
- retrieval of a representation, will in fact cause a possibly damaging
2387
- remote operation. The unsafe URI is typically constructed by
2388
- specifying a port number other than that reserved for the network
2389
- protocol in question. The client unwittingly contacts a site running
2390
- a different protocol service, and data within the URI contains
2391
- instructions that, when interpreted according to this other protocol,
2392
- cause an unexpected operation. A frequent example of such abuse has
2393
- been the use of a protocol-based scheme with a port component of
2394
- "25", thereby fooling user agent software into sending an unintended
2395
- or impersonating message via an SMTP server.
2396
-
2397
- Applications should prevent dereference of a URI that specifies a TCP
2398
- port number within the "well-known port" range (0 - 1023) unless the
2399
- protocol being used to dereference that URI is compatible with the
2400
- protocol expected on that well-known port. Although IANA maintains a
2401
- registry of well-known ports, applications should make such
2402
- restrictions user-configurable to avoid preventing the deployment of
2403
- new services.
2404
-
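One way to express such a check is sketched below. The table of compatible ports is hypothetical and deliberately passed as a parameter so that it can be made user-configurable; the method name `port_allowed?` is likewise illustrative.

```ruby
# Hypothetical default table: scheme => well-known ports considered
# compatible with the protocol used to dereference that scheme.
COMPATIBLE_WELL_KNOWN_PORTS = {
  "http"  => [80],
  "https" => [443],
  "ftp"   => [21]
}

def port_allowed?(scheme, port, table = COMPATIBLE_WELL_KNOWN_PORTS)
  return true if port.nil?      # no explicit port: the scheme default applies
  return true if port > 1023    # outside the well-known range (0 - 1023)
  table.fetch(scheme.downcase, []).include?(port)
end

puts port_allowed?("http", 25)     # SMTP port requested through an http URI
puts port_allowed?("http", 8080)
```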
2415
- When a URI contains percent-encoded octets that match the delimiters
2416
- for a given resolution or dereference protocol (for example, CR and
2417
- LF characters for the TELNET protocol), these percent-encodings must
2418
- not be decoded before transmission across that protocol. Transfer of
2419
- the percent-encoding, which might violate the protocol, is less
2420
- harmful than allowing decoded octets to be interpreted as additional
2421
- operations or parameters, perhaps triggering an unexpected and
2422
- possibly harmful remote operation.
2423
-
2424
- 7.3. Back-End Transcoding
2425
-
2426
- When a URI is dereferenced, the data within it is often parsed by
2427
- both the user agent and one or more servers. In HTTP, for example, a
2428
- typical user agent will parse a URI into its five major components,
2429
- access the authority's server, and send it the data within the
2430
- authority, path, and query components. A typical server will take
2431
- that information, parse the path into segments and the query into
2432
- key/value pairs, and then invoke implementation-specific handlers to
2433
- respond to the request. As a result, a common security concern for
2434
- server implementations that handle a URI, either as a whole or split
2435
- into separate components, is proper interpretation of the octet data
2436
- represented by the characters and percent-encodings within that URI.
2437
-
2438
- Percent-encoded octets must be decoded at some point during the
2439
- dereference process. Applications must split the URI into its
2440
- components and subcomponents prior to decoding the octets, as
2441
- otherwise the decoded octets might be mistaken for delimiters.
2442
- Security checks of the data within a URI should be applied after
2443
- decoding the octets. Note, however, that the "%00" percent-encoding
2444
- (NUL) may require special handling and should be rejected if the
2445
- application is not expecting to receive raw data within a component.
2446
-
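The ordering requirement can be illustrated with a query component. The sketch below splits the query into key/value pairs first and decodes each piece afterwards; `parse_query` is an illustrative name. It uses Ruby's `URI.decode_www_form_component`, which also turns "+" into a space, a convention of application/x-www-form-urlencoded data that is assumed here and is not part of the generic syntax.

```ruby
require "uri"

# Split on the delimiters first, then decode each key and value.
def parse_query(query)
  query.split("&").map do |pair|
    key, value = pair.split("=", 2)
    [URI.decode_www_form_component(key.to_s),
     URI.decode_www_form_component(value.to_s)]
  end
end

query = "q=a%26b%3Dc&lang=en"

# Correct order: the encoded "&" and "=" stay inside the value of "q".
p parse_query(query)

# Wrong order: decoding first turns the encoded octets into delimiters,
# so the same query appears to contain three pairs.
p URI.decode_www_form_component(query).split("&").map { |pair| pair.split("=", 2) }
```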
2447
- Special care should be taken when the URI path interpretation process
2448
- involves the use of a back-end file system or related system
2449
- functions. File systems typically assign an operational meaning to
2450
- special characters, such as the "/", "\", ":", "[", and "]"
2451
- characters, and to special device names like ".", "..", "...", "aux",
2452
- "lpt", etc. In some cases, merely testing for the existence of such
2453
- a name will cause the operating system to pause or invoke unrelated
2454
- system calls, leading to significant security concerns regarding
2455
- denial of service and unintended data transfer. It would be
2456
- impossible for this specification to list all such significant
2457
- characters and device names. Implementers should research the
2458
- reserved names and characters for the types of storage device that
2459
- may be attached to their applications and restrict the use of data
2460
- obtained from URI components accordingly.
2461
-
2471
- 7.4. Rare IP Address Formats
2472
-
2473
- Although the URI syntax for IPv4address only allows the common
2474
- dotted-decimal form of IPv4 address literal, many implementations
2475
- that process URIs make use of platform-dependent system routines,
2476
- such as gethostbyname() and inet_aton(), to translate the string
2477
- literal to an actual IP address. Unfortunately, such system routines
2478
- often allow and process a much larger set of formats than those
2479
- described in Section 3.2.2.
2480
-
2481
- For example, many implementations allow dotted forms of three
2482
- numbers, wherein the last part is interpreted as a 16-bit quantity
2483
- and placed in the right-most two bytes of the network address (e.g.,
2484
- a Class B network). Likewise, a dotted form of two numbers means
2485
- that the last part is interpreted as a 24-bit quantity and placed in
2486
- the right-most three bytes of the network address (Class A), and a
2487
- single number (without dots) is interpreted as a 32-bit quantity and
2488
- stored directly in the network address. Adding further to the
2489
- confusion, some implementations allow each dotted part to be
2490
- interpreted as decimal, octal, or hexadecimal, as specified in the C
2491
- language (i.e., a leading 0x or 0X implies hexadecimal; a leading 0
2492
- implies octal; otherwise, the number is interpreted as decimal).
2493
-
2494
- These additional IP address formats are not allowed in the URI syntax
2495
- due to differences between platform implementations. However, they
2496
- can become a security concern if an application attempts to filter
2497
- access to resources based on the IP address in string literal format.
2498
- If this filtering is performed, literals should be converted to
2499
- numeric form and filtered based on the numeric value, and not on a
2500
- prefix or suffix of the string form.
2501
-
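The following sketch shows why filtering on the numeric value is more robust than filtering on the string. `lenient_ipv4_to_i` is an illustrative parser modelled on the forms described above (one to four parts, each decimal, octal, or hexadecimal); it is not a reimplementation of any platform's inet_aton(), and Ruby's `Integer()` accepts a few notations that C does not. The blocked range is a hypothetical example.

```ruby
# Convert a lenient IPv4 literal to its 32-bit numeric value, or nil.
def lenient_ipv4_to_i(literal)
  parts = literal.split(".", -1).map { |part| Integer(part) }
  return nil if parts.empty? || parts.size > 4
  *leading, last = parts
  return nil unless leading.all? { |n| (0..255).cover?(n) }
  # The last part fills all remaining bytes of the address.
  last_bits = 8 * (4 - leading.size)
  return nil unless (0...(1 << last_bits)).cover?(last)
  leading.each_with_index.reduce(last) do |acc, (n, i)|
    acc | (n << (8 * (3 - i)))
  end
rescue ArgumentError
  nil
end

# Hypothetical policy: block everything in 10.0.0.0/8.
BLOCKED_RANGE = (0x0A000000..0x0AFFFFFF)

def blocked?(literal)
  value = lenient_ipv4_to_i(literal)
  !value.nil? && BLOCKED_RANGE.cover?(value)
end

%w[10.0.0.1 167772161 0x0A.0.0.1 012.0.0.1 10.1].each do |literal|
  puts "#{literal}: prefix check #{literal.start_with?('10.')}, numeric check #{blocked?(literal)}"
end
```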
2502
- 7.5. Sensitive Information
2503
-
2504
- URI producers should not provide a URI that contains a username or
2505
- password that is intended to be secret. URIs are frequently
2506
- displayed by browsers, stored in clear text bookmarks, and logged by
2507
- user agent history and intermediary applications (proxies). A
2508
- password appearing within the userinfo component is deprecated and
2509
- should be considered an error (or simply ignored) except in those
2510
- rare cases where the 'password' parameter is intended to be public.
2511
-
2512
- 7.6. Semantic Attacks
2513
-
2514
- Because the userinfo subcomponent is rarely used and appears before
2515
- the host in the authority component, it can be used to construct a
2516
- URI intended to mislead a human user by appearing to identify one
2517
- (trusted) naming authority while actually identifying a different
2518
- authority hidden behind the noise. For example
2519
-
2527
- ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm
2528
-
2529
- might lead a human user to assume that the host is 'cnn.example.com',
2530
- whereas it is actually '10.0.0.1'. Note that a misleading userinfo
2531
- subcomponent could be much longer than the example above.
2532
-
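A user agent that wants to render the components separately first has to split the authority. A minimal sketch, with the illustrative name `authority_parts` and a simplified pattern that assumes a well-formed URI with an authority:

```ruby
# Split the authority of a URI into userinfo and host.
def authority_parts(uri)
  authority = uri[%r{\A[^:/?#]+://([^/?#]*)}, 1]
  return nil unless authority
  userinfo, at, hostport = authority.rpartition("@")
  {
    userinfo: (at.empty? ? nil : userinfo),
    host: hostport.sub(/:\d*\z/, "")   # strip an optional ":port"
  }
end

parts = authority_parts(
  "ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm"
)
puts parts[:userinfo]
puts parts[:host]
```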
2533
- A misleading URI, such as that above, is an attack on the user's
2534
- preconceived notions about the meaning of a URI rather than an attack
2535
- on the software itself. User agents may be able to reduce the impact
2536
- of such attacks by distinguishing the various components of the URI
2537
- when they are rendered, such as by using a different color or tone to
2538
- render userinfo if any is present, though there is no panacea. More
2539
- information on URI-based semantic attacks can be found in [Siedzik].
2540
-
2541
- 8. IANA Considerations
2542
-
2543
- URI scheme names, as defined by <scheme> in Section 3.1, form a
2544
- registered namespace that is managed by IANA according to the
2545
- procedures defined in [BCP35]. No IANA actions are required by this
2546
- document.
2547
-
2548
- 9. Acknowledgements
2549
-
2550
- This specification is derived from RFC 2396 [RFC2396], RFC 1808
2551
- [RFC1808], and RFC 1738 [RFC1738]; the acknowledgements in those
2552
- documents still apply. It also incorporates the update (with
2553
- corrections) for IPv6 literals in the host syntax, as defined by
2554
- Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in
2555
- [RFC2732]. In addition, contributions by Gisle Aas, Reese Anschultz,
2556
- Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll,
2557
- Dan Connolly, Adam M. Costello, John Cowan, Jason Diamond, Martin
2558
- Duerst, Stefan Eissing, Clive D.W. Feather, Al Gilman, Tony Hammond,
2559
- Elliotte Harold, Pat Hayes, Henry Holtzman, Ian B. Jacobs, Michael
2560
- Kay, John C. Klensin, Graham Klyne, Dan Kohn, Bruce Lilly, Andrew
2561
- Main, Dave McAlpin, Ira McDonald, Michael Mealling, Ray Merkert,
2562
- Stephen Pollei, Julian Reschke, Tomas Rokicki, Miles Sabin, Kai
2563
- Schaetzl, Mark Thomson, Ronald Tschalaer, Norm Walsh, Marc Warne,
2564
- Stuart Williams, and Henry Zongaro are gratefully acknowledged.
2565
-
2566
- 10. References
2567
-
2568
- 10.1. Normative References
2569
-
2570
- [ASCII] American National Standards Institute, "Coded Character
2571
- Set -- 7-bit American Standard Code for Information
2572
- Interchange", ANSI X3.4, 1986.
2573
-
2583
- [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
2584
- Specifications: ABNF", RFC 2234, November 1997.
2585
-
2586
- [STD63] Yergeau, F., "UTF-8, a transformation format of
2587
- ISO 10646", STD 63, RFC 3629, November 2003.
2588
-
2589
- [UCS] International Organization for Standardization,
2590
- "Information Technology - Universal Multiple-Octet Coded
2591
- Character Set (UCS)", ISO/IEC 10646:2003, December 2003.
2592
-
2593
- 10.2. Informative References
2594
-
2595
- [BCP19] Freed, N. and J. Postel, "IANA Charset Registration
2596
- Procedures", BCP 19, RFC 2978, October 2000.
2597
-
2598
- [BCP35] Petke, R. and I. King, "Registration Procedures for URL
2599
- Scheme Names", BCP 35, RFC 2717, November 1999.
2600
-
2601
- [RFC0952] Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
2602
- host table specification", RFC 952, October 1985.
2603
-
2604
- [RFC1034] Mockapetris, P., "Domain names - concepts and facilities",
2605
- STD 13, RFC 1034, November 1987.
2606
-
2607
- [RFC1123] Braden, R., "Requirements for Internet Hosts - Application
2608
- and Support", STD 3, RFC 1123, October 1989.
2609
-
2610
- [RFC1535] Gavron, E., "A Security Problem and Proposed Correction
2611
- With Widely Deployed DNS Software", RFC 1535,
2612
- October 1993.
2613
-
2614
- [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A
2615
- Unifying Syntax for the Expression of Names and Addresses
2616
- of Objects on the Network as used in the World-Wide Web",
2617
- RFC 1630, June 1994.
2618
-
2619
- [RFC1736] Kunze, J., "Functional Recommendations for Internet
2620
- Resource Locators", RFC 1736, February 1995.
2621
-
2622
- [RFC1737] Sollins, K. and L. Masinter, "Functional Requirements for
2623
- Uniform Resource Names", RFC 1737, December 1994.
2624
-
2625
- [RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
2626
- Resource Locators (URL)", RFC 1738, December 1994.
2627
-
2628
- [RFC1808] Fielding, R., "Relative Uniform Resource Locators",
2629
- RFC 1808, June 1995.
2630
-
2639
- [RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
2640
- Extensions (MIME) Part Two: Media Types", RFC 2046,
2641
- November 1996.
2642
-
2643
- [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.
2644
-
2645
- [RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
2646
- Resource Identifiers (URI): Generic Syntax", RFC 2396,
2647
- August 1998.
2648
-
2649
- [RFC2518] Goland, Y., Whitehead, E., Faizi, A., Carter, S., and D.
2650
- Jensen, "HTTP Extensions for Distributed Authoring --
2651
- WEBDAV", RFC 2518, February 1999.
2652
-
2653
- [RFC2557] Palme, J., Hopmann, A., and N. Shelness, "MIME
2654
- Encapsulation of Aggregate Documents, such as HTML
2655
- (MHTML)", RFC 2557, March 1999.
2656
-
2657
- [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D., and R. Petke,
2658
- "Guidelines for new URL Schemes", RFC 2718, November 1999.
2659
-
2660
- [RFC2732] Hinden, R., Carpenter, B., and L. Masinter, "Format for
2661
- Literal IPv6 Addresses in URL's", RFC 2732, December 1999.
2662
-
2663
- [RFC3305] Mealling, M. and R. Denenberg, "Report from the Joint
2664
- W3C/IETF URI Planning Interest Group: Uniform Resource
2665
- Identifiers (URIs), URLs, and Uniform Resource Names
2666
- (URNs): Clarifications and Recommendations", RFC 3305,
2667
- August 2002.
2668
-
2669
- [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
2670
- "Internationalizing Domain Names in Applications (IDNA)",
2671
- RFC 3490, March 2003.
2672
-
2673
- [RFC3513] Hinden, R. and S. Deering, "Internet Protocol Version 6
2674
- (IPv6) Addressing Architecture", RFC 3513, April 2003.
2675
-
2676
- [Siedzik] Siedzik, R., "Semantic Attacks: What's in a URL?",
2677
- April 2001, <http://www.giac.org/practical/gsec/
2678
- Richard_Siedzik_GSEC.pdf>.
2679
-
2695
- Appendix A. Collected ABNF for URI
2696
-
2697
- URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
2698
-
2699
- hier-part = "//" authority path-abempty
2700
- / path-absolute
2701
- / path-rootless
2702
- / path-empty
2703
-
2704
- URI-reference = URI / relative-ref
2705
-
2706
- absolute-URI = scheme ":" hier-part [ "?" query ]
2707
-
2708
- relative-ref = relative-part [ "?" query ] [ "#" fragment ]
2709
-
2710
- relative-part = "//" authority path-abempty
2711
- / path-absolute
2712
- / path-noscheme
2713
- / path-empty
2714
-
2715
- scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
2716
-
2717
- authority = [ userinfo "@" ] host [ ":" port ]
2718
- userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
2719
- host = IP-literal / IPv4address / reg-name
2720
- port = *DIGIT
2721
-
2722
- IP-literal = "[" ( IPv6address / IPvFuture ) "]"
2723
-
2724
- IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
2725
-
2726
- IPv6address = 6( h16 ":" ) ls32
2727
- / "::" 5( h16 ":" ) ls32
2728
- / [ h16 ] "::" 4( h16 ":" ) ls32
2729
- / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
2730
- / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
2731
- / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
2732
- / [ *4( h16 ":" ) h16 ] "::" ls32
2733
- / [ *5( h16 ":" ) h16 ] "::" h16
2734
- / [ *6( h16 ":" ) h16 ] "::"
2735
-
2736
- h16 = 1*4HEXDIG
2737
- ls32 = ( h16 ":" h16 ) / IPv4address
2738
- IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
2739
-
2751
- dec-octet = DIGIT ; 0-9
2752
- / %x31-39 DIGIT ; 10-99
2753
- / "1" 2DIGIT ; 100-199
2754
- / "2" %x30-34 DIGIT ; 200-249
2755
- / "25" %x30-35 ; 250-255
2756
-
2757
- reg-name = *( unreserved / pct-encoded / sub-delims )
2758
-
2759
- path = path-abempty ; begins with "/" or is empty
2760
- / path-absolute ; begins with "/" but not "//"
2761
- / path-noscheme ; begins with a non-colon segment
2762
- / path-rootless ; begins with a segment
2763
- / path-empty ; zero characters
2764
-
2765
- path-abempty = *( "/" segment )
2766
- path-absolute = "/" [ segment-nz *( "/" segment ) ]
2767
- path-noscheme = segment-nz-nc *( "/" segment )
2768
- path-rootless = segment-nz *( "/" segment )
2769
- path-empty = 0<pchar>
2770
-
2771
- segment = *pchar
2772
- segment-nz = 1*pchar
2773
- segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
2774
- ; non-zero-length segment without any colon ":"
2775
-
2776
- pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
2777
-
2778
- query = *( pchar / "/" / "?" )
2779
-
2780
- fragment = *( pchar / "/" / "?" )
2781
-
2782
- pct-encoded = "%" HEXDIG HEXDIG
2783
-
2784
- unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
2785
- reserved = gen-delims / sub-delims
2786
- gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
2787
- sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
2788
- / "*" / "+" / "," / ";" / "="
2789
-
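The dec-octet and IPv4address rules above translate fairly directly into a regular expression. The following is a minimal Ruby sketch under that reading of the grammar; the constant and method names (DEC_OCTET, IPV4ADDRESS, ipv4address?) are illustrative assumptions and are not part of the RFC or of Addressable.

```ruby
# Hypothetical helper names; only the grammar itself comes from the ABNF above.
# dec-octet alternatives, longest ranges first: 250-255, 200-249, 100-199, 10-99, 0-9.
DEC_OCTET = /(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])/

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = /\A#{DEC_OCTET}(?:\.#{DEC_OCTET}){3}\z/

# Returns true when the whole string matches the IPv4address rule.
def ipv4address?(text)
  !!(text =~ IPV4ADDRESS)
end

puts ipv4address?("192.168.0.1")  # true
puts ipv4address?("256.1.1.1")    # false: 256 is outside 0-255
puts ipv4address?("01.2.3.4")     # false: dec-octet allows no leading zero
```

Anchoring with \A and \z means the whole string has to be an address, so a string that merely contains one is rejected.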
2790
- Appendix B. Parsing a URI Reference with a Regular Expression
2791
-
2792
- As the "first-match-wins" algorithm is identical to the "greedy"
2793
- disambiguation method used by POSIX regular expressions, it is
2794
- natural and commonplace to use a regular expression for parsing the
2795
- potential five components of a URI reference.
2796
-
2797
- The following line is the regular expression for breaking-down a
2798
- well-formed URI reference into its components.
2799
-
2807
-       ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
2808
-        12            3  4          5       6  7        8 9
2809
-
2810
- The numbers in the second line above are only to assist readability;
2811
- they indicate the reference points for each subexpression (i.e., each
2812
- paired parenthesis). We refer to the value matched for subexpression
2813
- <n> as $<n>. For example, matching the above expression to
2814
-
2815
- http://www.ics.uci.edu/pub/ietf/uri/#Related
2816
-
2817
- results in the following subexpression matches:
2818
-
2819
- $1 = http:
2820
- $2 = http
2821
- $3 = //www.ics.uci.edu
2822
- $4 = www.ics.uci.edu
2823
- $5 = /pub/ietf/uri/
2824
- $6 = <undefined>
2825
- $7 = <undefined>
2826
- $8 = #Related
2827
- $9 = Related
2828
-
2829
- where <undefined> indicates that the component is not present, as is
2830
- the case for the query component in the above example. Therefore, we
2831
- can determine the value of the five components as
2832
-
2833
- scheme = $2
2834
- authority = $4
2835
- path = $5
2836
- query = $7
2837
- fragment = $9
2838
-
2839
- Going in the opposite direction, we can recreate a URI reference from
2840
- its components by using the algorithm of Section 5.3.
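As an illustration of Appendix B, the regular expression can be applied in Ruby roughly as follows. This is a sketch only: the constant URI_REFERENCE and the method split_uri_reference are names chosen for this example, not identifiers from the RFC or from Addressable.

```ruby
# The regular expression from Appendix B, written with %r{} so that "/" needs no escaping.
# Note: in Ruby, "^" anchors at the start of a line; for single-line input that is
# the start of the string.
URI_REFERENCE = %r{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}

# Hypothetical helper: maps subexpressions $2, $4, $5, $7 and $9 to the five components.
# A component that is not present comes back as nil (the RFC's <undefined>).
def split_uri_reference(reference)
  match = URI_REFERENCE.match(reference)
  {
    scheme:    match[2],
    authority: match[4],
    path:      match[5],
    query:     match[7],
    fragment:  match[9]
  }
end

parts = split_uri_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related")
puts parts.inspect
```

Every group in the expression is optional, so it matches any input string. That is why the sketch does not check for a failed match.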
2841
-
2842
- Appendix C. Delimiting a URI in Context
2843
-
2844
- URIs are often transmitted through formats that do not provide a
2845
- clear context for their interpretation. For example, there are many
2846
- occasions when a URI is included in plain text; examples include text
2847
- sent in email, USENET news, and on printed paper. In such cases, it
2848
- is important to be able to delimit the URI from the rest of the text,
2849
- and in particular from punctuation marks that might be mistaken for
2850
- part of the URI.
2851
-
2852
- In practice, URIs are delimited in a variety of ways, but usually
2853
- within double-quotes "http://example.com/", angle brackets
2854
- <http://example.com/>, or just by using whitespace:
2855
-
2863
- http://example.com/
2864
-
2865
- These wrappers do not form part of the URI.
2866
-
2867
- In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may
2868
- have to be added to break a long URI across lines. The whitespace
2869
- should be ignored when the URI is extracted.
2870
-
2871
- No whitespace should be introduced after a hyphen ("-") character.
2872
- Because some typesetters and printers may (erroneously) introduce a
2873
- hyphen at the end of line when breaking it, the interpreter of a URI
2874
- containing a line break immediately after a hyphen should ignore all
2875
- whitespace around the line break and should be aware that the hyphen
2876
- may or may not actually be part of the URI.
2877
-
2878
- Using <> angle brackets around each URI is especially recommended as
2879
- a delimiting style for a reference that contains embedded whitespace.
2880
-
2881
- The prefix "URL:" (with or without a trailing space) was formerly
2882
- recommended as a way to help distinguish a URI from other bracketed
2883
- designators, though it is not commonly used in practice and is no
2884
- longer recommended.
2885
-
2886
- For robustness, software that accepts user-typed URI should attempt
2887
- to recognize and strip both delimiters and embedded whitespace.
2888
-
2889
- For example, the text
2890
-
2891
- Yes, Jim, I found it under "http://www.w3.org/Addressing/",
2892
- but you can probably pick it up from <ftp://foo.example.
2893
- com/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/
2894
- ietf/uri/historical.html#WARNING>.
2895
-
2896
- contains the URI references
2897
-
2898
- http://www.w3.org/Addressing/
2899
- ftp://foo.example.com/rfc/
2900
- http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING
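The extraction described in Appendix C can be sketched in Ruby as below. It assumes URIs are wrapped in double-quotes or angle brackets, removes embedded whitespace, and drops a leading "URL:" prefix. The method name extract_delimited_uris and the constant SAMPLE_TEXT are hypothetical, and the sketch does not attempt to find URIs delimited only by whitespace.

```ruby
# Hypothetical helper: collects the contents of "..." and <...> wrappers,
# then strips embedded whitespace and an optional "URL:" prefix.
def extract_delimited_uris(text)
  text.scan(/"([^"]*)"|<([^<>]*)>/).map do |quoted, bracketed|
    (quoted || bracketed).gsub(/\s+/, "").sub(/\AURL:/i, "")
  end
end

# The example text from Appendix C, including its line breaks inside the brackets.
SAMPLE_TEXT = "Yes, Jim, I found it under \"http://www.w3.org/Addressing/\",\n" \
              "but you can probably pick it up from <ftp://foo.example.\n" \
              "com/rfc/>.  Note the warning in <http://www.ics.uci.edu/pub/\n" \
              "ietf/uri/historical.html#WARNING>."

puts extract_delimited_uris(SAMPLE_TEXT)
# http://www.w3.org/Addressing/
# ftp://foo.example.com/rfc/
# http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING
```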
2901
-
2919
- Appendix D. Changes from RFC 2396
2920
-
2921
- D.1. Additions
2922
-
2923
- An ABNF rule for URI has been introduced to correspond to one common
2924
- usage of the term: an absolute URI with optional fragment.
2925
-
2926
- IPv6 (and later) literals have been added to the list of possible
2927
- identifiers for the host portion of an authority component, as
2928
- described by [RFC2732], with the addition of "[" and "]" to the
2929
- reserved set and a version flag to anticipate future versions of IP
2930
- literals. Square brackets are now specified as reserved within the
2931
- authority component and are not allowed outside their use as
2932
- delimiters for an IP literal within host. In order to make this
2933
- change without changing the technical definition of the path, query,
2934
- and fragment components, those rules were redefined to directly
2935
- specify the characters allowed.
2936
-
2937
- As [RFC2732] defers to [RFC3513] for definition of an IPv6 literal
2938
- address, which, unfortunately, lacks an ABNF description of
2939
- IPv6address, we created a new ABNF rule for IPv6address that matches
2940
- the text representations defined by Section 2.2 of [RFC3513].
2941
- Likewise, the definition of IPv4address has been improved in order to
2942
- limit each decimal octet to the range 0-255.
2943
-
2944
- Section 6, on URI normalization and comparison, has been completely
2945
- rewritten and extended by using input from Tim Bray and discussion
2946
- within the W3C Technical Architecture Group.
2947
-
2948
- D.2. Modifications
2949
-
2950
- The ad-hoc BNF syntax of RFC 2396 has been replaced with the ABNF of
2951
- [RFC2234]. This change required all rule names that formerly
2952
- included underscore characters to be renamed with a dash instead. In
2953
- addition, a number of syntax rules have been eliminated or simplified
2954
- to make the overall grammar more comprehensible. Specifications that
2955
- refer to the obsolete grammar rules may be understood by replacing
2956
- those rules according to the following table:
2957
-
2975
- +----------------+--------------------------------------------------+
2976
- | obsolete rule | translation |
2977
- +----------------+--------------------------------------------------+
2978
- | absoluteURI | absolute-URI |
2979
- | relativeURI | relative-part [ "?" query ] |
2980
- | hier_part | ( "//" authority path-abempty / |
2981
- | | path-absolute ) [ "?" query ] |
2982
- | | |
2983
- | opaque_part | path-rootless [ "?" query ] |
2984
- | net_path | "//" authority path-abempty |
2985
- | abs_path | path-absolute |
2986
- | rel_path | path-rootless |
2987
- | rel_segment | segment-nz-nc |
2988
- | reg_name | reg-name |
2989
- | server | authority |
2990
- | hostport | host [ ":" port ] |
2991
- | hostname | reg-name |
2992
- | path_segments | path-abempty |
2993
- | param | *<pchar excluding ";"> |
2994
- | | |
2995
- | uric | unreserved / pct-encoded / ";" / "?" / ":" |
2996
- | | / "@" / "&" / "=" / "+" / "$" / "," / "/" |
2997
- | | |
2998
- | uric_no_slash | unreserved / pct-encoded / ";" / "?" / ":" |
2999
- | | / "@" / "&" / "=" / "+" / "$" / "," |
3000
- | | |
3001
- | mark | "-" / "_" / "." / "!" / "~" / "*" / "'" |
3002
- | | / "(" / ")" |
3003
- | | |
3004
- | escaped | pct-encoded |
3005
- | hex | HEXDIG |
3006
- | alphanum | ALPHA / DIGIT |
3007
- +----------------+--------------------------------------------------+
3008
-
3009
- Use of the above obsolete rules for the definition of scheme-specific
3010
- syntax is deprecated.
3011
-
3012
- Section 2, on characters, has been rewritten to explain what
3013
- characters are reserved, when they are reserved, and why they are
3014
- reserved, even when they are not used as delimiters by the generic
3015
- syntax. The mark characters that are typically unsafe to decode,
3016
- including the exclamation mark ("!"), asterisk ("*"), single-quote
3017
- ("'"), and open and close parentheses ("(" and ")"), have been moved
3018
- to the reserved set in order to clarify the distinction between
3019
- reserved and unreserved and, hopefully, to answer the most common
3020
- question of scheme designers. Likewise, the section on
3021
- percent-encoded characters has been rewritten, and URI normalizers
3022
- are now given license to decode any percent-encoded octets
3031
- corresponding to unreserved characters. In general, the terms
3032
- "escaped" and "unescaped" have been replaced with "percent-encoded"
3033
- and "decoded", respectively, to reduce confusion with other forms of
3034
- escape mechanisms.
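The license given to normalizers, decoding percent-encoded octets that correspond to unreserved characters, might look like this in Ruby. This is a minimal sketch; UNRESERVED_CHARS and decode_unreserved are names invented for the example, and all other percent-encoded octets are deliberately left untouched.

```ruby
# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED_CHARS = (("A".."Z").to_a + ("a".."z").to_a +
                    ("0".."9").to_a + %w[- . _ ~]).freeze

# Hypothetical helper: replaces %XX with its character only when that
# character is unreserved; reserved and non-ASCII octets stay encoded.
def decode_unreserved(uri)
  uri.gsub(/%([0-9A-Fa-f]{2})/) do |escape|
    char = escape[1, 2].hex.chr
    UNRESERVED_CHARS.include?(char) ? char : escape
  end
end

puts decode_unreserved("%7Euser/%41%2Fb")  # ~user/A%2Fb
```

Leaving "%2F" encoded matters because "/" is a reserved delimiter, and decoding it could change how the path is split into segments.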
3035
-
3036
- The ABNF for URI and URI-reference has been redesigned to make them
3037
- more friendly to LALR parsers and to reduce complexity. As a result,
3038
- the layout form of syntax description has been removed, along with
3039
- the uric, uric_no_slash, opaque_part, net_path, abs_path, rel_path,
3040
- path_segments, rel_segment, and mark rules. All references to
3041
- "opaque" URIs have been replaced with a better description of how the
3042
- path component may be opaque to hierarchy. The relativeURI rule has
3043
- been replaced with relative-ref to avoid unnecessary confusion over
3044
- whether they are a subset of URI. The ambiguity regarding the
3045
- parsing of URI-reference as a URI or a relative-ref with a colon in
3046
- the first segment has been eliminated through the use of five
3047
- separate path matching rules.
3048
-
3049
- The fragment identifier has been moved back into the section on
3050
- generic syntax components and within the URI and relative-ref rules,
3051
- though it remains excluded from absolute-URI. The number sign ("#")
3052
- character has been moved back to the reserved set as a result of
3053
- reintegrating the fragment syntax.
3054
-
3055
- The ABNF has been corrected to allow the path component to be empty.
3056
- This also allows an absolute-URI to consist of nothing after the
3057
- "scheme:", as is present in practice with the "dav:" namespace
3058
- [RFC2518] and with the "about:" scheme used internally by many WWW
3059
- browser implementations. The ambiguity regarding the boundary
3060
- between authority and path has been eliminated through the use of
3061
- five separate path matching rules.
3062
-
3063
- Registry-based naming authorities that use the generic syntax are now
3064
- defined within the host rule. This change allows current
3065
- implementations, where whatever name provided is simply fed to the
3066
- local name resolution mechanism, to be consistent with the
3067
- specification. It also removes the need to re-specify DNS name
3068
- formats here. Furthermore, it allows the host component to contain
3069
- percent-encoded octets, which is necessary to enable
3070
- internationalized domain names to be provided in URIs, processed in
3071
- their native character encodings at the application layers above URI
3072
- processing, and passed to an IDNA library as a registered name in the
3073
- UTF-8 character encoding. The server, hostport, hostname,
3074
- domainlabel, toplabel, and alphanum rules have been removed.
3075
-
3076
- The resolving relative references algorithm of [RFC2396] has been
3077
- rewritten with pseudocode for this revision to improve clarity and
3078
- fix the following issues:
3079
-
3087
- o [RFC2396] section 5.2, step 6a, failed to account for a base URI
3088
- with no path.
3089
-
3090
- o Restored the behavior of [RFC1808] where, if the reference
3091
- contains an empty path and a defined query component, the target
3092
- URI inherits the base URI's path component.
3093
-
3094
- o The determination of whether a URI reference is a same-document
3095
- reference has been decoupled from the URI parser, simplifying the
3096
- URI processing interface within applications in a way consistent
3097
- with the internal architecture of deployed URI processing
3098
- implementations. The determination is now based on comparison to
3099
- the base URI after transforming a reference to absolute form,
3100
- rather than on the format of the reference itself. This change
3101
- may result in more references being considered "same-document"
3102
- under this specification than there would be under the rules given
3103
- in RFC 2396, especially when normalization is used to reduce
3104
- aliases. However, it does not change the status of existing
3105
- same-document references.
3106
-
3107
- o Separated the path merge routine into two routines: merge, for
3108
- describing combination of the base URI path with a relative-path
3109
- reference, and remove_dot_segments, for describing how to remove
3110
- the special "." and ".." segments from a composed path. The
3111
- remove_dot_segments algorithm is now applied to all URI reference
3112
- paths in order to match common implementations and to improve the
3113
- normalization of URIs in practice. This change only impacts the
3114
- parsing of abnormal references and same-scheme references wherein
3115
- the base URI has a non-hierarchical path.
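The remove_dot_segments routine mentioned in the last item is specified in Section 5.2.4 of this RFC, which is not reproduced in this part of the document. As a rough Ruby sketch of that routine, assuming the input is a path string only (no query or fragment):

```ruby
# Sketch of remove_dot_segments (RFC 3986, Section 5.2.4).
# Segments are collected in an array that acts as the output buffer.
def remove_dot_segments(path)
  input = path.dup
  output = []
  until input.empty?
    if input.start_with?("../")
      input = input[3..-1]             # drop a leading "../"
    elsif input.start_with?("./")
      input = input[2..-1]             # drop a leading "./"
    elsif input.start_with?("/./")
      input = input[2..-1]             # "/./" becomes "/"
    elsif input == "/."
      input = "/"
    elsif input.start_with?("/../")
      input = input[3..-1]             # "/../" becomes "/" ...
      output.pop                       # ... and the last output segment goes
    elsif input == "/.."
      input = "/"
      output.pop
    elsif input == "." || input == ".."
      input = ""
    else
      # Move the first segment (with its leading "/", if any) to the output.
      segment = input[/\A\/?[^\/]*/]
      output << segment
      input = input[segment.length..-1]
    end
  end
  output.join
end

puts remove_dot_segments("/a/b/c/./../../g")    # /a/g
puts remove_dot_segments("mid/content=5/../6")  # mid/6
```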
3116
-
3117
- Index
3118
-
3119
- A
3120
- ABNF 11
3121
- absolute 27
3122
- absolute-path 26
3123
- absolute-URI 27
3124
- access 9
3125
- authority 17, 18
3126
-
3127
- B
3128
- base URI 28
3129
-
3130
- C
3131
- character encoding 4
3132
- character 4
3133
- characters 8, 11
3134
- coded character set 4
3135
-
3143
- D
3144
- dec-octet 20
3145
- dereference 9
3146
- dot-segments 23
3147
-
3148
- F
3149
- fragment 16, 24
3150
-
3151
- G
3152
- gen-delims 13
3153
- generic syntax 6
3154
-
3155
- H
3156
- h16 20
3157
- hier-part 16
3158
- hierarchical 10
3159
- host 18
3160
-
3161
- I
3162
- identifier 5
3163
- IP-literal 19
3164
- IPv4 20
3165
- IPv4address 19, 20
3166
- IPv6 19
3167
- IPv6address 19, 20
3168
- IPvFuture 19
3169
-
3170
- L
3171
- locator 7
3172
- ls32 20
3173
-
3174
- M
3175
- merge 32
3176
-
3177
- N
3178
- name 7
3179
- network-path 26
3180
-
3181
- P
3182
- path 16, 22, 26
3183
- path-abempty 22
3184
- path-absolute 22
3185
- path-empty 22
3186
- path-noscheme 22
3187
- path-rootless 22
3188
- path-abempty 16, 22, 26
3189
- path-absolute 16, 22, 26
3190
- path-empty 16, 22, 26
3199
- path-rootless 16, 22
3200
- pchar 23
3201
- pct-encoded 12
3202
- percent-encoding 12
3203
- port 22
3204
-
3205
- Q
3206
- query 16, 23
3207
-
3208
- R
3209
- reg-name 21
3210
- registered name 20
3211
- relative 10, 28
3212
- relative-path 26
3213
- relative-ref 26
3214
- remove_dot_segments 33
3215
- representation 9
3216
- reserved 12
3217
- resolution 9, 28
3218
- resource 5
3219
- retrieval 9
3220
-
3221
- S
3222
- same-document 27
3223
- sameness 9
3224
- scheme 16, 17
3225
- segment 22, 23
3226
- segment-nz 23
3227
- segment-nz-nc 23
3228
- sub-delims 13
3229
- suffix 27
3230
-
3231
- T
3232
- transcription 8
3233
-
3234
- U
3235
- uniform 4
3236
- unreserved 13
3237
- URI grammar
3238
- absolute-URI 27
3239
- ALPHA 11
3240
- authority 18
3241
- CR 11
3242
- dec-octet 20
3243
- DIGIT 11
3244
- DQUOTE 11
3245
- fragment 24
3246
- gen-delims 13
3255
- h16 20
3256
- HEXDIG 11
3257
- hier-part 16
3258
- host 19
3259
- IP-literal 19
3260
- IPv4address 20
3261
- IPv6address 20
3262
- IPvFuture 19
3263
- LF 11
3264
- ls32 20
3265
- OCTET 11
3266
- path 22
3267
- path-abempty 22
3268
- path-absolute 22
3269
- path-empty 22
3270
- path-noscheme 22
3271
- path-rootless 22
3272
- pchar 23
3273
- pct-encoded 12
3274
- port 22
3275
- query 24
3276
- reg-name 21
3277
- relative-ref 26
3278
- reserved 13
3279
- scheme 17
3280
- segment 23
3281
- segment-nz 23
3282
- segment-nz-nc 23
3283
- SP 11
3284
- sub-delims 13
3285
- unreserved 13
3286
- URI 16
3287
- URI-reference 25
3288
- userinfo 18
3289
- URI 16
3290
- URI-reference 25
3291
- URL 7
3292
- URN 7
3293
- userinfo 18
3294
-
3311
- Authors' Addresses
3312
-
3313
- Tim Berners-Lee
3314
- World Wide Web Consortium
3315
- Massachusetts Institute of Technology
3316
- 77 Massachusetts Avenue
3317
- Cambridge, MA 02139
3318
- USA
3319
-
3320
- Phone: +1-617-253-5702
3321
- Fax: +1-617-258-5999
3322
- EMail: timbl@w3.org
3323
- URI: http://www.w3.org/People/Berners-Lee/
3324
-
3325
-
3326
- Roy T. Fielding
3327
- Day Software
3328
- 5251 California Ave., Suite 110
3329
- Irvine, CA 92617
3330
- USA
3331
-
3332
- Phone: +1-949-679-2960
3333
- Fax: +1-949-679-2972
3334
- EMail: fielding@gbiv.com
3335
- URI: http://roy.gbiv.com/
3336
-
3337
-
3338
- Larry Masinter
3339
- Adobe Systems Incorporated
3340
- 345 Park Ave
3341
- San Jose, CA 95110
3342
- USA
3343
-
3344
- Phone: +1-408-536-3024
3345
- EMail: LMM@acm.org
3346
- URI: http://larry.masinter.net/
3347
-
3367
- Full Copyright Statement
3368
-
3369
- Copyright (C) The Internet Society (2005).
3370
-
3371
- This document is subject to the rights, licenses and restrictions
3372
- contained in BCP 78, and except as set forth therein, the authors
3373
- retain all their rights.
3374
-
3375
- This document and the information contained herein are provided on an
3376
- "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
3377
- OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
3378
- ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
3379
- INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
3380
- INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
3381
- WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
3382
-
3383
- Intellectual Property
3384
-
3385
- The IETF takes no position regarding the validity or scope of any
3386
- Intellectual Property Rights or other rights that might be claimed to
3387
- pertain to the implementation or use of the technology described in
3388
- this document or the extent to which any license under such rights
3389
- might or might not be available; nor does it represent that it has
3390
- made any independent effort to identify any such rights. Information
3391
- on the IETF's procedures with respect to rights in IETF Documents can
3392
- be found in BCP 78 and BCP 79.
3393
-
3394
- Copies of IPR disclosures made to the IETF Secretariat and any
3395
- assurances of licenses to be made available, or the result of an
3396
- attempt made to obtain a general license or permission for the use of
3397
- such proprietary rights by implementers or users of this
3398
- specification can be obtained from the IETF on-line IPR repository at
3399
- http://www.ietf.org/ipr.
3400
-
3401
- The IETF invites any interested party to bring to its attention any
3402
- copyrights, patents or patent applications, or other proprietary
3403
- rights that may cover technology that may be required to implement
3404
- this standard. Please address the information to the IETF at ietf-
3405
- ipr@ietf.org.
3406
-
3407
-
3408
- Acknowledgement
3409
-
3410
- Funding for the RFC Editor function is currently provided by the
3411
- Internet Society.
3412
-