cerberus 0.7.6 → 0.7.7
- data/Changelog.txt +21 -1
- data/Rakefile +2 -2
- data/lib/cerberus/cli.rb +1 -1
- data/lib/cerberus/constants.rb +1 -1
- data/lib/cerberus/manager.rb +5 -5
- data/lib/cerberus/publisher/base.rb +1 -1
- data/lib/cerberus/publisher/irc.rb +8 -13
- data/lib/cerberus/publisher/mail.rb +1 -0
- data/lib/cerberus/scm/git.rb +1 -1
- data/lib/vendor/addressable/CHANGELOG +88 -0
- data/lib/vendor/addressable/LICENSE +20 -0
- data/lib/vendor/addressable/README +50 -0
- data/lib/vendor/addressable/Rakefile +74 -0
- data/lib/vendor/addressable/addressable.gemspec +30 -0
- data/lib/vendor/addressable/lib/addressable/idna.rb +4871 -0
- data/lib/vendor/addressable/lib/addressable/template.rb +1049 -0
- data/lib/vendor/addressable/lib/addressable/uri.rb +2078 -0
- data/lib/vendor/addressable/lib/addressable/version.rb +36 -0
- data/lib/vendor/addressable/spec/addressable/idna_spec.rb +194 -0
- data/lib/vendor/addressable/spec/addressable/template_spec.rb +2152 -0
- data/lib/vendor/addressable/spec/addressable/uri_spec.rb +3914 -0
- data/lib/vendor/addressable/spec/data/rfc3986.txt +3419 -0
- data/lib/vendor/addressable/tasks/clobber.rake +2 -0
- data/lib/vendor/addressable/tasks/gem.rake +68 -0
- data/lib/vendor/addressable/tasks/git.rake +40 -0
- data/lib/vendor/addressable/tasks/metrics.rake +22 -0
- data/lib/vendor/addressable/tasks/rdoc.rake +29 -0
- data/lib/vendor/addressable/tasks/rubyforge.rake +89 -0
- data/lib/vendor/addressable/tasks/spec.rake +47 -0
- data/lib/vendor/addressable/website/index.html +110 -0
- data/lib/vendor/shout-bot/TODO +2 -0
- data/lib/vendor/shout-bot/lib/shout-bot.rb +76 -0
- data/lib/vendor/shout-bot/shout-bot.gemspec +13 -0
- data/lib/vendor/tinder/CHANGELOG.txt +39 -0
- data/lib/vendor/tinder/Manifest.txt +10 -0
- data/lib/vendor/tinder/README.txt +48 -0
- data/lib/vendor/tinder/Rakefile +45 -0
- data/lib/vendor/tinder/VERSION +1 -0
- data/lib/vendor/tinder/init.rb +1 -0
- data/lib/vendor/tinder/lib/tinder.rb +12 -0
- data/lib/vendor/tinder/lib/tinder/campfire.rb +75 -0
- data/lib/vendor/tinder/lib/tinder/connection.rb +74 -0
- data/lib/vendor/tinder/lib/tinder/multipart.rb +63 -0
- data/lib/vendor/tinder/lib/tinder/room.rb +225 -0
- data/lib/vendor/tinder/site/index.html +101 -0
- data/lib/vendor/tinder/site/stylesheets/style.css +77 -0
- data/lib/vendor/tinder/spec/fixtures/rooms.json +18 -0
- data/lib/vendor/tinder/spec/fixtures/rooms/room80749.json +21 -0
- data/lib/vendor/tinder/spec/fixtures/rooms/room80751.json +21 -0
- data/lib/vendor/tinder/spec/fixtures/rooms/show.json +21 -0
- data/lib/vendor/tinder/spec/fixtures/users/me.json +11 -0
- data/lib/vendor/tinder/spec/spec.opts +2 -0
- data/lib/vendor/tinder/spec/spec_helper.rb +12 -0
- data/lib/vendor/tinder/spec/tinder/campfire_spec.rb +53 -0
- data/lib/vendor/tinder/spec/tinder/connection_spec.rb +29 -0
- data/lib/vendor/tinder/spec/tinder/room_spec.rb +98 -0
- data/lib/vendor/tinder/tinder.gemspec +79 -0
- data/test/functional_test.rb +19 -4
- data/test/integration_test.rb +1 -1
- data/test/irc_publisher_test.rb +4 -4
- data/test/mock/manager.rb +1 -1
- metadata +58 -20
- data/lib/vendor/irc/README +0 -23
- data/lib/vendor/irc/lib/IRC.rb +0 -164
- data/lib/vendor/irc/lib/IRCChannel.rb +0 -33
- data/lib/vendor/irc/lib/IRCConnection.rb +0 -134
- data/lib/vendor/irc/lib/IRCEvent.rb +0 -91
- data/lib/vendor/irc/lib/IRCUser.rb +0 -23
- data/lib/vendor/irc/lib/IRCUtil.rb +0 -49
- data/lib/vendor/irc/lib/eventmap.yml +0 -247
@@ -0,0 +1,3419 @@
+Network Working Group                                     T. Berners-Lee
+Request for Comments: 3986                                       W3C/MIT
+STD: 66                                                      R. Fielding
+Updates: 1738                                               Day Software
+Obsoletes: 2732, 2396, 1808                                  L. Masinter
+Category: Standards Track                                  Adobe Systems
+                                                            January 2005
+
+
+           Uniform Resource Identifier (URI): Generic Syntax
+
+Status of This Memo
+
+   This document specifies an Internet standards track protocol for the
+   Internet community, and requests discussion and suggestions for
+   improvements. Please refer to the current edition of the "Internet
+   Official Protocol Standards" (STD 1) for the standardization state
+   and status of this protocol. Distribution of this memo is unlimited.
+
+Copyright Notice
+
+   Copyright (C) The Internet Society (2005).
+
+Abstract
+
+   A Uniform Resource Identifier (URI) is a compact sequence of
+   characters that identifies an abstract or physical resource. This
+   specification defines the generic URI syntax and a process for
+   resolving URI references that might be in relative form, along with
+   guidelines and security considerations for the use of URIs on the
+   Internet. The URI syntax defines a grammar that is a superset of all
+   valid URIs, allowing an implementation to parse the common components
+   of a URI reference without knowing the scheme-specific requirements
+   of every possible identifier. This specification does not define a
+   generative grammar for URIs; that task is performed by the individual
+   specifications of each URI scheme.
+
+Berners-Lee, et al.         Standards Track                     [Page 1]
+
+RFC 3986                   URI Generic Syntax               January 2005
+
+Table of Contents
+
+   1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
+       1.1. Overview of URIs . . . . . . . . . . . . . . . . . . . . 4
+             1.1.1. Generic Syntax . . . . . . . . . . . . . . . . . 6
+             1.1.2. Examples . . . . . . . . . . . . . . . . . . . . 7
+             1.1.3. URI, URL, and URN . . . . . . . . . . . . . . . 7
+       1.2. Design Considerations . . . . . . . . . . . . . . . . . 8
+             1.2.1. Transcription . . . . . . . . . . . . . . . . . 8
+             1.2.2. Separating Identification from Interaction . . . 9
+             1.2.3. Hierarchical Identifiers . . . . . . . . . . . . 10
+       1.3. Syntax Notation . . . . . . . . . . . . . . . . . . . . 11
+   2. Characters . . . . . . . . . . . . . . . . . . . . . . . . . . 11
+       2.1. Percent-Encoding . . . . . . . . . . . . . . . . . . . . 12
+       2.2. Reserved Characters . . . . . . . . . . . . . . . . . . 12
+       2.3. Unreserved Characters . . . . . . . . . . . . . . . . . 13
+       2.4. When to Encode or Decode . . . . . . . . . . . . . . . . 14
+       2.5. Identifying Data . . . . . . . . . . . . . . . . . . . . 14
+   3. Syntax Components . . . . . . . . . . . . . . . . . . . . . . 16
+       3.1. Scheme . . . . . . . . . . . . . . . . . . . . . . . . . 17
+       3.2. Authority . . . . . . . . . . . . . . . . . . . . . . . 17
+             3.2.1. User Information . . . . . . . . . . . . . . . . 18
+             3.2.2. Host . . . . . . . . . . . . . . . . . . . . . . 18
+             3.2.3. Port . . . . . . . . . . . . . . . . . . . . . . 22
+       3.3. Path . . . . . . . . . . . . . . . . . . . . . . . . . . 22
+       3.4. Query . . . . . . . . . . . . . . . . . . . . . . . . . 23
+       3.5. Fragment . . . . . . . . . . . . . . . . . . . . . . . . 24
+   4. Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
+       4.1. URI Reference . . . . . . . . . . . . . . . . . . . . . 25
+       4.2. Relative Reference . . . . . . . . . . . . . . . . . . . 26
+       4.3. Absolute URI . . . . . . . . . . . . . . . . . . . . . . 27
+       4.4. Same-Document Reference . . . . . . . . . . . . . . . . 27
+       4.5. Suffix Reference . . . . . . . . . . . . . . . . . . . . 27
+   5. Reference Resolution . . . . . . . . . . . . . . . . . . . . . 28
+       5.1. Establishing a Base URI . . . . . . . . . . . . . . . . 28
+             5.1.1. Base URI Embedded in Content . . . . . . . . . . 29
+             5.1.2. Base URI from the Encapsulating Entity . . . . . 29
+             5.1.3. Base URI from the Retrieval URI . . . . . . . . 30
+             5.1.4. Default Base URI . . . . . . . . . . . . . . . . 30
+       5.2. Relative Resolution . . . . . . . . . . . . . . . . . . 30
+             5.2.1. Pre-parse the Base URI . . . . . . . . . . . . . 31
+             5.2.2. Transform References . . . . . . . . . . . . . . 31
+             5.2.3. Merge Paths . . . . . . . . . . . . . . . . . . 32
+             5.2.4. Remove Dot Segments . . . . . . . . . . . . . . 33
+       5.3. Component Recomposition . . . . . . . . . . . . . . . . 35
+       5.4. Reference Resolution Examples . . . . . . . . . . . . . 35
+             5.4.1. Normal Examples . . . . . . . . . . . . . . . . 36
+             5.4.2. Abnormal Examples . . . . . . . . . . . . . . . 36
+   6. Normalization and Comparison . . . . . . . . . . . . . . . . . 38
+       6.1. Equivalence . . . . . . . . . . . . . . . . . . . . . . 38
+       6.2. Comparison Ladder . . . . . . . . . . . . . . . . . . . 39
+             6.2.1. Simple String Comparison . . . . . . . . . . . . 39
+             6.2.2. Syntax-Based Normalization . . . . . . . . . . . 40
+             6.2.3. Scheme-Based Normalization . . . . . . . . . . . 41
+             6.2.4. Protocol-Based Normalization . . . . . . . . . . 42
+   7. Security Considerations . . . . . . . . . . . . . . . . . . . 43
+       7.1. Reliability and Consistency . . . . . . . . . . . . . . 43
+       7.2. Malicious Construction . . . . . . . . . . . . . . . . . 43
+       7.3. Back-End Transcoding . . . . . . . . . . . . . . . . . . 44
+       7.4. Rare IP Address Formats . . . . . . . . . . . . . . . . 45
+       7.5. Sensitive Information . . . . . . . . . . . . . . . . . 45
+       7.6. Semantic Attacks . . . . . . . . . . . . . . . . . . . . 45
+   8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 46
+   9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 46
+   10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 46
+       10.1. Normative References . . . . . . . . . . . . . . . . . . 46
+       10.2. Informative References . . . . . . . . . . . . . . . . . 47
+   A. Collected ABNF for URI . . . . . . . . . . . . . . . . . . . . 49
+   B. Parsing a URI Reference with a Regular Expression . . . . . . 50
+   C. Delimiting a URI in Context . . . . . . . . . . . . . . . . . 51
+   D. Changes from RFC 2396 . . . . . . . . . . . . . . . . . . . . 53
+       D.1. Additions . . . . . . . . . . . . . . . . . . . . . . . 53
+       D.2. Modifications . . . . . . . . . . . . . . . . . . . . . 53
+   Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
+   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 60
+   Full Copyright Statement . . . . . . . . . . . . . . . . . . . . . 61
+
+1. Introduction
+
+   A Uniform Resource Identifier (URI) provides a simple and extensible
+   means for identifying a resource. This specification of URI syntax
+   and semantics is derived from concepts introduced by the World Wide
+   Web global information initiative, whose use of these identifiers
+   dates from 1990 and is described in "Universal Resource Identifiers
+   in WWW" [RFC1630]. The syntax is designed to meet the
+   recommendations laid out in "Functional Recommendations for Internet
+   Resource Locators" [RFC1736] and "Functional Requirements for Uniform
+   Resource Names" [RFC1737].
+
+   This document obsoletes [RFC2396], which merged "Uniform Resource
+   Locators" [RFC1738] and "Relative Uniform Resource Locators"
+   [RFC1808] in order to define a single, generic syntax for all URIs.
+   It obsoletes [RFC2732], which introduced syntax for an IPv6 address.
+   It excludes portions of RFC 1738 that defined the specific syntax of
+   individual URI schemes; those portions will be updated as separate
+   documents. The process for registration of new URI schemes is
+   defined separately by [BCP35]. Advice for designers of new URI
+   schemes can be found in [RFC2718]. All significant changes from RFC
+   2396 are noted in Appendix D.
+
+   This specification uses the terms "character" and "coded character
+   set" in accordance with the definitions provided in [BCP19], and
+   "character encoding" in place of what [BCP19] refers to as a
+   "charset".
+
+1.1. Overview of URIs
+
+   URIs are characterized as follows:
+
+   Uniform
+
+      Uniformity provides several benefits. It allows different types
+      of resource identifiers to be used in the same context, even when
+      the mechanisms used to access those resources may differ. It
+      allows uniform semantic interpretation of common syntactic
+      conventions across different types of resource identifiers. It
+      allows introduction of new types of resource identifiers without
+      interfering with the way that existing identifiers are used. It
+      allows the identifiers to be reused in many different contexts,
+      thus permitting new applications or protocols to leverage a pre-
+      existing, large, and widely used set of resource identifiers.
+
+   Resource
+
+      This specification does not limit the scope of what might be a
+      resource; rather, the term "resource" is used in a general sense
+      for whatever might be identified by a URI. Familiar examples
+      include an electronic document, an image, a source of information
+      with a consistent purpose (e.g., "today's weather report for Los
+      Angeles"), a service (e.g., an HTTP-to-SMS gateway), and a
+      collection of other resources. A resource is not necessarily
+      accessible via the Internet; e.g., human beings, corporations, and
+      bound books in a library can also be resources. Likewise,
+      abstract concepts can be resources, such as the operators and
+      operands of a mathematical equation, the types of a relationship
+      (e.g., "parent" or "employee"), or numeric values (e.g., zero,
+      one, and infinity).
+
+   Identifier
+
+      An identifier embodies the information required to distinguish
+      what is being identified from all other things within its scope of
+      identification. Our use of the terms "identify" and "identifying"
+      refer to this purpose of distinguishing one resource from all
+      other resources, regardless of how that purpose is accomplished
+      (e.g., by name, address, or context). These terms should not be
+      mistaken as an assumption that an identifier defines or embodies
+      the identity of what is referenced, though that may be the case
+      for some identifiers. Nor should it be assumed that a system
+      using URIs will access the resource identified: in many cases,
+      URIs are used to denote resources without any intention that they
+      be accessed. Likewise, the "one" resource identified might not be
+      singular in nature (e.g., a resource might be a named set or a
+      mapping that varies over time).
+
+   A URI is an identifier consisting of a sequence of characters
+   matching the syntax rule named <URI> in Section 3. It enables
+   uniform identification of resources via a separately defined
+   extensible set of naming schemes (Section 3.1). How that
+   identification is accomplished, assigned, or enabled is delegated to
+   each scheme specification.
+
+   This specification does not place any limits on the nature of a
+   resource, the reasons why an application might seek to refer to a
+   resource, or the kinds of systems that might use URIs for the sake of
+   identifying resources. This specification does not require that a
+   URI persists in identifying the same resource over time, though that
+   is a common goal of all URI schemes. Nevertheless, nothing in this
+   specification prevents an application from limiting itself to
+   particular types of resources, or to a subset of URIs that maintains
+   characteristics desired by that application.
+
+   URIs have a global scope and are interpreted consistently regardless
+   of context, though the result of that interpretation may be in
+   relation to the end-user's context. For example, "http://localhost/"
+   has the same interpretation for every user of that reference, even
+   though the network interface corresponding to "localhost" may be
+   different for each end-user: interpretation is independent of access.
+   However, an action made on the basis of that reference will take
+   place in relation to the end-user's context, which implies that an
+   action intended to refer to a globally unique thing must use a URI
+   that distinguishes that resource from all other things. URIs that
+   identify in relation to the end-user's local context should only be
+   used when the context itself is a defining aspect of the resource,
+   such as when an on-line help manual refers to a file on the end-
+   user's file system (e.g., "file:///etc/hosts").
+
+1.1.1. Generic Syntax
+
+   Each URI begins with a scheme name, as defined in Section 3.1, that
+   refers to a specification for assigning identifiers within that
+   scheme. As such, the URI syntax is a federated and extensible naming
+   system wherein each scheme's specification may further restrict the
+   syntax and semantics of identifiers using that scheme.
+
+   This specification defines those elements of the URI syntax that are
+   required of all URI schemes or are common to many URI schemes. It
+   thus defines the syntax and semantics needed to implement a scheme-
+   independent parsing mechanism for URI references, by which the
+   scheme-dependent handling of a URI can be postponed until the
+   scheme-dependent semantics are needed. Likewise, protocols and data
+   formats that make use of URI references can refer to this
+   specification as a definition for the range of syntax allowed for all
+   URIs, including those schemes that have yet to be defined. This
+   decouples the evolution of identification schemes from the evolution
+   of protocols, data formats, and implementations that make use of
+   URIs.
+
+   A parser of the generic URI syntax can parse any URI reference into
+   its major components. Once the scheme is determined, further
+   scheme-specific parsing can be performed on the components. In other
+   words, the URI generic syntax is a superset of the syntax of all URI
+   schemes.
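
The scheme-independent parsing described in Section 1.1.1 can be sketched in Ruby. This is an illustrative sketch, not part of the vendored file or of Addressable: it assumes the regular expression listed in Appendix B of RFC 3986 (outside this excerpt), and the names `URI_REFERENCE` and `split_uri_reference` are hypothetical.

```ruby
# Regular expression from RFC 3986, Appendix B, anchored to the whole string.
# The /m flag lets "." match newlines, so the pattern matches any input.
URI_REFERENCE = %r{\A(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?\z}m

# Split a URI reference into the five generic components.
# Absent components come back as nil; the path may be an empty string.
def split_uri_reference(reference)
  match = URI_REFERENCE.match(reference)
  {
    scheme:    match[2],
    authority: match[4],
    path:      match[5],
    query:     match[7],
    fragment:  match[9]
  }
end

p split_uri_reference("ldap://[2001:db8::7]/c=GB?objectClass?one")
p split_uri_reference("mailto:John.Doe@example.com")
```

Nothing in the split depends on the scheme, so scheme-specific handling (for example, interpreting the two parts of the "ldap" query) can be deferred until after the generic components are known.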
+
+1.1.2. Examples
+
+   The following example URIs illustrate several URI schemes and
+   variations in their common syntax components:
+
+      ftp://ftp.is.co.za/rfc/rfc1808.txt
+
+      http://www.ietf.org/rfc/rfc2396.txt
+
+      ldap://[2001:db8::7]/c=GB?objectClass?one
+
+      mailto:John.Doe@example.com
+
+      news:comp.infosystems.www.servers.unix
+
+      tel:+1-816-555-1212
+
+      telnet://192.0.2.16:80/
+
+      urn:oasis:names:specification:docbook:dtd:xml:4.1.2
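
To make the variation among these examples concrete, the following Ruby sketch extracts each scheme and notes whether an authority component ("//...") follows it. It is only an illustration: the `SCHEME` pattern assumes the scheme grammar of Section 3.1 (a letter followed by letters, digits, "+", "-", or "."), and the names `EXAMPLES`, `SCHEME`, and `describe` are hypothetical.

```ruby
# The eight example URIs from Section 1.1.2.
EXAMPLES = [
  "ftp://ftp.is.co.za/rfc/rfc1808.txt",
  "http://www.ietf.org/rfc/rfc2396.txt",
  "ldap://[2001:db8::7]/c=GB?objectClass?one",
  "mailto:John.Doe@example.com",
  "news:comp.infosystems.www.servers.unix",
  "tel:+1-816-555-1212",
  "telnet://192.0.2.16:80/",
  "urn:oasis:names:specification:docbook:dtd:xml:4.1.2"
].freeze

# Assumed scheme grammar: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) ":"
SCHEME = /\A([A-Za-z][A-Za-z0-9+.\-]*):(.*)\z/m

# Report the scheme and whether the rest of the URI starts with "//".
def describe(uri)
  match = SCHEME.match(uri) or raise ArgumentError, "no scheme in #{uri.inspect}"
  { scheme: match[1], has_authority: match[2].start_with?("//") }
end

EXAMPLES.each { |uri| p describe(uri) }
```

Under these assumptions, four of the examples (ftp, http, ldap, telnet) carry an authority component, while the other four (mailto, news, tel, urn) have only a scheme-specific path.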
+
+
+1.1.3. URI, URL, and URN
+
+   A URI can be further classified as a locator, a name, or both. The
+   term "Uniform Resource Locator" (URL) refers to the subset of URIs
+   that, in addition to identifying a resource, provide a means of
+   locating the resource by describing its primary access mechanism
+   (e.g., its network "location"). The term "Uniform Resource Name"
+   (URN) has been used historically to refer to both URIs under the
+   "urn" scheme [RFC2141], which are required to remain globally unique
+   and persistent even when the resource ceases to exist or becomes
+   unavailable, and to any other URI with the properties of a name.
+
+   An individual scheme does not have to be classified as being just one
+   of "name" or "locator". Instances of URIs from any given scheme may
+   have the characteristics of names or locators or both, often
+   depending on the persistence and care in the assignment of
+   identifiers by the naming authority, rather than on any quality of
+   the scheme. Future specifications and related documentation should
+   use the general term "URI" rather than the more restrictive terms
+   "URL" and "URN" [RFC3305].
+
+1.2. Design Considerations
+
+1.2.1. Transcription
+
+   The URI syntax has been designed with global transcription as one of
+   its main considerations. A URI is a sequence of characters from a
+   very limited set: the letters of the basic Latin alphabet, digits,
+   and a few special characters. A URI may be represented in a variety
+   of ways; e.g., ink on paper, pixels on a screen, or a sequence of
+   character encoding octets. The interpretation of a URI depends only
+   on the characters used and not on how those characters are
+   represented in a network protocol.
+
+   The goal of transcription can be described by a simple scenario.
+   Imagine two colleagues, Sam and Kim, sitting in a pub at an
+   international conference and exchanging research ideas. Sam asks Kim
+   for a location to get more information, so Kim writes the URI for the
+   research site on a napkin. Upon returning home, Sam takes out the
+   napkin and types the URI into a computer, which then retrieves the
+   information to which Kim referred.
+
+   There are several design considerations revealed by the scenario:
+
+   o  A URI is a sequence of characters that is not always represented
+      as a sequence of octets.
+
+   o  A URI might be transcribed from a non-network source and thus
+      should consist of characters that are most likely able to be
+      entered into a computer, within the constraints imposed by
+      keyboards (and related input devices) across languages and
+      locales.
+
+   o  A URI often has to be remembered by people, and it is easier for
+      people to remember a URI when it consists of meaningful or
+      familiar components.
+
+   These design considerations are not always in alignment. For
+   example, it is often the case that the most meaningful name for a URI
+   component would require characters that cannot be typed into some
+   systems. The ability to transcribe a resource identifier from one
+   medium to another has been considered more important than having a
+   URI consist of the most meaningful of components.
+
+   In local or regional contexts and with improving technology, users
+   might benefit from being able to use a wider range of characters;
+   such use is not defined by this specification. Percent-encoded
+   octets (Section 2.1) may be used within a URI to represent characters
+   outside the range of the US-ASCII coded character set if this
+   representation is allowed by the scheme or by the protocol element in
+   which the URI is referenced. Such a definition should specify the
+   character encoding used to map those characters to octets prior to
+   being percent-encoded for the URI.
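
The last paragraph of Section 1.2.1 says that characters outside US-ASCII are first mapped to octets by some character encoding and only then percent-encoded. The Ruby sketch below illustrates why the chosen encoding matters. It assumes the unreserved set from Section 2.3 (letters, digits, "-", ".", "_", "~") and encodes everything else; `UNRESERVED` and `percent_encode` are hypothetical names, and real code would normally rely on a URI library instead.

```ruby
# Assumed unreserved set (RFC 3986, Section 2.3); these octets stay literal.
UNRESERVED = /\A[A-Za-z0-9\-._~]\z/

# Map the text to octets using the given character encoding, then write
# every octet that is not an unreserved ASCII character as "%XX".
def percent_encode(text, encoding = Encoding::UTF_8)
  text.encode(encoding).bytes.map do |byte|
    if byte < 0x80 && UNRESERVED.match?(byte.chr)
      byte.chr
    else
      format("%%%02X", byte)
    end
  end.join
end

word = "r\u00E9sum\u00E9"
puts percent_encode(word)                        # octets taken from UTF-8
puts percent_encode(word, Encoding::ISO_8859_1)  # octets taken from ISO-8859-1
```

The same text yields different percent-encoded strings under the two encodings, which is why a scheme or protocol that allows such characters needs to state which encoding it uses.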
+
+1.2.2. Separating Identification from Interaction
+
+   A common misunderstanding of URIs is that they are only used to refer
+   to accessible resources. The URI itself only provides
+   identification; access to the resource is neither guaranteed nor
+   implied by the presence of a URI. Instead, any operation associated
+   with a URI reference is defined by the protocol element, data format
+   attribute, or natural language text in which it appears.
+
+   Given a URI, a system may attempt to perform a variety of operations
+   on the resource, as might be characterized by words such as "access",
+   "update", "replace", or "find attributes". Such operations are
+   defined by the protocols that make use of URIs, not by this
+   specification. However, we do use a few general terms for describing
+   common operations on URIs. URI "resolution" is the process of
+   determining an access mechanism and the appropriate parameters
+   necessary to dereference a URI; this resolution may require several
+   iterations. To use that access mechanism to perform an action on the
+   URI's resource is to "dereference" the URI.
+
+   When URIs are used within information retrieval systems to identify
+   sources of information, the most common form of URI dereference is
+   "retrieval": making use of a URI in order to retrieve a
+   representation of its associated resource. A "representation" is a
+   sequence of octets, along with representation metadata describing
+   those octets, that constitutes a record of the state of the resource
+   at the time when the representation is generated. Retrieval is
+   achieved by a process that might include using the URI as a cache key
+   to check for a locally cached representation, resolution of the URI
+   to determine an appropriate access mechanism (if any), and
+   dereference of the URI for the sake of applying a retrieval
+   operation. Depending on the protocols used to perform the retrieval,
+   additional information might be supplied about the resource (resource
+   metadata) and its relation to other resources.
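
The retrieval process described above (cache lookup keyed by the URI, then resolution and dereference) might be sketched as follows. This is a deliberately simplified illustration: the class name `CachingRetriever` is hypothetical, the cache is a plain in-memory Hash, and resolution plus dereference are folded into a caller-supplied block rather than any real network access.

```ruby
# Retrieves representations, using the URI string itself as the cache key.
class CachingRetriever
  # The block stands in for "resolve the URI, then dereference it".
  def initialize(&dereference)
    @dereference = dereference
    @cache = {}
  end

  # Return the cached representation if present; otherwise dereference
  # the URI once and remember the result.
  def retrieve(uri)
    return @cache[uri] if @cache.key?(uri)

    @cache[uri] = @dereference.call(uri)
  end
end

lookups = 0
retriever = CachingRetriever.new do |uri|
  lookups += 1
  "representation of #{uri}"
end

retriever.retrieve("http://www.ietf.org/rfc/rfc2396.txt")
retriever.retrieve("http://www.ietf.org/rfc/rfc2396.txt")
puts lookups
```

Because the sketch never expires cache entries, it ignores the late-binding nature of URI references discussed in the surrounding text; a realistic cache would need freshness rules defined by the retrieval protocol.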
+
+   URI references in information retrieval systems are designed to be
+   late-binding: the result of an access is generally determined when it
+   is accessed and may vary over time or due to other aspects of the
+   interaction. These references are created in order to be used in the
+   future: what is being identified is not some specific result that was
+   obtained in the past, but rather some characteristic that is expected
+   to be true for future results. In such cases, the resource referred
+   to by the URI is actually a sameness of characteristics as observed
+   over time, perhaps elucidated by additional comments or assertions
+   made by the resource provider.
+
+   Although many URI schemes are named after protocols, this does not
+   imply that use of these URIs will result in access to the resource
+   via the named protocol. URIs are often used simply for the sake of
+   identification. Even when a URI is used to retrieve a representation
+   of a resource, that access might be through gateways, proxies,
+   caches, and name resolution services that are independent of the
+   protocol associated with the scheme name. The resolution of some
+   URIs may require the use of more than one protocol (e.g., both DNS
+   and HTTP are typically used to access an "http" URI's origin server
+   when a representation isn't found in a local cache).
+
+1.2.3. Hierarchical Identifiers
+
+   The URI syntax is organized hierarchically, with components listed in
+   order of decreasing significance from left to right. For some URI
+   schemes, the visible hierarchy is limited to the scheme itself:
+   everything after the scheme component delimiter (":") is considered
+   opaque to URI processing. Other URI schemes make the hierarchy
+   explicit and visible to generic parsing algorithms.
+
+   The generic syntax uses the slash ("/"), question mark ("?"), and
+   number sign ("#") characters to delimit components that are
+   significant to the generic parser's hierarchical interpretation of an
+   identifier. In addition to aiding the readability of such
+   identifiers through the consistent use of familiar syntax, this
+   uniform representation of hierarchy across naming schemes allows
+   scheme-independent references to be made relative to that hierarchy.
+
+   It is often the case that a group or "tree" of documents has been
+   constructed to serve a common purpose, wherein the vast majority of
+   URI references in these documents point to resources within the tree
+   rather than outside it. Similarly, documents located at a particular
+   site are much more likely to refer to other resources at that site
+   than to resources at remote sites. Relative referencing of URIs
+   allows document trees to be partially independent of their location
+   and access scheme. For instance, it is possible for a single set of
+   hypertext documents to be simultaneously accessible and traversable
+   via each of the "file", "http", and "ftp" schemes if the documents
+   refer to each other with relative references. Furthermore, such
+   document trees can be moved, as a whole, without changing any of the
+   relative references.
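
The claim that a document tree can be moved without changing its relative references can be illustrated with Ruby's standard `uri` library, whose `URI.join` resolves a reference against a base URI. The hosts and paths below are hypothetical, and both bases use the "http" scheme only to keep the sketch simple.

```ruby
require "uri"

# One relative reference, as it might appear inside docs/guide/intro.html.
RELATIVE_REFERENCE = "../images/logo.png"

# The same document tree published at two hypothetical locations.
BASES = [
  "http://example.com/docs/guide/intro.html",
  "http://mirror.example.net/archive/docs/guide/intro.html"
].freeze

# Resolve the reference against each base URI.
def resolve_all(bases, reference)
  bases.map { |base| URI.join(base, reference).to_s }
end

puts resolve_all(BASES, RELATIVE_REFERENCE)
```

The reference text is identical in both copies of the tree; only the base URI changes, so each copy resolves to its own `docs/images/logo.png`.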
+
+   A relative reference (Section 4.2) refers to a resource by describing
+   the difference within a hierarchical name space between the reference
+   context and the target URI. The reference resolution algorithm,
|
559
|
+
|
560
|
+
|
561
|
+
|
562
|
+
Berners-Lee, et al. Standards Track [Page 10]
|
563
|
+
|
564
|
+
RFC 3986 URI Generic Syntax January 2005
|
565
|
+
|
566
|
+
|
567
|
+
presented in Section 5, defines how such a reference is transformed
|
568
|
+
to the target URI. As relative references can only be used within
|
569
|
+
the context of a hierarchical URI, designers of new URI schemes
|
570
|
+
should use a syntax consistent with the generic syntax's hierarchical
|
571
|
+
components unless there are compelling reasons to forbid relative
|
572
|
+
referencing within that scheme.
|
573
|
+
|
574
|
+
NOTE: Previous specifications used the terms "partial URI" and
|
575
|
+
"relative URI" to denote a relative reference to a URI. As some
|
576
|
+
readers misunderstood those terms to mean that relative URIs are a
|
577
|
+
subset of URIs rather than a method of referencing URIs, this
|
578
|
+
specification simply refers to them as relative references.
|
579
|
+
|
580
|
+
All URI references are parsed by generic syntax parsers when used.
|
581
|
+
However, because hierarchical processing has no effect on an absolute
|
582
|
+
URI used in a reference unless it contains one or more dot-segments
|
583
|
+
(complete path segments of "." or "..", as described in Section 3.3),
|
584
|
+
URI scheme specifications can define opaque identifiers by
|
585
|
+
disallowing use of slash characters, question mark characters, and
|
586
|
+
the URIs "scheme:." and "scheme:..".
|
587
|
+
|
588
|
+
1.3. Syntax Notation
|
589
|
+
|
590
|
+
This specification uses the Augmented Backus-Naur Form (ABNF)
|
591
|
+
notation of [RFC2234], including the following core ABNF syntax rules
|
592
|
+
defined by that specification: ALPHA (letters), CR (carriage return),
|
593
|
+
DIGIT (decimal digits), DQUOTE (double quote), HEXDIG (hexadecimal
|
594
|
+
digits), LF (line feed), and SP (space). The complete URI syntax is
|
595
|
+
collected in Appendix A.
|
596
|
+
|
597
|
+
2. Characters
|
598
|
+
|
599
|
+
The URI syntax provides a method of encoding data, presumably for the
|
600
|
+
sake of identifying a resource, as a sequence of characters. The URI
|
601
|
+
characters are, in turn, frequently encoded as octets for transport
|
602
|
+
or presentation. This specification does not mandate any particular
|
603
|
+
character encoding for mapping between URI characters and the octets
|
604
|
+
used to store or transmit those characters. When a URI appears in a
|
605
|
+
protocol element, the character encoding is defined by that protocol;
|
606
|
+
without such a definition, a URI is assumed to be in the same
|
607
|
+
character encoding as the surrounding text.
|
608
|
+
|
609
|
+
The ABNF notation defines its terminal values to be non-negative
|
610
|
+
integers (codepoints) based on the US-ASCII coded character set
|
611
|
+
[ASCII]. Because a URI is a sequence of characters, we must invert
|
612
|
+
that relation in order to understand the URI syntax. Therefore, the
|
623
|
+
integer values used by the ABNF must be mapped back to their
|
624
|
+
corresponding characters via US-ASCII in order to complete the syntax
|
625
|
+
rules.
|
626
|
+
|
627
|
+
A URI is composed from a limited set of characters consisting of
|
628
|
+
digits, letters, and a few graphic symbols. A reserved subset of
|
629
|
+
those characters may be used to delimit syntax components within a
|
630
|
+
URI while the remaining characters, including both the unreserved set
|
631
|
+
and those reserved characters not acting as delimiters, define each
|
632
|
+
component's identifying data.
|
633
|
+
|
634
|
+
2.1. Percent-Encoding
|
635
|
+
|
636
|
+
A percent-encoding mechanism is used to represent a data octet in a
|
637
|
+
component when that octet's corresponding character is outside the
|
638
|
+
allowed set or is being used as a delimiter of, or within, the
|
639
|
+
component. A percent-encoded octet is encoded as a character
|
640
|
+
triplet, consisting of the percent character "%" followed by the two
|
641
|
+
hexadecimal digits representing that octet's numeric value. For
|
642
|
+
example, "%20" is the percent-encoding for the binary octet
|
643
|
+
"00100000" (ABNF: %x20), which in US-ASCII corresponds to the space
|
644
|
+
character (SP). Section 2.4 describes when percent-encoding and
|
645
|
+
decoding is applied.
|
646
|
+
|
647
|
+
pct-encoded = "%" HEXDIG HEXDIG
|
648
|
+
|
649
|
+
The uppercase hexadecimal digits 'A' through 'F' are equivalent to
|
650
|
+
the lowercase digits 'a' through 'f', respectively. If two URIs
|
651
|
+
differ only in the case of hexadecimal digits used in percent-encoded
|
652
|
+
octets, they are equivalent. For consistency, URI producers and
|
653
|
+
normalizers should use uppercase hexadecimal digits for all percent-
|
654
|
+
encodings.
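A minimal Python sketch of the pct-encoded rule and the uppercase recommendation follows. The function names are hypothetical; only the "%" HEXDIG HEXDIG shape and the preference for uppercase digits come from the text above.

```python
import re

def pct_encode_octet(octet):
    """Return the "%" HEXDIG HEXDIG triplet for one octet (0-255)."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0-255")
    return "%{:02X}".format(octet)

def uppercase_pct_triplets(uri):
    """Rewrite the hex digits of every percent-encoded triplet in uppercase."""
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), uri)
```

For example, pct_encode_octet(0x20) yields "%20", and uppercase_pct_triplets("a%2fb") yields "a%2Fb".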
|
655
|
+
|
656
|
+
2.2. Reserved Characters
|
657
|
+
|
658
|
+
URIs include components and subcomponents that are delimited by
|
659
|
+
characters in the "reserved" set. These characters are called
|
660
|
+
"reserved" because they may (or may not) be defined as delimiters by
|
661
|
+
the generic syntax, by each scheme-specific syntax, or by the
|
662
|
+
implementation-specific syntax of a URI's dereferencing algorithm.
|
663
|
+
If data for a URI component would conflict with a reserved
|
664
|
+
character's purpose as a delimiter, then the conflicting data must be
|
665
|
+
percent-encoded before the URI is formed.
|
666
|
+
|
679
|
+
reserved = gen-delims / sub-delims
|
680
|
+
|
681
|
+
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
|
682
|
+
|
683
|
+
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
|
684
|
+
/ "*" / "+" / "," / ";" / "="
|
685
|
+
|
686
|
+
The purpose of reserved characters is to provide a set of delimiting
|
687
|
+
characters that are distinguishable from other data within a URI.
|
688
|
+
URIs that differ in the replacement of a reserved character with its
|
689
|
+
corresponding percent-encoded octet are not equivalent. Percent-
|
690
|
+
encoding a reserved character, or decoding a percent-encoded octet
|
691
|
+
that corresponds to a reserved character, will change how the URI is
|
692
|
+
interpreted by most applications. Thus, characters in the reserved
|
693
|
+
set are protected from normalization and are therefore safe to be
|
694
|
+
used by scheme-specific and producer-specific algorithms for
|
695
|
+
delimiting data subcomponents within a URI.
|
696
|
+
|
697
|
+
A subset of the reserved characters (gen-delims) is used as
|
698
|
+
delimiters of the generic URI components described in Section 3. A
|
699
|
+
component's ABNF syntax rule will not use the reserved or gen-delims
|
700
|
+
rule names directly; instead, each syntax rule lists the characters
|
701
|
+
allowed within that component (i.e., not delimiting it), and any of
|
702
|
+
those characters that are also in the reserved set are "reserved" for
|
703
|
+
use as subcomponent delimiters within the component. Only the most
|
704
|
+
common subcomponents are defined by this specification; other
|
705
|
+
subcomponents may be defined by a URI scheme's specification, or by
|
706
|
+
the implementation-specific syntax of a URI's dereferencing
|
707
|
+
algorithm, provided that such subcomponents are delimited by
|
708
|
+
characters in the reserved set allowed within that component.
|
709
|
+
|
710
|
+
URI producing applications should percent-encode data octets that
|
711
|
+
correspond to characters in the reserved set unless these characters
|
712
|
+
are specifically allowed by the URI scheme to represent data in that
|
713
|
+
component. If a reserved character is found in a URI component and
|
714
|
+
no delimiting role is known for that character, then it must be
|
715
|
+
interpreted as representing the data octet corresponding to that
|
716
|
+
character's encoding in US-ASCII.
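The producer-side rule above might be sketched in Python as follows. The helper name and its allowed_reserved parameter are assumptions made for illustration; which reserved characters may carry data in a given component is decided by the URI scheme, not by this code. urllib.parse.quote is the standard library function that leaves unreserved characters and the listed "safe" characters untouched.

```python
from urllib.parse import quote

GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
RESERVED = GEN_DELIMS + SUB_DELIMS

def encode_component_data(data, allowed_reserved=""):
    """Percent-encode data, leaving only the listed reserved characters literal."""
    return quote(data, safe=allowed_reserved)
```

With no allowed characters, "a/b?c" becomes "a%2Fb%3Fc"; if a scheme permits "=" and "&" as data in a component, passing them in allowed_reserved keeps "a=1&b=2" unchanged.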
|
717
|
+
|
718
|
+
2.3. Unreserved Characters
|
719
|
+
|
720
|
+
Characters that are allowed in a URI but do not have a reserved
|
721
|
+
purpose are called unreserved. These include uppercase and lowercase
|
722
|
+
letters, decimal digits, hyphen, period, underscore, and tilde.
|
723
|
+
|
724
|
+
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
|
725
|
+
|
735
|
+
URIs that differ in the replacement of an unreserved character with
|
736
|
+
its corresponding percent-encoded US-ASCII octet are equivalent: they
|
737
|
+
identify the same resource. However, URI comparison implementations
|
738
|
+
do not always perform normalization prior to comparison (see Section
|
739
|
+
6). For consistency, percent-encoded octets in the ranges of ALPHA
|
740
|
+
(%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
|
741
|
+
underscore (%5F), or tilde (%7E) should not be created by URI
|
742
|
+
producers and, when found in a URI, should be decoded to their
|
743
|
+
corresponding unreserved characters by URI normalizers.
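A normalizer following this recommendation could look roughly like the Python sketch below. The function name is hypothetical, and the sketch handles only percent-encoding; it performs none of the other normalizations discussed in Section 6.

```python
import re

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def normalize_pct_encoding(uri):
    """Decode triplets for unreserved characters; uppercase all others."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, uri)
```

Triplets that correspond to reserved characters, such as "%2f", are kept encoded because decoding them would change how the URI is interpreted.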
|
744
|
+
|
745
|
+
2.4. When to Encode or Decode
|
746
|
+
|
747
|
+
Under normal circumstances, the only time when octets within a URI
|
748
|
+
are percent-encoded is during the process of producing the URI from
|
749
|
+
its component parts. This is when an implementation determines which
|
750
|
+
of the reserved characters are to be used as subcomponent delimiters
|
751
|
+
and which can be safely used as data. Once produced, a URI is always
|
752
|
+
in its percent-encoded form.
|
753
|
+
|
754
|
+
When a URI is dereferenced, the components and subcomponents
|
755
|
+
significant to the scheme-specific dereferencing process (if any)
|
756
|
+
must be parsed and separated before the percent-encoded octets within
|
757
|
+
those components can be safely decoded, as otherwise the data may be
|
758
|
+
mistaken for component delimiters. The only exception is for
|
759
|
+
percent-encoded octets corresponding to characters in the unreserved
|
760
|
+
set, which can be decoded at any time. For example, the octet
|
761
|
+
corresponding to the tilde ("~") character is often encoded as "%7E"
|
762
|
+
by older URI processing implementations; the "%7E" can be replaced by
|
763
|
+
"~" without changing its interpretation.
|
764
|
+
|
765
|
+
Because the percent ("%") character serves as the indicator for
|
766
|
+
percent-encoded octets, it must be percent-encoded as "%25" for that
|
767
|
+
octet to be used as data within a URI. Implementations must not
|
768
|
+
percent-encode or decode the same string more than once, as decoding
|
769
|
+
an already decoded string might lead to misinterpreting a percent
|
770
|
+
data octet as the beginning of a percent-encoding, or vice versa in
|
771
|
+
the case of percent-encoding an already percent-encoded string.
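The hazard of decoding twice can be shown with a short Python sketch using urllib.parse.quote and unquote from the standard library. The sample data string is made up for the example.

```python
from urllib.parse import quote, unquote

data = "100%41"                        # literal data containing a "%" character
encoded = quote(data, safe="")         # the "%" data octet becomes "%25"
decoded_once = unquote(encoded)        # recovers the original data
decoded_twice = unquote(decoded_once)  # wrong: "%41" is now read as "A"
```

After one decoding pass the data is intact; the second pass silently turns it into a different string.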
|
772
|
+
|
773
|
+
2.5. Identifying Data
|
774
|
+
|
775
|
+
URI characters provide identifying data for each of the URI
|
776
|
+
components, serving as an external interface for identification
|
777
|
+
between systems. Although the presence and nature of the URI
|
778
|
+
production interface is hidden from clients that use its URIs (and is
|
779
|
+
thus beyond the scope of the interoperability requirements defined by
|
780
|
+
this specification), it is a frequent source of confusion and errors
|
781
|
+
in the interpretation of URI character issues. Implementers have to
|
782
|
+
be aware that there are multiple character encodings involved in the
|
791
|
+
production and transmission of URIs: local name and data encoding,
|
792
|
+
public interface encoding, URI character encoding, data format
|
793
|
+
encoding, and protocol encoding.
|
794
|
+
|
795
|
+
Local names, such as file system names, are stored with a local
|
796
|
+
character encoding. URI producing applications (e.g., origin
|
797
|
+
servers) will typically use the local encoding as the basis for
|
798
|
+
producing meaningful names. The URI producer will transform the
|
799
|
+
local encoding to one that is suitable for a public interface and
|
800
|
+
then transform the public interface encoding into the restricted set
|
801
|
+
of URI characters (reserved, unreserved, and percent-encodings).
|
802
|
+
Those characters are, in turn, encoded as octets to be used as a
|
803
|
+
reference within a data format (e.g., a document charset), and such
|
804
|
+
data formats are often subsequently encoded for transmission over
|
805
|
+
Internet protocols.
|
806
|
+
|
807
|
+
For most systems, an unreserved character appearing within a URI
|
808
|
+
component is interpreted as representing the data octet corresponding
|
809
|
+
to that character's encoding in US-ASCII. Consumers of URIs assume
|
810
|
+
that the letter "X" corresponds to the octet "01011000", and even
|
811
|
+
when that assumption is incorrect, there is no harm in making it. A
|
812
|
+
system that internally provides identifiers in the form of a
|
813
|
+
different character encoding, such as EBCDIC, will generally perform
|
814
|
+
character translation of textual identifiers to UTF-8 [STD63] (or
|
815
|
+
some other superset of the US-ASCII character encoding) at an
|
816
|
+
internal interface, thereby providing more meaningful identifiers
|
817
|
+
than those resulting from simply percent-encoding the original
|
818
|
+
octets.
|
819
|
+
|
820
|
+
For example, consider an information service that provides data,
|
821
|
+
stored locally using an EBCDIC-based file system, to clients on the
|
822
|
+
Internet through an HTTP server. When an author creates a file with
|
823
|
+
the name "Laguna Beach" on that file system, the "http" URI
|
824
|
+
corresponding to that resource is expected to contain the meaningful
|
825
|
+
string "Laguna%20Beach". If, however, that server produces URIs by
|
826
|
+
using an overly simplistic raw octet mapping, then the result would
|
827
|
+
be a URI containing "%D3%81%87%A4%95%81@%C2%85%81%83%88". An
|
828
|
+
internal transcoding interface fixes this problem by transcoding the
|
829
|
+
local name to a superset of US-ASCII prior to producing the URI.
|
830
|
+
Naturally, proper interpretation of an incoming URI on such an
|
831
|
+
interface requires that percent-encoded octets be decoded (e.g.,
|
832
|
+
"%20" to SP) before the reverse transcoding is applied to obtain the
|
833
|
+
local name.
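The "Laguna Beach" example can be reproduced with a small Python sketch. It assumes the EBCDIC file system uses code page 037 (Python's "cp037" codec); the text above does not name a specific code page, so that choice is an assumption, as is keeping "@" literal to mirror the string shown above.

```python
from urllib.parse import quote, unquote

local_name = "Laguna Beach"

# Overly simplistic mapping: percent-encode the raw EBCDIC octets.
# cp037 is an assumed code page; "@" is left literal to match the example.
raw_octets = local_name.encode("cp037")
naive = quote(raw_octets, safe="@")

# Transcoding interface: convert to UTF-8 first, then percent-encode.
meaningful = quote(local_name.encode("utf-8"), safe="")

# Incoming direction: decode the percent-encoding, then transcode back.
round_trip = unquote(meaningful).encode("cp037")
```

Here naive holds the unreadable raw-octet form, meaningful holds "Laguna%20Beach", and round_trip equals the original EBCDIC octets.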
|
834
|
+
|
835
|
+
In some cases, the internal interface between a URI component and the
|
836
|
+
identifying data that it has been crafted to represent is much less
|
837
|
+
direct than a character encoding translation. For example, portions
|
838
|
+
of a URI might reflect a query on non-ASCII data, or numeric
|
847
|
+
coordinates on a map. Likewise, a URI scheme may define components
|
848
|
+
with additional encoding requirements that are applied prior to
|
849
|
+
forming the component and producing the URI.
|
850
|
+
|
851
|
+
When a new URI scheme defines a component that represents textual
|
852
|
+
data consisting of characters from the Universal Character Set [UCS],
|
853
|
+
the data should first be encoded as octets according to the UTF-8
|
854
|
+
character encoding [STD63]; then only those octets that do not
|
855
|
+
correspond to characters in the unreserved set should be percent-
|
856
|
+
encoded. For example, the character A would be represented as "A",
|
857
|
+
the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
|
858
|
+
as "%C3%80", and the character KATAKANA LETTER A would be represented
|
859
|
+
as "%E3%82%A2".
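The three examples above can be checked with a short Python sketch. The helper name is hypothetical; urllib.parse.quote with an empty safe set leaves only unreserved characters unencoded.

```python
from urllib.parse import quote

def encode_ucs_text(text):
    """UTF-8 encode text, then percent-encode octets outside the unreserved set."""
    return quote(text.encode("utf-8"), safe="")

examples = {
    "A": encode_ucs_text("A"),
    "LATIN CAPITAL LETTER A WITH GRAVE": encode_ucs_text("\u00C0"),
    "KATAKANA LETTER A": encode_ucs_text("\u30A2"),
}
```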
|
860
|
+
|
861
|
+
3. Syntax Components
|
862
|
+
|
863
|
+
The generic URI syntax consists of a hierarchical sequence of
|
864
|
+
components referred to as the scheme, authority, path, query, and
|
865
|
+
fragment.
|
866
|
+
|
867
|
+
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
|
868
|
+
|
869
|
+
hier-part = "//" authority path-abempty
|
870
|
+
/ path-absolute
|
871
|
+
/ path-rootless
|
872
|
+
/ path-empty
|
873
|
+
|
874
|
+
The scheme and path components are required, though the path may be
|
875
|
+
empty (no characters). When authority is present, the path must
|
876
|
+
either be empty or begin with a slash ("/") character. When
|
877
|
+
authority is not present, the path cannot begin with two slash
|
878
|
+
characters ("//"). These restrictions result in five different ABNF
|
879
|
+
rules for a path (Section 3.3), only one of which will match any
|
880
|
+
given URI reference.
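As a sketch of how a non-validating parser might separate the five components, the following Python code uses the regular expression given in Appendix B of this RFC. The function name split_uri is hypothetical, and the sketch performs no validation of the individual components.

```python
import re

# Component-splitting pattern as given in Appendix B of RFC 3986.
URI_PATTERN = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_uri(uri):
    """Return (scheme, authority, path, query, fragment); absent parts are None."""
    m = URI_PATTERN.match(uri)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)
```

Applied to "foo://example.com:8042/over/there?name=ferret#nose", this returns all five components; applied to "urn:example:animal:ferret:nose", only the scheme and a rootless path are present.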
|
881
|
+
|
882
|
+
The following are two example URIs and their component parts:
|
883
|
+
|
884
|
+
         foo://example.com:8042/over/there?name=ferret#nose
         \_/   \______________/\_________/ \_________/ \__/
          |           |            |            |        |
       scheme     authority       path        query   fragment
          |   _____________________|__
         / \ /                        \
         urn:example:animal:ferret:nose
|
891
|
+
|
903
|
+
3.1. Scheme
|
904
|
+
|
905
|
+
Each URI begins with a scheme name that refers to a specification for
|
906
|
+
assigning identifiers within that scheme. As such, the URI syntax is
|
907
|
+
a federated and extensible naming system wherein each scheme's
|
908
|
+
specification may further restrict the syntax and semantics of
|
909
|
+
identifiers using that scheme.
|
910
|
+
|
911
|
+
Scheme names consist of a sequence of characters beginning with a
|
912
|
+
letter and followed by any combination of letters, digits, plus
|
913
|
+
("+"), period ("."), or hyphen ("-"). Although schemes are case-
|
914
|
+
insensitive, the canonical form is lowercase and documents that
|
915
|
+
specify schemes must do so with lowercase letters. An implementation
|
916
|
+
should accept uppercase letters as equivalent to lowercase in scheme
|
917
|
+
names (e.g., allow "HTTP" as well as "http") for the sake of
|
918
|
+
robustness but should only produce lowercase scheme names for
|
919
|
+
consistency.
|
920
|
+
|
921
|
+
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
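The scheme rule translates directly into a regular expression. The Python sketch below, with a hypothetical function name, accepts uppercase input for robustness and produces the lowercase canonical form.

```python
import re

# ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*")

def normalize_scheme(scheme):
    """Validate a scheme name and return its canonical lowercase form."""
    if not SCHEME_RE.fullmatch(scheme):
        raise ValueError("not a valid scheme name: {!r}".format(scheme))
    return scheme.lower()
```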
|
922
|
+
|
923
|
+
Individual schemes are not specified by this document. The process
|
924
|
+
for registration of new URI schemes is defined separately by [BCP35].
|
925
|
+
The scheme registry maintains the mapping between scheme names and
|
926
|
+
their specifications. Advice for designers of new URI schemes can be
|
927
|
+
found in [RFC2718]. URI scheme specifications must define their own
|
928
|
+
syntax so that all strings matching their scheme-specific syntax will
|
929
|
+
also match the <absolute-URI> grammar, as described in Section 4.3.
|
930
|
+
|
931
|
+
When presented with a URI that violates one or more scheme-specific
|
932
|
+
restrictions, the scheme-specific resolution process should flag the
|
933
|
+
reference as an error rather than ignore the unused parts; doing so
|
934
|
+
reduces the number of equivalent URIs and helps detect abuses of the
|
935
|
+
generic syntax, which might indicate that the URI has been
|
936
|
+
constructed to mislead the user (Section 7.6).
|
937
|
+
|
938
|
+
3.2. Authority
|
939
|
+
|
940
|
+
Many URI schemes include a hierarchical element for a naming
|
941
|
+
authority so that governance of the name space defined by the
|
942
|
+
remainder of the URI is delegated to that authority (which may, in
|
943
|
+
turn, delegate it further). The generic syntax provides a common
|
944
|
+
means for distinguishing an authority based on a registered name or
|
945
|
+
server address, along with optional port and user information.
|
946
|
+
|
947
|
+
The authority component is preceded by a double slash ("//") and is
|
948
|
+
terminated by the next slash ("/"), question mark ("?"), or number
|
949
|
+
sign ("#") character, or by the end of the URI.
|
950
|
+
|
959
|
+
authority = [ userinfo "@" ] host [ ":" port ]
|
960
|
+
|
961
|
+
URI producers and normalizers should omit the ":" delimiter that
|
962
|
+
separates host from port if the port component is empty. Some
|
963
|
+
schemes do not allow the userinfo and/or port subcomponents.
|
964
|
+
|
965
|
+
If a URI contains an authority component, then the path component
|
966
|
+
must either be empty or begin with a slash ("/") character. Non-
|
967
|
+
validating parsers (those that merely separate a URI reference into
|
968
|
+
its major components) will often ignore the subcomponent structure of
|
969
|
+
authority, treating it as an opaque string from the double-slash to
|
970
|
+
the first terminating delimiter, until such time as the URI is
|
971
|
+
dereferenced.
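When the subcomponent structure is eventually needed, the authority string can be split along its delimiters. The Python sketch below is one possible approach under the assumption that the input already matches the authority rule; the function name is hypothetical and no subcomponent is validated.

```python
def split_authority(authority):
    """Split authority into (userinfo, host, port); absent parts are None."""
    userinfo, at, hostport = authority.rpartition("@")
    if not at:
        userinfo = None
    host, port = hostport, None
    # A ":" after any closing "]" separates host from port; colons inside
    # a bracketed IP literal belong to the host.
    colon = hostport.rfind(":")
    if colon > hostport.rfind("]"):
        host, port = hostport[:colon], hostport[colon + 1:]
    return userinfo, host, port
```

For "example.com:8042" this gives host "example.com" and port "8042" with no userinfo.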
|
972
|
+
|
973
|
+
3.2.1. User Information
|
974
|
+
|
975
|
+
The userinfo subcomponent may consist of a user name and, optionally,
|
976
|
+
scheme-specific information about how to gain authorization to access
|
977
|
+
the resource. The user information, if present, is followed by a
|
978
|
+
commercial at-sign ("@") that delimits it from the host.
|
979
|
+
|
980
|
+
userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
|
981
|
+
|
982
|
+
Use of the format "user:password" in the userinfo field is
|
983
|
+
deprecated. Applications should not render as clear text any data
|
984
|
+
after the first colon (":") character found within a userinfo
|
985
|
+
subcomponent unless the data after the colon is the empty string
|
986
|
+
(indicating no password). Applications may choose to ignore or
|
987
|
+
reject such data when it is received as part of a reference and
|
988
|
+
should reject the storage of such data in unencrypted form. The
|
989
|
+
passing of authentication information in clear text has proven to be
|
990
|
+
a security risk in almost every case where it has been used.
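An application that displays URIs might apply the rendering rule above along these lines. This is a sketch with a hypothetical function name; an application could equally choose to reject such data outright.

```python
def render_userinfo(userinfo):
    """Return userinfo for display, dropping anything after the first colon."""
    user, colon, secret = userinfo.partition(":")
    # An empty string after the colon indicates no password, so it may be shown.
    if colon and not secret:
        return userinfo
    return user
```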
|
991
|
+
|
992
|
+
Applications that render a URI for the sake of user feedback, such as
|
993
|
+
in graphical hypertext browsing, should render userinfo in a way that
|
994
|
+
is distinguished from the rest of a URI, when feasible. Such
|
995
|
+
rendering will assist the user in cases where the userinfo has been
|
996
|
+
misleadingly crafted to look like a trusted domain name
|
997
|
+
(Section 7.6).
|
998
|
+
|
999
|
+
3.2.2. Host
|
1000
|
+
|
1001
|
+
The host subcomponent of authority is identified by an IP literal
|
1002
|
+
encapsulated within square brackets, an IPv4 address in dotted-
|
1003
|
+
decimal form, or a registered name. The host subcomponent is case-
|
1004
|
+
insensitive. The presence of a host subcomponent within a URI does
|
1005
|
+
not imply that the scheme requires access to the given host on the
|
1006
|
+
Internet. In many cases, the host syntax is used only for the sake
|
1015
|
+
of reusing the existing registration process created and deployed for
|
1016
|
+
DNS, thus obtaining a globally unique name without the cost of
|
1017
|
+
deploying another registry. However, such use comes with its own
|
1018
|
+
costs: domain name ownership may change over time for reasons not
|
1019
|
+
anticipated by the URI producer. In other cases, the data within the
|
1020
|
+
host component identifies a registered name that has nothing to do
|
1021
|
+
with an Internet host. We use the name "host" for the ABNF rule
|
1022
|
+
because that is its most common purpose, not its only purpose.
|
1023
|
+
|
1024
|
+
host = IP-literal / IPv4address / reg-name
|
1025
|
+
|
1026
|
+
The syntax rule for host is ambiguous because it does not completely
|
1027
|
+
distinguish between an IPv4address and a reg-name. In order to
|
1028
|
+
disambiguate the syntax, we apply the "first-match-wins" algorithm:
|
1029
|
+
If host matches the rule for IPv4address, then it should be
|
1030
|
+
considered an IPv4 address literal and not a reg-name. Although host
|
1031
|
+
is case-insensitive, producers and normalizers should use lowercase
|
1032
|
+
for registered names and hexadecimal addresses for the sake of
|
1033
|
+
uniformity, while only using uppercase letters for percent-encodings.
|
1034
|
+
|
1035
|
+
A host identified by an Internet Protocol literal address, version 6
|
1036
|
+
[RFC3513] or later, is distinguished by enclosing the IP literal
|
1037
|
+
within square brackets ("[" and "]"). This is the only place where
|
1038
|
+
square bracket characters are allowed in the URI syntax. In
|
1039
|
+
anticipation of future, as-yet-undefined IP literal address formats,
|
1040
|
+
an implementation may use an optional version flag to indicate such a
|
1041
|
+
format explicitly rather than rely on heuristic determination.
|
1042
|
+
|
1043
|
+
IP-literal = "[" ( IPv6address / IPvFuture ) "]"
|
1044
|
+
|
1045
|
+
IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
|
1046
|
+
|
1047
|
+
The version flag does not indicate the IP version; rather, it
|
1048
|
+
indicates future versions of the literal format. As such,
|
1049
|
+
implementations must not provide the version flag for the existing
|
1050
|
+
IPv4 and IPv6 literal address forms described below. If a URI
|
1051
|
+
containing an IP-literal that starts with "v" (case-insensitive),
|
1052
|
+
indicating that the version flag is present, is dereferenced by an
|
1053
|
+
application that does not know the meaning of that version flag, then
|
1054
|
+
the application should return an appropriate error for "address
|
1055
|
+
mechanism not supported".
|
1056
|
+
|
1057
|
+
A host identified by an IPv6 literal address is represented inside
|
1058
|
+
the square brackets without a preceding version flag. The ABNF
|
1059
|
+
provided here is a translation of the text definition of an IPv6
|
1060
|
+
literal address provided in [RFC3513]. This syntax does not support
|
1061
|
+
IPv6 scoped addressing zone identifiers.
|
1062
|
+
|
1071
|
+
A 128-bit IPv6 address is divided into eight 16-bit pieces. Each
|
1072
|
+
piece is represented numerically in case-insensitive hexadecimal,
|
1073
|
+
using one to four hexadecimal digits (leading zeroes are permitted).
|
1074
|
+
The eight encoded pieces are given most-significant first, separated
|
1075
|
+
by colon characters. Optionally, the least-significant two pieces
|
1076
|
+
may instead be represented in IPv4 address textual format. A
|
1077
|
+
sequence of one or more consecutive zero-valued 16-bit pieces within
|
1078
|
+
the address may be elided, omitting all their digits and leaving
|
1079
|
+
exactly two consecutive colons in their place to mark the elision.
|
1080
|
+
|
1081
|
+
IPv6address = 6( h16 ":" ) ls32
|
1082
|
+
/ "::" 5( h16 ":" ) ls32
|
1083
|
+
/ [ h16 ] "::" 4( h16 ":" ) ls32
|
1084
|
+
/ [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
|
1085
|
+
/ [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
|
1086
|
+
/ [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
|
1087
|
+
/ [ *4( h16 ":" ) h16 ] "::" ls32
|
1088
|
+
/ [ *5( h16 ":" ) h16 ] "::" h16
|
1089
|
+
/ [ *6( h16 ":" ) h16 ] "::"
|
1090
|
+
|
1091
|
+
ls32 = ( h16 ":" h16 ) / IPv4address
|
1092
|
+
; least-significant 32 bits of address
|
1093
|
+
|
1094
|
+
h16 = 1*4HEXDIG
|
1095
|
+
; 16 bits of address represented in hexadecimal
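Rather than transcribing the IPv6address grammar by hand, an implementation might lean on an existing parser. The Python sketch below uses the standard library's ipaddress module as an approximation; it is not guaranteed to accept exactly the same strings as the ABNF above, and the function name is hypothetical.

```python
import ipaddress

def is_ipv6_literal(host):
    """Approximate check that host is "[" IPv6address "]"."""
    if not (host.startswith("[") and host.endswith("]")):
        return False
    inner = host[1:-1]
    # Zone identifiers ("%...") are not part of this syntax, so reject them
    # before delegating to the standard library parser.
    if "%" in inner:
        return False
    try:
        ipaddress.IPv6Address(inner)
    except ValueError:
        return False
    return True
```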
|
1096
|
+
|
1097
|
+
A host identified by an IPv4 literal address is represented in
|
1098
|
+
dotted-decimal notation (a sequence of four decimal numbers in the
|
1099
|
+
range 0 to 255, separated by "."), as described in [RFC1123] by
|
1100
|
+
reference to [RFC0952]. Note that other forms of dotted notation may
|
1101
|
+
be interpreted on some platforms, as described in Section 7.4, but
|
1102
|
+
only the dotted-decimal form of four octets is allowed by this
|
1103
|
+
grammar.
|
1104
|
+
|
1105
|
+
IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
|
1106
|
+
|
1107
|
+
dec-octet = DIGIT ; 0-9
|
1108
|
+
/ %x31-39 DIGIT ; 10-99
|
1109
|
+
/ "1" 2DIGIT ; 100-199
|
1110
|
+
/ "2" %x30-34 DIGIT ; 200-249
|
1111
|
+
/ "25" %x30-35 ; 250-255
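The dec-octet rule and the "first-match-wins" disambiguation described earlier can be sketched in Python as follows. The function name is hypothetical, and the sketch checks neither the contents of a bracketed IP literal nor the characters of a reg-name.

```python
import re

# dec-octet alternatives, longest first so the regex tries 250-255 before 0-9.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4_RE = re.compile(r"(?:{0}\.){{3}}{0}".format(DEC_OCTET))

def classify_host(host):
    """Classify host as "IP-literal", "IPv4address", or "reg-name"."""
    if host.startswith("[") and host.endswith("]"):
        return "IP-literal"
    # First match wins: anything matching IPv4address is not a reg-name.
    if IPV4_RE.fullmatch(host):
        return "IPv4address"
    return "reg-name"
```

A host such as "256.0.0.1" does not match IPv4address, because 256 is outside the dec-octet range, and therefore falls through to reg-name.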
|
1112
|
+
|
1113
|
+
A host identified by a registered name is a sequence of characters
|
1114
|
+
usually intended for lookup within a locally defined host or service
|
1115
|
+
name registry, though the URI's scheme-specific semantics may require
|
1116
|
+
that a specific registry (or fixed name table) be used instead. The
|
1117
|
+
most common name registry mechanism is the Domain Name System (DNS).
|
1118
|
+
A registered name intended for lookup in the DNS uses the syntax
|
1127
|
+
defined in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123].
|
1128
|
+
Such a name consists of a sequence of domain labels separated by ".",
|
1129
|
+
each domain label starting and ending with an alphanumeric character
|
1130
|
+
and possibly also containing "-" characters. The rightmost domain
|
1131
|
+
label of a fully qualified domain name in DNS may be followed by a
|
1132
|
+
single "." and should be if it is necessary to distinguish between
|
1133
|
+
the complete domain name and some local domain.
|
1134
|
+
|
1135
|
+
reg-name = *( unreserved / pct-encoded / sub-delims )
|
1136
|
+
|
1137
|
+
If the URI scheme defines a default for host, then that default
|
1138
|
+
applies when the host subcomponent is undefined or when the
|
1139
|
+
registered name is empty (zero length). For example, the "file" URI
|
1140
|
+
scheme is defined so that no authority, an empty host, and
|
1141
|
+
"localhost" all mean the end-user's machine, whereas the "http"
|
1142
|
+
scheme considers a missing authority or empty host invalid.
|
1143
|
+
|
1144
|
+
This specification does not mandate a particular registered name
|
1145
|
+
lookup technology and therefore does not restrict the syntax of reg-
|
1146
|
+
name beyond what is necessary for interoperability. Instead, it
|
1147
|
+
delegates the issue of registered name syntax conformance to the
|
1148
|
+
operating system of each application performing URI resolution, and
|
1149
|
+
that operating system decides what it will allow for the purpose of
|
1150
|
+
host identification. A URI resolution implementation might use DNS,
|
1151
|
+
host tables, yellow pages, NetInfo, WINS, or any other system for
|
1152
|
+
lookup of registered names. However, a globally scoped naming
|
1153
|
+
system, such as DNS fully qualified domain names, is necessary for
|
1154
|
+
URIs intended to have global scope. URI producers should use names
|
1155
|
+
that conform to the DNS syntax, even when use of DNS is not
|
1156
|
+
immediately apparent, and should limit these names to no more than
|
1157
|
+
255 characters in length.
|
1158
|
+
|
1159
|
+
The reg-name syntax allows percent-encoded octets in order to
|
1160
|
+
represent non-ASCII registered names in a uniform way that is
|
1161
|
+
independent of the underlying name resolution technology. Non-ASCII
|
1162
|
+
characters must first be encoded according to UTF-8 [STD63], and then
|
1163
|
+
each octet of the corresponding UTF-8 sequence must be percent-
|
1164
|
+
encoded to be represented as URI characters. URI producing
|
1165
|
+
applications must not use percent-encoding in host unless it is used
|
1166
|
+
to represent a UTF-8 character sequence. When a non-ASCII registered
|
1167
|
+
name represents an internationalized domain name intended for
|
1168
|
+
resolution via the DNS, the name must be transformed to the IDNA
|
1169
|
+
encoding [RFC3490] prior to name lookup. URI producers should
|
1170
|
+
provide these registered names in the IDNA encoding, rather than a
|
1171
|
+
percent-encoding, if they wish to maximize interoperability with
|
1172
|
+
legacy URI resolvers.
|
1173
|
+
|
1174
|
+
|
1175
|
+
|
1176
|
+
|
1177
|
+
|
1178
|
+
Berners-Lee, et al. Standards Track [Page 21]
|
1179
|
+
|
1180
|
+
RFC 3986 URI Generic Syntax January 2005
|
1181
|
+
|
1182
|
+
|
1183
|
+
3.2.3. Port
|
1184
|
+
|
1185
|
+
The port subcomponent of authority is designated by an optional port
|
1186
|
+
number in decimal following the host and delimited from it by a
|
1187
|
+
single colon (":") character.
|
1188
|
+
|
1189
|
+
port = *DIGIT
|
1190
|
+
|
1191
|
+
A scheme may define a default port. For example, the "http" scheme
|
1192
|
+
defines a default port of "80", corresponding to its reserved TCP
|
1193
|
+
port number. The type of port designated by the port number (e.g.,
|
1194
|
+
TCP, UDP, SCTP) is defined by the URI scheme. URI producers and
|
1195
|
+
normalizers should omit the port component and its ":" delimiter if
|
1196
|
+
port is empty or if its value would be the same as that of the
|
1197
|
+
scheme's default.
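The port rule above can be sketched in Python. This is a minimal illustration, not a full normalizer: the `DEFAULT_PORTS` table and the function name `drop_default_port` are assumptions made for the example, and real default ports come from each scheme's own specification.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical table for illustration; each scheme's specification defines its default.
DEFAULT_PORTS = {"http": 80, "https": 443}

def drop_default_port(uri: str) -> str:
    """Omit the port and its ":" delimiter when the port is empty or the scheme default."""
    parts = urlsplit(uri)
    host, colon, port = parts.netloc.rpartition(":")
    # No ":" at all, or the last ":" belongs to userinfo or an IPv6 literal.
    if not colon or not re.fullmatch(r"[0-9]*", port):
        return uri
    if port == "" or int(port) == DEFAULT_PORTS.get(parts.scheme):
        return urlunsplit(parts._replace(netloc=host))
    return uri

# drop_default_port("http://example.com:80/index.html") -> "http://example.com/index.html"
# drop_default_port("http://example.com:8080/")          -> unchanged
```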
|
1198
|
+
|
1199
|
+
3.3. Path
|
1200
|
+
|
1201
|
+
The path component contains data, usually organized in hierarchical
|
1202
|
+
form, that, along with data in the non-hierarchical query component
|
1203
|
+
(Section 3.4), serves to identify a resource within the scope of the
|
1204
|
+
URI's scheme and naming authority (if any). The path is terminated
|
1205
|
+
by the first question mark ("?") or number sign ("#") character, or
|
1206
|
+
by the end of the URI.
|
1207
|
+
|
1208
|
+
If a URI contains an authority component, then the path component
|
1209
|
+
must either be empty or begin with a slash ("/") character. If a URI
|
1210
|
+
does not contain an authority component, then the path cannot begin
|
1211
|
+
with two slash characters ("//"). In addition, a URI reference
|
1212
|
+
(Section 4.1) may be a relative-path reference, in which case the
|
1213
|
+
first path segment cannot contain a colon (":") character. The ABNF
|
1214
|
+
requires five separate rules to disambiguate these cases, only one of
|
1215
|
+
which will match the path substring within a given URI reference. We
|
1216
|
+
use the generic term "path component" to describe the URI substring
|
1217
|
+
matched by the parser to one of these rules.
|
1218
|
+
|
1219
|
+
path = path-abempty ; begins with "/" or is empty
|
1220
|
+
/ path-absolute ; begins with "/" but not "//"
|
1221
|
+
/ path-noscheme ; begins with a non-colon segment
|
1222
|
+
/ path-rootless ; begins with a segment
|
1223
|
+
/ path-empty ; zero characters
|
1224
|
+
|
1225
|
+
path-abempty = *( "/" segment )
|
1226
|
+
path-absolute = "/" [ segment-nz *( "/" segment ) ]
|
1227
|
+
path-noscheme = segment-nz-nc *( "/" segment )
|
1228
|
+
path-rootless = segment-nz *( "/" segment )
|
1229
|
+
path-empty = 0<pchar>
|
1230
|
+
|
1239
|
+
segment = *pchar
|
1240
|
+
segment-nz = 1*pchar
|
1241
|
+
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
|
1242
|
+
; non-zero-length segment without any colon ":"
|
1243
|
+
|
1244
|
+
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
|
1245
|
+
|
1246
|
+
A path consists of a sequence of path segments separated by a slash
|
1247
|
+
("/") character. A path is always defined for a URI, though the
|
1248
|
+
defined path may be empty (zero length). Use of the slash character
|
1249
|
+
to indicate hierarchy is only required when a URI will be used as the
|
1250
|
+
context for relative references. For example, the URI
|
1251
|
+
<mailto:fred@example.com> has a path of "fred@example.com", whereas
|
1252
|
+
the URI <foo://info.example.com?fred> has an empty path.
|
1253
|
+
|
1254
|
+
The path segments "." and "..", also known as dot-segments, are
|
1255
|
+
defined for relative reference within the path name hierarchy. They
|
1256
|
+
are intended for use at the beginning of a relative-path reference
|
1257
|
+
(Section 4.2) to indicate relative position within the hierarchical
|
1258
|
+
tree of names. This is similar to their role within some operating
|
1259
|
+
systems' file directory structures to indicate the current directory
|
1260
|
+
and parent directory, respectively. However, unlike in a file
|
1261
|
+
system, these dot-segments are only interpreted within the URI path
|
1262
|
+
hierarchy and are removed as part of the resolution process (Section
|
1263
|
+
5.2).
|
1264
|
+
|
1265
|
+
Aside from dot-segments in hierarchical paths, a path segment is
|
1266
|
+
considered opaque by the generic syntax. URI producing applications
|
1267
|
+
often use the reserved characters allowed in a segment to delimit
|
1268
|
+
scheme-specific or dereference-handler-specific subcomponents. For
|
1269
|
+
example, the semicolon (";") and equals ("=") reserved characters are
|
1270
|
+
often used to delimit parameters and parameter values applicable to
|
1271
|
+
that segment. The comma (",") reserved character is often used for
|
1272
|
+
similar purposes. For example, one URI producer might use a segment
|
1273
|
+
such as "name;v=1.1" to indicate a reference to version 1.1 of
|
1274
|
+
"name", whereas another might use a segment such as "name,1.1" to
|
1275
|
+
indicate the same. Parameter types may be defined by scheme-specific
|
1276
|
+
semantics, but in most cases the syntax of a parameter is specific to
|
1277
|
+
the implementation of the URI's dereferencing algorithm.
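As an illustration of the "name;v=1.1" convention mentioned above, the following Python sketch splits one path segment into a name and its parameters. The function name `split_segment_params` is hypothetical, and the ";" and "=" convention is only one producer's choice; the generic syntax treats the whole segment as opaque.

```python
def split_segment_params(segment: str):
    """Split "name;k=v" into ("name", {"k": "v"}); parameters without "=" map to ""."""
    name, *params = segment.split(";")
    parsed = {}
    for param in params:
        key, _, value = param.partition("=")
        parsed[key] = value
    return name, parsed

# split_segment_params("name;v=1.1") -> ("name", {"v": "1.1"})
# split_segment_params("name,1.1")   -> ("name,1.1", {})   (a comma-based producer needs its own rule)
```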
|
1278
|
+
|
1279
|
+
3.4. Query
|
1280
|
+
|
1281
|
+
The query component contains non-hierarchical data that, along with
|
1282
|
+
data in the path component (Section 3.3), serves to identify a
|
1283
|
+
resource within the scope of the URI's scheme and naming authority
|
1284
|
+
(if any). The query component is indicated by the first question
|
1285
|
+
mark ("?") character and terminated by a number sign ("#") character
|
1286
|
+
or by the end of the URI.
|
1287
|
+
|
1295
|
+
query = *( pchar / "/" / "?" )
|
1296
|
+
|
1297
|
+
The characters slash ("/") and question mark ("?") may represent data
|
1298
|
+
within the query component. Beware that some older, erroneous
|
1299
|
+
implementations may not handle such data correctly when it is used as
|
1300
|
+
the base URI for relative references (Section 5.1), apparently
|
1301
|
+
because they fail to distinguish query data from path data when
|
1302
|
+
looking for hierarchical separators. However, as query components
|
1303
|
+
are often used to carry identifying information in the form of
|
1304
|
+
"key=value" pairs and one frequently used value is a reference to
|
1305
|
+
another URI, it is sometimes better for usability to avoid percent-
|
1306
|
+
encoding those characters.
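A small Python sketch of the delimiting rule: the query runs from the first "?" to the first "#" (or the end), so any later "/" or "?" is query data. The helper name `split_query` is an assumption for this example; it returns `None` for an undefined query and `""` for an empty one.

```python
from typing import Optional

def split_query(uri: str) -> Optional[str]:
    """Return the query component, or None when no "?" delimiter is present."""
    before_fragment, _, _ = uri.partition("#")      # the fragment is never part of the query
    _, question_mark, query = before_fragment.partition("?")
    return query if question_mark else None

# split_query("http://example.com/search?next=http://example.org/a?b=c#top")
#   -> "next=http://example.org/a?b=c"
```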
|
1307
|
+
|
1308
|
+
3.5. Fragment
|
1309
|
+
|
1310
|
+
The fragment identifier component of a URI allows indirect
|
1311
|
+
identification of a secondary resource by reference to a primary
|
1312
|
+
resource and additional identifying information. The identified
|
1313
|
+
secondary resource may be some portion or subset of the primary
|
1314
|
+
resource, some view on representations of the primary resource, or
|
1315
|
+
some other resource defined or described by those representations. A
|
1316
|
+
fragment identifier component is indicated by the presence of a
|
1317
|
+
number sign ("#") character and terminated by the end of the URI.
|
1318
|
+
|
1319
|
+
fragment = *( pchar / "/" / "?" )
|
1320
|
+
|
1321
|
+
The semantics of a fragment identifier are defined by the set of
|
1322
|
+
representations that might result from a retrieval action on the
|
1323
|
+
primary resource. The fragment's format and resolution is therefore
|
1324
|
+
dependent on the media type [RFC2046] of a potentially retrieved
|
1325
|
+
representation, even though such a retrieval is only performed if the
|
1326
|
+
URI is dereferenced. If no such representation exists, then the
|
1327
|
+
semantics of the fragment are considered unknown and are effectively
|
1328
|
+
unconstrained. Fragment identifier semantics are independent of the
|
1329
|
+
URI scheme and thus cannot be redefined by scheme specifications.
|
1330
|
+
|
1331
|
+
Individual media types may define their own restrictions on or
|
1332
|
+
structures within the fragment identifier syntax for specifying
|
1333
|
+
different types of subsets, views, or external references that are
|
1334
|
+
identifiable as secondary resources by that media type. If the
|
1335
|
+
primary resource has multiple representations, as is often the case
|
1336
|
+
for resources whose representation is selected based on attributes of
|
1337
|
+
the retrieval request (a.k.a., content negotiation), then whatever is
|
1338
|
+
identified by the fragment should be consistent across all of those
|
1339
|
+
representations. Each representation should either define the
|
1340
|
+
fragment so that it corresponds to the same secondary resource,
|
1341
|
+
regardless of how it is represented, or should leave the fragment
|
1342
|
+
undefined (i.e., not found).
|
1343
|
+
|
1351
|
+
As with any URI, use of a fragment identifier component does not
|
1352
|
+
imply that a retrieval action will take place. A URI with a fragment
|
1353
|
+
identifier may be used to refer to the secondary resource without any
|
1354
|
+
implication that the primary resource is accessible or will ever be
|
1355
|
+
accessed.
|
1356
|
+
|
1357
|
+
Fragment identifiers have a special role in information retrieval
|
1358
|
+
systems as the primary form of client-side indirect referencing,
|
1359
|
+
allowing an author to specifically identify aspects of an existing
|
1360
|
+
resource that are only indirectly provided by the resource owner. As
|
1361
|
+
such, the fragment identifier is not used in the scheme-specific
|
1362
|
+
processing of a URI; instead, the fragment identifier is separated
|
1363
|
+
from the rest of the URI prior to a dereference, and thus the
|
1364
|
+
identifying information within the fragment itself is dereferenced
|
1365
|
+
solely by the user agent, regardless of the URI scheme. Although
|
1366
|
+
this separate handling is often perceived to be a loss of
|
1367
|
+
information, particularly for accurate redirection of references as
|
1368
|
+
resources move over time, it also serves to prevent information
|
1369
|
+
providers from denying reference authors the right to refer to
|
1370
|
+
information within a resource selectively. Indirect referencing also
|
1371
|
+
provides additional flexibility and extensibility to systems that use
|
1372
|
+
URIs, as new media types are easier to define and deploy than new
|
1373
|
+
schemes of identification.
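The separation of the fragment before a dereference can be sketched as follows. This is an illustrative helper (the name `split_fragment` is not from the specification): the first part is what scheme-specific processing would see, and the second part stays with the user agent.

```python
from typing import Optional, Tuple

def split_fragment(uri: str) -> Tuple[str, Optional[str]]:
    """Split at the first "#"; the fragment is None when no "#" is present."""
    without_fragment, number_sign, fragment = uri.partition("#")
    return without_fragment, (fragment if number_sign else None)

# split_fragment("http://example.com/doc?x=1#sec/2?y")
#   -> ("http://example.com/doc?x=1", "sec/2?y")
```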
|
1374
|
+
|
1375
|
+
The characters slash ("/") and question mark ("?") are allowed to
|
1376
|
+
represent data within the fragment identifier. Beware that some
|
1377
|
+
older, erroneous implementations may not handle this data correctly
|
1378
|
+
when it is used as the base URI for relative references (Section
|
1379
|
+
5.1).
|
1380
|
+
|
1381
|
+
4. Usage
|
1382
|
+
|
1383
|
+
When applications make reference to a URI, they do not always use the
|
1384
|
+
full form of reference defined by the "URI" syntax rule. To save
|
1385
|
+
space and take advantage of hierarchical locality, many Internet
|
1386
|
+
protocol elements and media type formats allow an abbreviation of a
|
1387
|
+
URI, whereas others restrict the syntax to a particular form of URI.
|
1388
|
+
We define the most common forms of reference syntax in this
|
1389
|
+
specification because they impact and depend upon the design of the
|
1390
|
+
generic syntax, requiring a uniform parsing algorithm in order to be
|
1391
|
+
interpreted consistently.
|
1392
|
+
|
1393
|
+
4.1. URI Reference
|
1394
|
+
|
1395
|
+
URI-reference is used to denote the most common usage of a resource
|
1396
|
+
identifier.
|
1397
|
+
|
1398
|
+
URI-reference = URI / relative-ref
|
1399
|
+
|
1407
|
+
A URI-reference is either a URI or a relative reference. If the
|
1408
|
+
URI-reference's prefix does not match the syntax of a scheme followed
|
1409
|
+
by its colon separator, then the URI-reference is a relative
|
1410
|
+
reference.
|
1411
|
+
|
1412
|
+
A URI-reference is typically parsed first into the five URI
|
1413
|
+
components, in order to determine what components are present and
|
1414
|
+
whether the reference is relative. Then, each component is parsed
|
1415
|
+
for its subparts and their validation. The ABNF of URI-reference,
|
1416
|
+
along with the "first-match-wins" disambiguation rule, is sufficient
|
1417
|
+
to define a validating parser for the generic syntax. Readers
|
1418
|
+
familiar with regular expressions should see Appendix B for an
|
1419
|
+
example of a non-validating URI-reference parser that will take any
|
1420
|
+
given string and extract the URI components.
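For readers who want to try the non-validating approach, here is a Python sketch built on the regular expression given in Appendix B of the RFC. The `Components` type and the function name are assumptions for this example; the group numbers (2, 4, 5, 7, 9) follow Appendix B. An undefined component is reported as `None`, while the path is always a string.

```python
import re
from typing import NamedTuple, Optional

# Regular expression from RFC 3986, Appendix B (non-validating).
URI_REFERENCE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

class Components(NamedTuple):
    scheme: Optional[str]
    authority: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]

def split_uri_reference(reference: str) -> Components:
    """Extract the five components; every part of the pattern is optional, so it always matches."""
    m = URI_REFERENCE.match(reference)
    return Components(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))

# split_uri_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related")
#   -> scheme "http", authority "www.ics.uci.edu", path "/pub/ietf/uri/",
#      query None, fragment "Related"
```

A reference is relative exactly when the returned `scheme` is `None`.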
|
1421
|
+
|
1422
|
+
4.2. Relative Reference
|
1423
|
+
|
1424
|
+
A relative reference takes advantage of the hierarchical syntax
|
1425
|
+
(Section 1.2.3) to express a URI reference relative to the name space
|
1426
|
+
of another hierarchical URI.
|
1427
|
+
|
1428
|
+
relative-ref = relative-part [ "?" query ] [ "#" fragment ]
|
1429
|
+
|
1430
|
+
relative-part = "//" authority path-abempty
|
1431
|
+
/ path-absolute
|
1432
|
+
/ path-noscheme
|
1433
|
+
/ path-empty
|
1434
|
+
|
1435
|
+
The URI referred to by a relative reference, also known as the target
|
1436
|
+
URI, is obtained by applying the reference resolution algorithm of
|
1437
|
+
Section 5.
|
1438
|
+
|
1439
|
+
A relative reference that begins with two slash characters is termed
|
1440
|
+
a network-path reference; such references are rarely used. A
|
1441
|
+
relative reference that begins with a single slash character is
|
1442
|
+
termed an absolute-path reference. A relative reference that does
|
1443
|
+
not begin with a slash character is termed a relative-path reference.
|
1444
|
+
|
1445
|
+
A path segment that contains a colon character (e.g., "this:that")
|
1446
|
+
cannot be used as the first segment of a relative-path reference, as
|
1447
|
+
it would be mistaken for a scheme name. Such a segment must be
|
1448
|
+
preceded by a dot-segment (e.g., "./this:that") to make a relative-
|
1449
|
+
path reference.
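The three kinds of relative reference, and the colon rule for the first segment, can be sketched in Python. Both function names are hypothetical; `classify_relative_reference` assumes its argument is already known to be a relative reference (no scheme), and `protect_first_segment` takes a bare path without query or fragment.

```python
def classify_relative_reference(reference: str) -> str:
    """Name the kind of relative reference by its leading slashes."""
    if reference.startswith("//"):
        return "network-path"
    if reference.startswith("/"):
        return "absolute-path"
    return "relative-path"

def protect_first_segment(path: str) -> str:
    """Prefix "./" when the first segment contains ":" so it is not read as a scheme."""
    first_segment = path.split("/", 1)[0]
    return "./" + path if ":" in first_segment else path

# classify_relative_reference("//example.com/x") -> "network-path"
# protect_first_segment("this:that")             -> "./this:that"
```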
|
1450
|
+
|
1463
|
+
4.3. Absolute URI
|
1464
|
+
|
1465
|
+
Some protocol elements allow only the absolute form of a URI without
|
1466
|
+
a fragment identifier. For example, defining a base URI for later
|
1467
|
+
use by relative references calls for an absolute-URI syntax rule that
|
1468
|
+
does not allow a fragment.
|
1469
|
+
|
1470
|
+
absolute-URI = scheme ":" hier-part [ "?" query ]
|
1471
|
+
|
1472
|
+
URI scheme specifications must define their own syntax so that all
|
1473
|
+
strings matching their scheme-specific syntax will also match the
|
1474
|
+
<absolute-URI> grammar. Scheme specifications will not define
|
1475
|
+
fragment identifier syntax or usage, regardless of its applicability
|
1476
|
+
to resources identifiable via that scheme, as fragment identification
|
1477
|
+
is orthogonal to scheme definition. However, scheme specifications
|
1478
|
+
are encouraged to include a wide range of examples, including
|
1479
|
+
examples that show use of the scheme's URIs with fragment identifiers
|
1480
|
+
when such usage is appropriate.
|
1481
|
+
|
1482
|
+
4.4. Same-Document Reference
|
1483
|
+
|
1484
|
+
When a URI reference refers to a URI that is, aside from its fragment
|
1485
|
+
component (if any), identical to the base URI (Section 5.1), that
|
1486
|
+
reference is called a "same-document" reference. The most frequent
|
1487
|
+
examples of same-document references are relative references that are
|
1488
|
+
empty or include only the number sign ("#") separator followed by a
|
1489
|
+
fragment identifier.
|
1490
|
+
|
1491
|
+
When a same-document reference is dereferenced for a retrieval
|
1492
|
+
action, the target of that reference is defined to be within the same
|
1493
|
+
entity (representation, document, or message) as the reference;
|
1494
|
+
therefore, a dereference should not result in a new retrieval action.
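A same-document check can be sketched with Python's standard `urllib.parse` module. This is an approximation under stated assumptions: `urljoin` is the library's own resolver rather than the algorithm of Section 5.2 itself, no normalization is applied before comparing, and the function name `is_same_document` is made up for the example.

```python
from urllib.parse import urldefrag, urljoin

def is_same_document(base_uri: str, reference: str) -> bool:
    """True when the resolved reference equals the base URI apart from any fragment."""
    target, _fragment = urldefrag(urljoin(base_uri, reference))
    base, _ = urldefrag(base_uri)
    return target == base

# is_same_document("http://example.com/doc?x=1", "#top")  -> True
# is_same_document("http://example.com/doc?x=1", "")      -> True
# is_same_document("http://example.com/doc?x=1", "other") -> False
```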
|
1495
|
+
|
1496
|
+
Normalization of the base and target URIs prior to their comparison,
|
1497
|
+
as described in Sections 6.2.2 and 6.2.3, is allowed but rarely
|
1498
|
+
performed in practice. Normalization may increase the set of same-
|
1499
|
+
document references, which may be of benefit to some caching
|
1500
|
+
applications. As such, reference authors should not assume that a
|
1501
|
+
slightly different, though equivalent, reference URI will (or will
|
1502
|
+
not) be interpreted as a same-document reference by any given
|
1503
|
+
application.
|
1504
|
+
|
1505
|
+
4.5. Suffix Reference
|
1506
|
+
|
1507
|
+
The URI syntax is designed for unambiguous reference to resources and
|
1508
|
+
extensibility via the URI scheme. However, as URI identification and
|
1509
|
+
usage have become commonplace, traditional media (television, radio,
|
1510
|
+
newspapers, billboards, etc.) have increasingly used a suffix of the
|
1519
|
+
URI as a reference, consisting of only the authority and path
|
1520
|
+
portions of the URI, such as
|
1521
|
+
|
1522
|
+
www.w3.org/Addressing/
|
1523
|
+
|
1524
|
+
or simply a DNS registered name on its own. Such references are
|
1525
|
+
primarily intended for human interpretation rather than for machines,
|
1526
|
+
with the assumption that context-based heuristics are sufficient to
|
1527
|
+
complete the URI (e.g., most registered names beginning with "www"
|
1528
|
+
are likely to have a URI prefix of "http://"). Although there is no
|
1529
|
+
standard set of heuristics for disambiguating a URI suffix, many
|
1530
|
+
client implementations allow them to be entered by the user and
|
1531
|
+
heuristically resolved.
|
1532
|
+
|
1533
|
+
Although this practice of using suffix references is common, it
|
1534
|
+
should be avoided whenever possible and should never be used in
|
1535
|
+
situations where long-term references are expected. The heuristics
|
1536
|
+
noted above will change over time, particularly when a new URI scheme
|
1537
|
+
becomes popular, and are often incorrect when used out of context.
|
1538
|
+
Furthermore, they can lead to security issues along the lines of
|
1539
|
+
those described in [RFC1535].
|
1540
|
+
|
1541
|
+
As a URI suffix has the same syntax as a relative-path reference, a
|
1542
|
+
suffix reference cannot be used in contexts where a relative
|
1543
|
+
reference is expected. As a result, suffix references are limited to
|
1544
|
+
places where there is no defined base URI, such as dialog boxes and
|
1545
|
+
off-line advertisements.
|
1546
|
+
|
1547
|
+
5. Reference Resolution
|
1548
|
+
|
1549
|
+
This section defines the process of resolving a URI reference within
|
1550
|
+
a context that allows relative references so that the result is a
|
1551
|
+
string matching the <URI> syntax rule of Section 3.
|
1552
|
+
|
1553
|
+
5.1. Establishing a Base URI
|
1554
|
+
|
1555
|
+
The term "relative" implies that a "base URI" exists against which
|
1556
|
+
the relative reference is applied. Aside from fragment-only
|
1557
|
+
references (Section 4.4), relative references are only usable when a
|
1558
|
+
base URI is known. A base URI must be established by the parser
|
1559
|
+
prior to parsing URI references that might be relative. A base URI
|
1560
|
+
must conform to the <absolute-URI> syntax rule (Section 4.3). If the
|
1561
|
+
base URI is obtained from a URI reference, then that reference must
|
1562
|
+
be converted to absolute form and stripped of any fragment component
|
1563
|
+
prior to its use as a base URI.
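A minimal Python sketch of preparing a base URI: strip any fragment and check that a scheme is present. The helper name `to_base_uri` is hypothetical, and the sketch only rejects a reference that is not absolute; converting a relative reference to absolute form would require the resolution algorithm of Section 5.2. The scheme pattern follows the `scheme` rule of Section 3.1.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

def to_base_uri(uri: str) -> str:
    """Return the URI without its fragment, or raise if it has no scheme."""
    without_fragment = uri.partition("#")[0]
    scheme, colon, _rest = without_fragment.partition(":")
    if not colon or not SCHEME.fullmatch(scheme):
        raise ValueError("a base URI must be absolute: %r" % uri)
    return without_fragment

# to_base_uri("http://example.com/a/b?q#frag") -> "http://example.com/a/b?q"
```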
|
1564
|
+
|
1575
|
+
The base URI of a reference can be established in one of four ways,
|
1576
|
+
discussed below in order of precedence. The order of precedence can
|
1577
|
+
be thought of in terms of layers, where the innermost defined base
|
1578
|
+
URI has the highest precedence. This can be visualized graphically
|
1579
|
+
as follows:
|
1580
|
+
|
1581
|
+
.----------------------------------------------------------.
|
1582
|
+
| .----------------------------------------------------. |
|
1583
|
+
| | .----------------------------------------------. | |
|
1584
|
+
| | | .----------------------------------------. | | |
|
1585
|
+
| | | | .----------------------------------. | | | |
|
1586
|
+
| | | | | <relative-reference> | | | | |
|
1587
|
+
| | | | `----------------------------------' | | | |
|
1588
|
+
| | | | (5.1.1) Base URI embedded in content | | | |
|
1589
|
+
| | | `----------------------------------------' | | |
|
1590
|
+
| | | (5.1.2) Base URI of the encapsulating entity | | |
|
1591
|
+
| | | (message, representation, or none) | | |
|
1592
|
+
| | `----------------------------------------------' | |
|
1593
|
+
| | (5.1.3) URI used to retrieve the entity | |
|
1594
|
+
| `----------------------------------------------------' |
|
1595
|
+
| (5.1.4) Default Base URI (application-dependent) |
|
1596
|
+
`----------------------------------------------------------'
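The layered precedence in the diagram can be read as "first defined source wins, from the innermost layer outward". A Python sketch follows; the function and parameter names are assumptions for this example, and how each candidate is obtained is left to the application.

```python
from typing import Optional

def establish_base_uri(
    embedded: Optional[str] = None,       # 5.1.1: base URI embedded in content
    encapsulating: Optional[str] = None,  # 5.1.2: base URI of the encapsulating entity
    retrieval: Optional[str] = None,      # 5.1.3: URI used to retrieve the entity
    default: Optional[str] = None,        # 5.1.4: application-dependent default
) -> str:
    """Return the highest-precedence base URI that is defined."""
    for candidate in (embedded, encapsulating, retrieval, default):
        if candidate is not None:
            return candidate
    raise ValueError("no base URI can be established")

# establish_base_uri(retrieval="http://example.com/a", default="file:///tmp/")
#   -> "http://example.com/a"
```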
|
1597
|
+
|
1598
|
+
5.1.1. Base URI Embedded in Content
|
1599
|
+
|
1600
|
+
Within certain media types, a base URI for relative references can be
|
1601
|
+
embedded within the content itself so that it can be readily obtained
|
1602
|
+
by a parser. This can be useful for descriptive documents, such as
|
1603
|
+
tables of contents, which may be transmitted to others through
|
1604
|
+
protocols other than their usual retrieval context (e.g., email or
|
1605
|
+
USENET news).
|
1606
|
+
|
1607
|
+
It is beyond the scope of this specification to specify how, for each
|
1608
|
+
media type, a base URI can be embedded. The appropriate syntax, when
|
1609
|
+
available, is described by the data format specification associated
|
1610
|
+
with each media type.
|
1611
|
+
|
1612
|
+
5.1.2. Base URI from the Encapsulating Entity
|
1613
|
+
|
1614
|
+
If no base URI is embedded, the base URI is defined by the
|
1615
|
+
representation's retrieval context. For a document that is enclosed
|
1616
|
+
within another entity, such as a message or archive, the retrieval
|
1617
|
+
context is that entity. Thus, the default base URI of a
|
1618
|
+
representation is the base URI of the entity in which the
|
1619
|
+
representation is encapsulated.
|
1620
|
+
|
1631
|
+
A mechanism for embedding a base URI within MIME container types
|
1632
|
+
(e.g., the message and multipart types) is defined by MHTML
|
1633
|
+
[RFC2557]. Protocols that do not use the MIME message header syntax,
|
1634
|
+
but that do allow some form of tagged metadata to be included within
|
1635
|
+
messages, may define their own syntax for defining a base URI as part
|
1636
|
+
of a message.
|
1637
|
+
|
1638
|
+
5.1.3. Base URI from the Retrieval URI
|
1639
|
+
|
1640
|
+
If no base URI is embedded and the representation is not encapsulated
|
1641
|
+
within some other entity, then, if a URI was used to retrieve the
|
1642
|
+
representation, that URI shall be considered the base URI. Note that
|
1643
|
+
if the retrieval was the result of a redirected request, the last URI
|
1644
|
+
used (i.e., the URI that resulted in the actual retrieval of the
|
1645
|
+
representation) is the base URI.
|
1646
|
+
|
1647
|
+
5.1.4. Default Base URI
|
1648
|
+
|
1649
|
+
If none of the conditions described above apply, then the base URI is
|
1650
|
+
defined by the context of the application. As this definition is
|
1651
|
+
necessarily application-dependent, failing to define a base URI by
|
1652
|
+
using one of the other methods may result in the same content being
|
1653
|
+
interpreted differently by different types of applications.
|
1654
|
+
|
1655
|
+
A sender of a representation containing relative references is
|
1656
|
+
responsible for ensuring that a base URI for those references can be
|
1657
|
+
established. Aside from fragment-only references, relative
|
1658
|
+
references can only be used reliably in situations where the base URI
|
1659
|
+
is well defined.
|
1660
|
+
|
1661
|
+
5.2. Relative Resolution
|
1662
|
+
|
1663
|
+
This section describes an algorithm for converting a URI reference
|
1664
|
+
that might be relative to a given base URI into the parsed components
|
1665
|
+
of the reference's target. The components can then be recomposed, as
|
1666
|
+
described in Section 5.3, to form the target URI. This algorithm
|
1667
|
+
provides definitive results that can be used to test the output of
|
1668
|
+
other implementations. Applications may implement relative reference
|
1669
|
+
resolution by using some other algorithm, provided that the results
|
1670
|
+
match what would be given by this one.
|
1671
|
+
|
1687
|
+
5.2.1. Pre-parse the Base URI
|
1688
|
+
|
1689
|
+
The base URI (Base) is established according to the procedure of
|
1690
|
+
Section 5.1 and parsed into the five main components described in
|
1691
|
+
Section 3. Note that only the scheme component is required to be
|
1692
|
+
present in a base URI; the other components may be empty or
|
1693
|
+
undefined. A component is undefined if its associated delimiter does
|
1694
|
+
not appear in the URI reference; the path component is never
|
1695
|
+
undefined, though it may be empty.
|
1696
|
+
|
1697
|
+
Normalization of the base URI, as described in Sections 6.2.2 and
|
1698
|
+
6.2.3, is optional. A URI reference must be transformed to its
|
1699
|
+
target URI before it can be normalized.
|
1700
|
+
|
1701
|
+
5.2.2. Transform References
|
1702
|
+
|
1703
|
+
For each URI reference (R), the following pseudocode describes an
|
1704
|
+
algorithm for transforming R into its target URI (T):
|
1705
|
+
|
1706
|
+
-- The URI reference is parsed into the five URI components
|
1707
|
+
--
|
1708
|
+
(R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R);
|
1709
|
+
|
1710
|
+
-- A non-strict parser may ignore a scheme in the reference
|
1711
|
+
-- if it is identical to the base URI's scheme.
|
1712
|
+
--
|
1713
|
+
if ((not strict) and (R.scheme == Base.scheme)) then
|
1714
|
+
undefine(R.scheme);
|
1715
|
+
endif;
|
1716
|
+
|
1743
|
+
if defined(R.scheme) then
|
1744
|
+
T.scheme = R.scheme;
|
1745
|
+
T.authority = R.authority;
|
1746
|
+
T.path = remove_dot_segments(R.path);
|
1747
|
+
T.query = R.query;
|
1748
|
+
else
|
1749
|
+
if defined(R.authority) then
|
1750
|
+
T.authority = R.authority;
|
1751
|
+
T.path = remove_dot_segments(R.path);
|
1752
|
+
T.query = R.query;
|
1753
|
+
else
|
1754
|
+
if (R.path == "") then
|
1755
|
+
T.path = Base.path;
|
1756
|
+
if defined(R.query) then
|
1757
|
+
T.query = R.query;
|
1758
|
+
else
|
1759
|
+
T.query = Base.query;
|
1760
|
+
endif;
|
1761
|
+
else
|
1762
|
+
if (R.path starts-with "/") then
|
1763
|
+
T.path = remove_dot_segments(R.path);
|
1764
|
+
else
|
1765
|
+
T.path = merge(Base.path, R.path);
|
1766
|
+
T.path = remove_dot_segments(T.path);
|
1767
|
+
endif;
|
1768
|
+
T.query = R.query;
|
1769
|
+
endif;
|
1770
|
+
T.authority = Base.authority;
|
1771
|
+
endif;
|
1772
|
+
T.scheme = Base.scheme;
|
1773
|
+
endif;
|
1774
|
+
|
1775
|
+
T.fragment = R.fragment;
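The pseudocode above translates fairly directly into Python. In this sketch the `Parts` type and the function signature are assumptions: undefined components are represented as `None`, the two helper routines of Sections 5.2.3 and 5.2.4 are passed in as callables, and `merge` also receives the base authority because Section 5.2.3 needs to know whether it is defined.

```python
from typing import Callable, NamedTuple, Optional

class Parts(NamedTuple):
    scheme: Optional[str]
    authority: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]

def transform(
    base: Parts,
    ref: Parts,
    merge: Callable[[Optional[str], str, str], str],
    remove_dot_segments: Callable[[str], str],
    strict: bool = True,
) -> Parts:
    """Transform reference R into target T, following the pseudocode of Section 5.2.2."""
    r_scheme = ref.scheme
    # A non-strict parser may ignore a scheme identical to the base URI's scheme.
    if not strict and r_scheme == base.scheme:
        r_scheme = None

    if r_scheme is not None:
        return Parts(r_scheme, ref.authority,
                     remove_dot_segments(ref.path), ref.query, ref.fragment)
    if ref.authority is not None:
        return Parts(base.scheme, ref.authority,
                     remove_dot_segments(ref.path), ref.query, ref.fragment)
    if ref.path == "":
        path = base.path
        query = ref.query if ref.query is not None else base.query
    else:
        if ref.path.startswith("/"):
            path = remove_dot_segments(ref.path)
        else:
            path = remove_dot_segments(merge(base.authority, base.path, ref.path))
        query = ref.query
    # T.fragment is always taken from the reference.
    return Parts(base.scheme, base.authority, path, query, ref.fragment)
```

With a base of `Parts("http", "a", "/b/c/d;p", "q", None)`, the query-only reference `Parts(None, None, "", "y", None)` yields `Parts("http", "a", "/b/c/d;p", "y", None)` without calling either helper.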
|
1776
|
+
|
1777
|
+
5.2.3. Merge Paths
|
1778
|
+
|
1779
|
+
The pseudocode above refers to a "merge" routine for merging a
|
1780
|
+
relative-path reference with the path of the base URI. This is
|
1781
|
+
accomplished as follows:
|
1782
|
+
|
1783
|
+
o If the base URI has a defined authority component and an empty
|
1784
|
+
path, then return a string consisting of "/" concatenated with the
|
1785
|
+
reference's path; otherwise,
|
1786
|
+
|
1799
|
+
o return a string consisting of the reference's path component
|
1800
|
+
appended to all but the last segment of the base URI's path (i.e.,
|
1801
|
+
excluding any characters after the right-most "/" in the base URI
|
1802
|
+
path, or excluding the entire base URI path if it does not contain
|
1803
|
+
any "/" characters).
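The two merge cases can be sketched in a few lines of Python. The signature is an assumption for this example: the base authority is passed as `None` when undefined, since the first case depends on whether it is defined.

```python
from typing import Optional

def merge(base_authority: Optional[str], base_path: str, ref_path: str) -> str:
    """Merge a relative-path reference with the base URI's path (Section 5.2.3)."""
    if base_authority is not None and base_path == "":
        return "/" + ref_path
    # Keep everything up to and including the right-most "/"; if the base path
    # has no "/", rfind() returns -1 and the whole base path is excluded.
    return base_path[: base_path.rfind("/") + 1] + ref_path

# merge("a", "/b/c/d;p", "g") -> "/b/c/g"
# merge("a", "", "g")         -> "/g"
# merge(None, "b", "g")       -> "g"
```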
|
1804
|
+
|
1805
|
+
5.2.4. Remove Dot Segments
|
1806
|
+
|
1807
|
+
The pseudocode also refers to a "remove_dot_segments" routine for
|
1808
|
+
interpreting and removing the special "." and ".." complete path
|
1809
|
+
segments from a referenced path. This is done after the path is
|
1810
|
+
extracted from a reference, whether or not the path was relative, in
|
1811
|
+
order to remove any invalid or extraneous dot-segments prior to
|
1812
|
+
forming the target URI. Although there are many ways to accomplish
|
1813
|
+
this removal process, we describe a simple method using two string
|
1814
|
+
buffers.
|
1815
|
+
|
1816
|
+
1. The input buffer is initialized with the now-appended path
|
1817
|
+
components and the output buffer is initialized to the empty
|
1818
|
+
string.
|
1819
|
+
|
1820
|
+
2. While the input buffer is not empty, loop as follows:
|
1821
|
+
|
1822
|
+
A. If the input buffer begins with a prefix of "../" or "./",
|
1823
|
+
then remove that prefix from the input buffer; otherwise,
|
1824
|
+
|
1825
|
+
B. if the input buffer begins with a prefix of "/./" or "/.",
|
1826
|
+
where "." is a complete path segment, then replace that
|
1827
|
+
prefix with "/" in the input buffer; otherwise,
|
1828
|
+
|
1829
|
+
C. if the input buffer begins with a prefix of "/../" or "/..",
|
1830
|
+
where ".." is a complete path segment, then replace that
|
1831
|
+
prefix with "/" in the input buffer and remove the last
|
1832
|
+
segment and its preceding "/" (if any) from the output
|
1833
|
+
buffer; otherwise,
|
1834
|
+
|
1835
|
+
D. if the input buffer consists only of "." or "..", then remove
|
1836
|
+
that from the input buffer; otherwise,
|
1837
|
+
|
1838
|
+
E. move the first path segment in the input buffer to the end of
|
1839
|
+
the output buffer, including the initial "/" character (if
|
1840
|
+
any) and any subsequent characters up to, but not including,
|
1841
|
+
the next "/" character or the end of the input buffer.
|
1842
|
+
|
1843
|
+
3. Finally, the output buffer is returned as the result of
|
1844
|
+
remove_dot_segments.
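
The steps above can be sketched in Python as follows. This is one possible transcription of the two-buffer method, with each branch labeled by the step it is meant to mirror; the variable names are illustrative.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." complete path segments using two string buffers."""
    inp, out = path, ""                                   # step 1
    while inp:                                            # step 2
        if inp.startswith("../"):                         # 2A
            inp = inp[3:]
        elif inp.startswith("./"):                        # 2A
            inp = inp[2:]
        elif inp.startswith("/./"):                       # 2B: "/./" becomes "/"
            inp = inp[2:]
        elif inp == "/.":                                 # 2B
            inp = "/"
        elif inp.startswith("/../") or inp == "/..":      # 2C
            inp = "/" + inp[4:]
            # Drop the last output segment and its preceding "/" (if any).
            out = out[: out.rfind("/")] if "/" in out else ""
        elif inp in (".", ".."):                          # 2D
            inp = ""
        else:                                             # 2E
            end = inp.find("/", 1)
            if end == -1:
                end = len(inp)
            out += inp[:end]
            inp = inp[end:]
    return out                                            # step 3


print(remove_dot_segments("/a/b/c/./../../g"))
print(remove_dot_segments("mid/content=5/../6"))
```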
|
1845
|
+
|
1855
|
+
Note that dot-segments are intended for use in URI references to
|
1856
|
+
express an identifier relative to the hierarchy of names in the base
|
1857
|
+
URI. The remove_dot_segments algorithm respects that hierarchy by
|
1858
|
+
removing extra dot-segments rather than treating them as an error or
|
1859
|
+
leaving them to be misinterpreted by dereference implementations.
|
1860
|
+
|
1861
|
+
The following illustrates how the above steps are applied for two
|
1862
|
+
examples of merged paths, showing the state of the two buffers after
|
1863
|
+
each step.
|
1864
|
+
|
1865
|
+
STEP OUTPUT BUFFER INPUT BUFFER
|
1866
|
+
|
1867
|
+
1 : /a/b/c/./../../g
|
1868
|
+
2E: /a /b/c/./../../g
|
1869
|
+
2E: /a/b /c/./../../g
|
1870
|
+
2E: /a/b/c /./../../g
|
1871
|
+
2B: /a/b/c /../../g
|
1872
|
+
2C: /a/b /../g
|
1873
|
+
2C: /a /g
|
1874
|
+
2E: /a/g
|
1875
|
+
|
1876
|
+
STEP OUTPUT BUFFER INPUT BUFFER
|
1877
|
+
|
1878
|
+
1 : mid/content=5/../6
|
1879
|
+
2E: mid /content=5/../6
|
1880
|
+
2E: mid/content=5 /../6
|
1881
|
+
2C: mid /6
|
1882
|
+
2E: mid/6
|
1883
|
+
|
1884
|
+
Some applications may find it more efficient to implement the
|
1885
|
+
remove_dot_segments algorithm by using two segment stacks rather than
|
1886
|
+
strings.
|
1887
|
+
|
1888
|
+
Note: Beware that some older, erroneous implementations will fail
|
1889
|
+
to separate a reference's query component from its path component
|
1890
|
+
prior to merging the base and reference paths, resulting in an
|
1891
|
+
interoperability failure if the query component contains the
|
1892
|
+
strings "/../" or "/./".
|
1893
|
+
|
1911
|
+
5.3. Component Recomposition
|
1912
|
+
|
1913
|
+
Parsed URI components can be recomposed to obtain the corresponding
|
1914
|
+
URI reference string. Using pseudocode, this would be:
|
1915
|
+
|
1916
|
+
result = ""
|
1917
|
+
|
1918
|
+
if defined(scheme) then
|
1919
|
+
append scheme to result;
|
1920
|
+
append ":" to result;
|
1921
|
+
endif;
|
1922
|
+
|
1923
|
+
if defined(authority) then
|
1924
|
+
append "//" to result;
|
1925
|
+
append authority to result;
|
1926
|
+
endif;
|
1927
|
+
|
1928
|
+
append path to result;
|
1929
|
+
|
1930
|
+
if defined(query) then
|
1931
|
+
append "?" to result;
|
1932
|
+
append query to result;
|
1933
|
+
endif;
|
1934
|
+
|
1935
|
+
if defined(fragment) then
|
1936
|
+
append "#" to result;
|
1937
|
+
append fragment to result;
|
1938
|
+
endif;
|
1939
|
+
|
1940
|
+
return result;
|
1941
|
+
|
1942
|
+
Note that we are careful to preserve the distinction between a
|
1943
|
+
component that is undefined, meaning that its separator was not
|
1944
|
+
present in the reference, and a component that is empty, meaning that
|
1945
|
+
the separator was present and was immediately followed by the next
|
1946
|
+
component separator or the end of the reference.
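
The pseudocode above maps almost line for line onto Python. In this sketch, None stands for an undefined component and the empty string for an empty one; that representation is an assumption made for illustration.

```python
from typing import Optional


def recompose(scheme: Optional[str], authority: Optional[str], path: str,
              query: Optional[str], fragment: Optional[str]) -> str:
    """Recompose parsed components into a URI reference string."""
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result


# An empty (but defined) query keeps its "?" delimiter.
print(recompose("http", "example.com", "/", "", None))
```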
|
1947
|
+
|
1948
|
+
5.4. Reference Resolution Examples
|
1949
|
+
|
1950
|
+
Within a representation with a well-defined base URI of
|
1951
|
+
|
1952
|
+
http://a/b/c/d;p?q
|
1953
|
+
|
1954
|
+
a relative reference is transformed to its target URI as follows.
|
1955
|
+
|
1967
|
+
5.4.1. Normal Examples
|
1968
|
+
|
1969
|
+
"g:h" = "g:h"
|
1970
|
+
"g" = "http://a/b/c/g"
|
1971
|
+
"./g" = "http://a/b/c/g"
|
1972
|
+
"g/" = "http://a/b/c/g/"
|
1973
|
+
"/g" = "http://a/g"
|
1974
|
+
"//g" = "http://g"
|
1975
|
+
"?y" = "http://a/b/c/d;p?y"
|
1976
|
+
"g?y" = "http://a/b/c/g?y"
|
1977
|
+
"#s" = "http://a/b/c/d;p?q#s"
|
1978
|
+
"g#s" = "http://a/b/c/g#s"
|
1979
|
+
"g?y#s" = "http://a/b/c/g?y#s"
|
1980
|
+
";x" = "http://a/b/c/;x"
|
1981
|
+
"g;x" = "http://a/b/c/g;x"
|
1982
|
+
"g;x?y#s" = "http://a/b/c/g;x?y#s"
|
1983
|
+
"" = "http://a/b/c/d;p?q"
|
1984
|
+
"." = "http://a/b/c/"
|
1985
|
+
"./" = "http://a/b/c/"
|
1986
|
+
".." = "http://a/b/"
|
1987
|
+
"../" = "http://a/b/"
|
1988
|
+
"../g" = "http://a/b/g"
|
1989
|
+
"../.." = "http://a/"
|
1990
|
+
"../../" = "http://a/"
|
1991
|
+
"../../g" = "http://a/g"
|
1992
|
+
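A few rows of the table above can be spot-checked with an existing resolver. The sketch below uses urljoin from Python's standard urllib.parse module; the choice of rows is arbitrary, and the check assumes a Python version whose urljoin follows these resolution rules.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"

# Reference -> expected target, copied from the normal examples.
EXAMPLES = {
    "g": "http://a/b/c/g",
    "./g": "http://a/b/c/g",
    "/g": "http://a/g",
    "//g": "http://g",
    "?y": "http://a/b/c/d;p?y",
    "g?y#s": "http://a/b/c/g?y#s",
    "../g": "http://a/b/g",
    "../../g": "http://a/g",
}

results = {ref: urljoin(BASE, ref) for ref in EXAMPLES}
for ref, target in results.items():
    print(f"{ref!r:12} -> {target}")
```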
|
1993
|
+
5.4.2. Abnormal Examples
|
1994
|
+
|
1995
|
+
Although the following abnormal examples are unlikely to occur in
|
1996
|
+
normal practice, all URI parsers should be capable of resolving them
|
1997
|
+
consistently. Each example uses the same base as that above.
|
1998
|
+
|
1999
|
+
Parsers must be careful in handling cases where there are more ".."
|
2000
|
+
segments in a relative-path reference than there are hierarchical
|
2001
|
+
levels in the base URI's path. Note that the ".." syntax cannot be
|
2002
|
+
used to change the authority component of a URI.
|
2003
|
+
|
2004
|
+
"../../../g" = "http://a/g"
|
2005
|
+
"../../../../g" = "http://a/g"
|
2006
|
+
|
2023
|
+
Similarly, parsers must remove the dot-segments "." and ".." when
|
2024
|
+
they are complete components of a path, but not when they are only
|
2025
|
+
part of a segment.
|
2026
|
+
|
2027
|
+
"/./g" = "http://a/g"
|
2028
|
+
"/../g" = "http://a/g"
|
2029
|
+
"g." = "http://a/b/c/g."
|
2030
|
+
".g" = "http://a/b/c/.g"
|
2031
|
+
"g.." = "http://a/b/c/g.."
|
2032
|
+
"..g" = "http://a/b/c/..g"
|
2033
|
+
|
2034
|
+
Less likely are cases where the relative reference uses unnecessary
|
2035
|
+
or nonsensical forms of the "." and ".." complete path segments.
|
2036
|
+
|
2037
|
+
"./../g" = "http://a/b/g"
|
2038
|
+
"./g/." = "http://a/b/c/g/"
|
2039
|
+
"g/./h" = "http://a/b/c/g/h"
|
2040
|
+
"g/../h" = "http://a/b/c/h"
|
2041
|
+
"g;x=1/./y" = "http://a/b/c/g;x=1/y"
|
2042
|
+
"g;x=1/../y" = "http://a/b/c/y"
|
2043
|
+
|
2044
|
+
Some applications fail to separate the reference's query and/or
|
2045
|
+
fragment components from the path component before merging it with
|
2046
|
+
the base path and removing dot-segments. This error is rarely
|
2047
|
+
noticed, as typical usage of a fragment never includes the hierarchy
|
2048
|
+
("/") character and the query component is not normally used within
|
2049
|
+
relative references.
|
2050
|
+
|
2051
|
+
"g?y/./x" = "http://a/b/c/g?y/./x"
|
2052
|
+
"g?y/../x" = "http://a/b/c/g?y/../x"
|
2053
|
+
"g#s/./x" = "http://a/b/c/g#s/./x"
|
2054
|
+
"g#s/../x" = "http://a/b/c/g#s/../x"
|
2055
|
+
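The four examples above come out right only if the query and fragment are split off before any path processing. As a small illustration, assuming Python's standard urlsplit and urljoin (the helper name split_reference is hypothetical):

```python
from urllib.parse import urljoin, urlsplit

BASE = "http://a/b/c/d;p?q"


def split_reference(ref: str):
    """Return (path, query, fragment) so that only the path is merged."""
    parts = urlsplit(ref)
    return parts.path, parts.query, parts.fragment


# The "/../" belongs to the query, so the path to merge is just "g".
print(split_reference("g?y/../x"))
target = urljoin(BASE, "g?y/../x")
print(target)
```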
|
2056
|
+
Some parsers allow the scheme name to be present in a relative
|
2057
|
+
reference if it is the same as the base URI scheme. This is
|
2058
|
+
considered to be a loophole in prior specifications of partial URI
|
2059
|
+
[RFC1630]. Its use should be avoided but is allowed for backward
|
2060
|
+
compatibility.
|
2061
|
+
|
2062
|
+
"http:g" = "http:g" ; for strict parsers
|
2063
|
+
/ "http://a/b/c/g" ; for backward compatibility
|
2064
|
+
|
2079
|
+
6. Normalization and Comparison
|
2080
|
+
|
2081
|
+
One of the most common operations on URIs is simple comparison:
|
2082
|
+
determining whether two URIs are equivalent without using the URIs to
|
2083
|
+
access their respective resource(s). A comparison is performed every
|
2084
|
+
time a response cache is accessed, a browser checks its history to
|
2085
|
+
color a link, or an XML parser processes tags within a namespace.
|
2086
|
+
Extensive normalization prior to comparison of URIs is often used by
|
2087
|
+
spiders and indexing engines to prune a search space or to reduce
|
2088
|
+
duplication of request actions and response storage.
|
2089
|
+
|
2090
|
+
URI comparison is performed for some particular purpose. Protocols
|
2091
|
+
or implementations that compare URIs for different purposes will
|
2092
|
+
often be subject to differing design trade-offs with regard to how
|
2093
|
+
much effort should be spent in reducing aliased identifiers. This
|
2094
|
+
section describes various methods that may be used to compare URIs,
|
2095
|
+
the trade-offs between them, and the types of applications that might
|
2096
|
+
use them.
|
2097
|
+
|
2098
|
+
6.1. Equivalence
|
2099
|
+
|
2100
|
+
Because URIs exist to identify resources, presumably they should be
|
2101
|
+
considered equivalent when they identify the same resource. However,
|
2102
|
+
this definition of equivalence is not of much practical use, as there
|
2103
|
+
is no way for an implementation to compare two resources unless it
|
2104
|
+
has full knowledge or control of them. For this reason,
|
2105
|
+
determination of equivalence or difference of URIs is based on string
|
2106
|
+
comparison, perhaps augmented by reference to additional rules
|
2107
|
+
provided by URI scheme definitions. We use the terms "different" and
|
2108
|
+
"equivalent" to describe the possible outcomes of such comparisons,
|
2109
|
+
but there are many application-dependent versions of equivalence.
|
2110
|
+
|
2111
|
+
Even though it is possible to determine that two URIs are equivalent,
|
2112
|
+
URI comparison is not sufficient to determine whether two URIs
|
2113
|
+
identify different resources. For example, an owner of two different
|
2114
|
+
domain names could decide to serve the same resource from both,
|
2115
|
+
resulting in two different URIs. Therefore, comparison methods are
|
2116
|
+
designed to minimize false negatives while strictly avoiding false
|
2117
|
+
positives.
|
2118
|
+
|
2119
|
+
In testing for equivalence, applications should not directly compare
|
2120
|
+
relative references; the references should be converted to their
|
2121
|
+
respective target URIs before comparison. When URIs are compared to
|
2122
|
+
select (or avoid) a network action, such as retrieval of a
|
2123
|
+
representation, fragment components (if any) should be excluded from
|
2124
|
+
the comparison.
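
One way to follow both recommendations is to resolve each reference against the base first and drop the fragment before comparing. The sketch below assumes Python's urllib.parse; the function name same_retrieval_target is hypothetical, and the final step is plain string comparison with no further normalization.

```python
from urllib.parse import urldefrag, urljoin


def same_retrieval_target(base: str, ref_a: str, ref_b: str) -> bool:
    """Compare two references for the purpose of selecting a network action."""
    # Convert each reference to its target URI, then exclude the fragment.
    target_a = urldefrag(urljoin(base, ref_a)).url
    target_b = urldefrag(urljoin(base, ref_b)).url
    return target_a == target_b


print(same_retrieval_target("http://a/b/c/d;p?q", "g#s", "./g#t"))
```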
|
2125
|
+
|
2135
|
+
6.2. Comparison Ladder
|
2136
|
+
|
2137
|
+
A variety of methods are used in practice to test URI equivalence.
|
2138
|
+
These methods fall into a range, distinguished by the amount of
|
2139
|
+
processing required and the degree to which the probability of false
|
2140
|
+
negatives is reduced. As noted above, false negatives cannot be
|
2141
|
+
eliminated. In practice, their probability can be reduced, but this
|
2142
|
+
reduction requires more processing and is not cost-effective for all
|
2143
|
+
applications.
|
2144
|
+
|
2145
|
+
If this range of comparison practices is considered as a ladder, the
|
2146
|
+
following discussion will climb the ladder, starting with practices
|
2147
|
+
that are cheap but have a relatively higher chance of producing false
|
2148
|
+
negatives, and proceeding to those that have higher computational
|
2149
|
+
cost and lower risk of false negatives.
|
2150
|
+
|
2151
|
+
6.2.1. Simple String Comparison
|
2152
|
+
|
2153
|
+
If two URIs, when considered as character strings, are identical,
|
2154
|
+
then it is safe to conclude that they are equivalent. This type of
|
2155
|
+
equivalence test has very low computational cost and is in wide use
|
2156
|
+
in a variety of applications, particularly in the domain of parsing.
|
2157
|
+
|
2158
|
+
Testing strings for equivalence requires some basic precautions.
|
2159
|
+
This procedure is often referred to as "bit-for-bit" or
|
2160
|
+
"byte-for-byte" comparison, which is potentially misleading. Testing
|
2161
|
+
strings for equality is normally based on pair comparison of the
|
2162
|
+
characters that make up the strings, starting from the first and
|
2163
|
+
proceeding until both strings are exhausted and all characters are
|
2164
|
+
found to be equal, until a pair of characters compares unequal, or
|
2165
|
+
until one of the strings is exhausted before the other.
|
2166
|
+
|
2167
|
+
This character comparison requires that each pair of characters be
|
2168
|
+
put in comparable form. For example, should one URI be stored in a
|
2169
|
+
byte array in EBCDIC encoding and the second in a Java String object
|
2170
|
+
(UTF-16), bit-for-bit comparisons applied naively will produce
|
2171
|
+
errors. It is better to speak of equality on a character-for-
|
2172
|
+
character basis rather than on a byte-for-byte or bit-for-bit basis.
|
2173
|
+
In practical terms, character-by-character comparisons should be done
|
2174
|
+
codepoint-by-codepoint after conversion to a common character
|
2175
|
+
encoding.
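
To illustrate the point, the following Python sketch stores the same URI once in an EBCDIC code page and once in UTF-16. The codec names "cp037" and "utf-16" are those of Python's standard codecs, and the URI is an arbitrary example.

```python
uri = "http://example.com/a"

ebcdic_bytes = uri.encode("cp037")    # an EBCDIC code page
utf16_bytes = uri.encode("utf-16")    # UTF-16 with a byte-order mark

# Comparing the stored bytes directly reports a difference.
naive_equal = ebcdic_bytes == utf16_bytes

# Decoding both to a common form first allows a codepoint-by-codepoint
# comparison, which is what Python's str equality performs.
decoded_equal = ebcdic_bytes.decode("cp037") == utf16_bytes.decode("utf-16")

print(naive_equal, decoded_equal)
```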
|
2176
|
+
|
2177
|
+
False negatives are caused by the production and use of URI aliases.
|
2178
|
+
Unnecessary aliases can be reduced, regardless of the comparison
|
2179
|
+
method, by consistently providing URI references in an already-
|
2180
|
+
normalized form (i.e., a form identical to what would be produced
|
2181
|
+
after normalization is applied, as described below).
|
2182
|
+
|
2191
|
+
Protocols and data formats often limit some URI comparisons to simple
|
2192
|
+
string comparison, based on the theory that people and
|
2193
|
+
implementations will, in their own best interest, be consistent in
|
2194
|
+
providing URI references, or at least consistent enough to negate any
|
2195
|
+
efficiency that might be obtained from further normalization.
|
2196
|
+
|
2197
|
+
6.2.2. Syntax-Based Normalization
|
2198
|
+
|
2199
|
+
Implementations may use logic based on the definitions provided by
|
2200
|
+
this specification to reduce the probability of false negatives.
|
2201
|
+
This processing is moderately higher in cost than character-for-
|
2202
|
+
character string comparison. For example, an application using this
|
2203
|
+
approach could reasonably consider the following two URIs equivalent:
|
2204
|
+
|
2205
|
+
example://a/b/c/%7Bfoo%7D
|
2206
|
+
eXAMPLE://a/./b/../b/%63/%7bfoo%7d
|
2207
|
+
|
2208
|
+
Web user agents, such as browsers, typically apply this type of URI
|
2209
|
+
normalization when determining whether a cached response is
|
2210
|
+
available. Syntax-based normalization includes such techniques as
|
2211
|
+
case normalization, percent-encoding normalization, and removal of
|
2212
|
+
dot-segments.
|
2213
|
+
|
2214
|
+
6.2.2.1. Case Normalization
|
2215
|
+
|
2216
|
+
For all URIs, the hexadecimal digits within a percent-encoding
|
2217
|
+
triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
|
2218
|
+
should be normalized to use uppercase letters for the digits A-F.
|
2219
|
+
|
2220
|
+
When a URI uses components of the generic syntax, the component
|
2221
|
+
syntax equivalence rules always apply; namely, that the scheme and
|
2222
|
+
host are case-insensitive and therefore should be normalized to
|
2223
|
+
lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is
|
2224
|
+
equivalent to <http://www.example.com/>. The other generic syntax
|
2225
|
+
components are assumed to be case-sensitive unless specifically
|
2226
|
+
defined otherwise by the scheme (see Section 6.2.3).
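
A minimal sketch of case normalization in Python follows. It assumes a simplified form of the component-splitting regular expression and leaves the userinfo, path, query, and fragment untouched apart from the hexadecimal digits of percent-encodings; the function name is illustrative.

```python
import re

_PCT = re.compile(r"%[0-9A-Fa-f]{2}")
_SPLIT = re.compile(r"^(?:([^:/?#]+):)?(?://([^/?#]*))?(.*)$")


def normalize_case(uri: str) -> str:
    """Lowercase the scheme and host; uppercase percent-encoding hex digits."""
    scheme, authority, rest = _SPLIT.match(uri).groups()
    out = ""
    if scheme is not None:
        out += scheme.lower() + ":"
    if authority is not None:
        # Only the host (and port) part is lowercased, not the userinfo.
        userinfo, at, hostport = authority.rpartition("@")
        out += "//" + userinfo + at + hostport.lower()
    out += rest
    # Done last so that triplets inside the host end up uppercase as well.
    return _PCT.sub(lambda m: m.group(0).upper(), out)


print(normalize_case("HTTP://www.EXAMPLE.com/"))
```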
|
2227
|
+
|
2228
|
+
6.2.2.2. Percent-Encoding Normalization
|
2229
|
+
|
2230
|
+
The percent-encoding mechanism (Section 2.1) is a frequent source of
|
2231
|
+
variance among otherwise identical URIs. In addition to the case
|
2232
|
+
normalization issue noted above, some URI producers percent-encode
|
2233
|
+
octets that do not require percent-encoding, resulting in URIs that
|
2234
|
+
are equivalent to their non-encoded counterparts. These URIs should
|
2235
|
+
be normalized by decoding any percent-encoded octet that corresponds
|
2236
|
+
to an unreserved character, as described in Section 2.3.
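
As a sketch of this rule in Python: decode a percent-encoded octet only when it corresponds to an unreserved character (ALPHA, DIGIT, "-", ".", "_", "~"), and otherwise keep the triplet with uppercase hexadecimal digits. The function name is illustrative.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(uri: str) -> str:
    """Decode unreserved percent-encoded octets; uppercase the others."""
    def fix(match: "re.Match[str]") -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return match.group(0).upper()

    return _PCT.sub(fix, uri)


print(normalize_percent_encoding("example://a/b/%63/%7bfoo%7d"))
```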
|
2237
|
+
|
2247
|
+
6.2.2.3. Path Segment Normalization
|
2248
|
+
|
2249
|
+
The complete path segments "." and ".." are intended only for use
|
2250
|
+
within relative references (Section 4.1) and are removed as part of
|
2251
|
+
the reference resolution process (Section 5.2). However, some
|
2252
|
+
deployed implementations incorrectly assume that reference resolution
|
2253
|
+
is not necessary when the reference is already a URI and thus fail to
|
2254
|
+
remove dot-segments when they occur in non-relative paths. URI
|
2255
|
+
normalizers should remove dot-segments by applying the
|
2256
|
+
remove_dot_segments algorithm to the path, as described in
|
2257
|
+
Section 5.2.4.
|
2258
|
+
|
2259
|
+
6.2.3. Scheme-Based Normalization
|
2260
|
+
|
2261
|
+
The syntax and semantics of URIs vary from scheme to scheme, as
|
2262
|
+
described by the defining specification for each scheme.
|
2263
|
+
Implementations may use scheme-specific rules, at further processing
|
2264
|
+
cost, to reduce the probability of false negatives. For example,
|
2265
|
+
because the "http" scheme makes use of an authority component, has a
|
2266
|
+
default port of "80", and defines an empty path to be equivalent to
|
2267
|
+
"/", the following four URIs are equivalent:
|
2268
|
+
|
2269
|
+
http://example.com
|
2270
|
+
http://example.com/
|
2271
|
+
http://example.com:/
|
2272
|
+
http://example.com:80/
|
2273
|
+
|
2274
|
+
In general, a URI that uses the generic syntax for authority with an
|
2275
|
+
empty path should be normalized to a path of "/". Likewise, an
|
2276
|
+
explicit ":port", for which the port is empty or the default for the
|
2277
|
+
scheme, is equivalent to one where the port and its ":" delimiter are
|
2278
|
+
elided and thus should be removed by scheme-based normalization. For
|
2279
|
+
example, the second URI above is the normal form for the "http"
|
2280
|
+
scheme.
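
The "http" rules just described can be sketched in Python as follows. The table of default ports is an assumption limited to the scheme discussed here, the function name is hypothetical, and URIs for other schemes or without an authority are returned unchanged.

```python
import re

DEFAULT_PORTS = {"http": "80"}  # assumption: only the scheme discussed above
_SPLIT = re.compile(r"^([^:/?#]+)://([^/?#]*)([^?#]*)(.*)$")


def normalize_scheme_based(uri: str) -> str:
    """Drop an empty or default port and replace an empty path with "/"."""
    match = _SPLIT.match(uri)
    if match is None:
        return uri
    scheme, authority, path, rest = match.groups()
    scheme = scheme.lower()
    if scheme not in DEFAULT_PORTS:
        return uri
    host, colon, port = authority.rpartition(":")
    if colon and port in ("", DEFAULT_PORTS[scheme]):
        authority = host          # elide the port and its ":" delimiter
    if path == "":
        path = "/"
    return f"{scheme}://{authority}{path}{rest}"


for uri in ("http://example.com", "http://example.com/",
            "http://example.com:/", "http://example.com:80/"):
    print(normalize_scheme_based(uri))
```

The query delimiter is left alone, so a URI ending in "?" keeps it.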
|
2281
|
+
|
2282
|
+
Another case where normalization varies by scheme is in the handling
|
2283
|
+
of an empty authority component or empty host subcomponent. For many
|
2284
|
+
scheme specifications, an empty authority or host is considered an
|
2285
|
+
error; for others, it is considered equivalent to "localhost" or the
|
2286
|
+
end-user's host. When a scheme defines a default for authority and a
|
2287
|
+
URI reference to that default is desired, the reference should be
|
2288
|
+
normalized to an empty authority for the sake of uniformity, brevity,
|
2289
|
+
and internationalization. If, however, either the userinfo or port
|
2290
|
+
subcomponents are non-empty, then the host should be given explicitly
|
2291
|
+
even if it matches the default.
|
2292
|
+
|
2293
|
+
Normalization should not remove delimiters when their associated
|
2294
|
+
component is empty unless licensed to do so by the scheme
|
2303
|
+
specification. For example, the URI "http://example.com/?" cannot be
|
2304
|
+
assumed to be equivalent to any of the examples above. Likewise, the
|
2305
|
+
presence or absence of delimiters within a userinfo subcomponent is
|
2306
|
+
usually significant to its interpretation. The fragment component is
|
2307
|
+
not subject to any scheme-based normalization; thus, two URIs that
|
2308
|
+
differ only by the suffix "#" are considered different regardless of
|
2309
|
+
the scheme.
|
2310
|
+
|
2311
|
+
Some schemes define additional subcomponents that consist of case-
|
2312
|
+
insensitive data, giving an implicit license to normalizers to
|
2313
|
+
convert this data to a common case (e.g., all lowercase). For
|
2314
|
+
example, URI schemes that define a subcomponent of path to contain an
|
2315
|
+
Internet hostname, such as the "mailto" URI scheme, cause that
|
2316
|
+
subcomponent to be case-insensitive and thus subject to case
|
2317
|
+
normalization (e.g., "mailto:Joe@Example.COM" is equivalent to
|
2318
|
+
"mailto:Joe@example.com", even though the generic syntax considers
|
2319
|
+
the path component to be case-sensitive).
|
2320
|
+
|
2321
|
+
Other scheme-specific normalizations are possible.
|
2322
|
+
|
2323
|
+
6.2.4. Protocol-Based Normalization
|
2324
|
+
|
2325
|
+
Substantial effort to reduce the incidence of false negatives is
|
2326
|
+
often cost-effective for web spiders. Therefore, they implement even
|
2327
|
+
more aggressive techniques in URI comparison. For example, if they
|
2328
|
+
observe that a URI such as
|
2329
|
+
|
2330
|
+
http://example.com/data
|
2331
|
+
|
2332
|
+
redirects to a URI differing only in the trailing slash
|
2333
|
+
|
2334
|
+
http://example.com/data/
|
2335
|
+
|
2336
|
+
they will likely regard the two as equivalent in the future. This
|
2337
|
+
kind of technique is only appropriate when equivalence is clearly
|
2338
|
+
indicated by both the result of accessing the resources and the
|
2339
|
+
common conventions of their scheme's dereference algorithm (in this
|
2340
|
+
case, use of redirection by HTTP origin servers to avoid problems
|
2341
|
+
with relative references).
|
2342
|
+
|
2359
|
+
7. Security Considerations
|
2360
|
+
|
2361
|
+
A URI does not in itself pose a security threat. However, as URIs
|
2362
|
+
are often used to provide a compact set of instructions for access to
|
2363
|
+
network resources, care must be taken to properly interpret the data
|
2364
|
+
within a URI, to prevent that data from causing unintended access,
|
2365
|
+
and to avoid including data that should not be revealed in plain
|
2366
|
+
text.
|
2367
|
+
|
2368
|
+
7.1. Reliability and Consistency
|
2369
|
+
|
2370
|
+
There is no guarantee that once a URI has been used to retrieve
|
2371
|
+
information, the same information will be retrievable by that URI in
|
2372
|
+
the future. Nor is there any guarantee that the information
|
2373
|
+
retrievable via that URI in the future will be observably similar to
|
2374
|
+
that retrieved in the past. The URI syntax does not constrain how a
|
2375
|
+
given scheme or authority apportions its namespace or maintains it
|
2376
|
+
over time. Such guarantees can only be obtained from the person(s)
|
2377
|
+
controlling that namespace and the resource in question. A specific
|
2378
|
+
URI scheme may define additional semantics, such as name persistence,
|
2379
|
+
if those semantics are required of all naming authorities for that
|
2380
|
+
scheme.
|
2381
|
+
|
2382
|
+
7.2. Malicious Construction
|
2383
|
+
|
2384
|
+
It is sometimes possible to construct a URI so that an attempt to
|
2385
|
+
perform a seemingly harmless, idempotent operation, such as the
|
2386
|
+
retrieval of a representation, will in fact cause a possibly damaging
|
2387
|
+
remote operation. The unsafe URI is typically constructed by
|
2388
|
+
specifying a port number other than that reserved for the network
|
2389
|
+
protocol in question. The client unwittingly contacts a site running
|
2390
|
+
a different protocol service, and data within the URI contains
|
2391
|
+
instructions that, when interpreted according to this other protocol,
|
2392
|
+
cause an unexpected operation. A frequent example of such abuse has
|
2393
|
+
been the use of a protocol-based scheme with a port component of
|
2394
|
+
"25", thereby fooling user agent software into sending an unintended
|
2395
|
+
or impersonating message via an SMTP server.
|
2396
|
+
|
2397
|
+
Applications should prevent dereference of a URI that specifies a TCP
|
2398
|
+
port number within the "well-known port" range (0 - 1023) unless the
|
2399
|
+
protocol being used to dereference that URI is compatible with the
|
2400
|
+
protocol expected on that well-known port. Although IANA maintains a
|
2401
|
+
registry of well-known ports, applications should make such
|
2402
|
+
restrictions user-configurable to avoid preventing the deployment of
|
2403
|
+
new services.
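
A rough sketch of such a check in Python is shown below. It simplifies "compatible with the protocol expected on that port" to "equal to the scheme's own default port", uses a deliberately tiny illustrative port table, and models the user-configurable part as an allowed_ports argument; all of these names are assumptions.

```python
from urllib.parse import urlsplit

# Assumption: a small illustrative table, not a copy of the IANA registry.
EXPECTED_PORT = {"http": 80, "ftp": 21}


def may_dereference(uri: str, allowed_ports=frozenset()) -> bool:
    """Refuse well-known ports that do not match the scheme's protocol."""
    parts = urlsplit(uri)
    port = parts.port
    if port is None or port > 1023:
        return True
    if port == EXPECTED_PORT.get(parts.scheme):
        return True
    # User-configurable exceptions, so that new services can be deployed.
    return port in allowed_ports


print(may_dereference("http://example.com:25/"))
```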
|
2404
|
+
|
2415
|
+
When a URI contains percent-encoded octets that match the delimiters
|
2416
|
+
for a given resolution or dereference protocol (for example, CR and
|
2417
|
+
LF characters for the TELNET protocol), these percent-encodings must
|
2418
|
+
not be decoded before transmission across that protocol. Transfer of
|
2419
|
+
the percent-encoding, which might violate the protocol, is less
|
2420
|
+
harmful than allowing decoded octets to be interpreted as additional
|
2421
|
+
operations or parameters, perhaps triggering an unexpected and
|
2422
|
+
possibly harmful remote operation.
|
2423
|
+
|
2424
|
+
7.3. Back-End Transcoding
|
2425
|
+
|
2426
|
+
When a URI is dereferenced, the data within it is often parsed by
|
2427
|
+
both the user agent and one or more servers. In HTTP, for example, a
|
2428
|
+
typical user agent will parse a URI into its five major components,
|
2429
|
+
access the authority's server, and send it the data within the
|
2430
|
+
authority, path, and query components. A typical server will take
|
2431
|
+
that information, parse the path into segments and the query into
|
2432
|
+
key/value pairs, and then invoke implementation-specific handlers to
|
2433
|
+
respond to the request. As a result, a common security concern for
|
2434
|
+
server implementations that handle a URI, either as a whole or split
|
2435
|
+
into separate components, is proper interpretation of the octet data
|
2436
|
+
represented by the characters and percent-encodings within that URI.
|
2437
|
+
|
2438
|
+
Percent-encoded octets must be decoded at some point during the
|
2439
|
+
dereference process. Applications must split the URI into its
|
2440
|
+
components and subcomponents prior to decoding the octets, as
|
2441
|
+
otherwise the decoded octets might be mistaken for delimiters.
|
2442
|
+
Security checks of the data within a URI should be applied after
|
2443
|
+
decoding the octets. Note, however, that the "%00" percent-encoding
|
2444
|
+
(NUL) may require special handling and should be rejected if the
|
2445
|
+
application is not expecting to receive raw data within a component.
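
The ordering requirement can be illustrated with a short Python sketch: split the path on its delimiter first and decode each segment afterwards. The function name is hypothetical, and rejecting "%00" outright is one possible policy for an application that does not expect raw data.

```python
from typing import List
from urllib.parse import unquote


def decode_path_segments(raw_path: str) -> List[str]:
    """Split a path into segments first, then percent-decode each segment."""
    decoded = []
    for segment in raw_path.split("/"):
        if "%00" in segment:
            raise ValueError("NUL octet not expected in a path segment")
        decoded.append(unquote(segment))
    return decoded


# Correct order: the encoded "/" stays inside one segment.
print(decode_path_segments("/a%2Fb/c"))
# Wrong order: decoding first turns the data octet into a delimiter.
print(unquote("/a%2Fb/c").split("/"))
```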
|
2446
|
+
|
2447
|
+
Special care should be taken when the URI path interpretation process
|
2448
|
+
involves the use of a back-end file system or related system
|
2449
|
+
functions. File systems typically assign an operational meaning to
|
2450
|
+
special characters, such as the "/", "\", ":", "[", and "]"
|
2451
|
+
characters, and to special device names like ".", "..", "...", "aux",
|
2452
|
+
"lpt", etc. In some cases, merely testing for the existence of such
|
2453
|
+
a name will cause the operating system to pause or invoke unrelated
|
2454
|
+
system calls, leading to significant security concerns regarding
|
2455
|
+
denial of service and unintended data transfer. It would be
|
2456
|
+
impossible for this specification to list all such significant
|
2457
|
+
characters and device names. Implementers should research the
|
2458
|
+
reserved names and characters for the types of storage device that
|
2459
|
+
may be attached to their applications and restrict the use of data
|
2460
|
+
obtained from URI components accordingly.
|
2461
|
+
|
2471
|
+
7.4. Rare IP Address Formats
|
2472
|
+
|
2473
|
+
Although the URI syntax for IPv4address only allows the common
|
2474
|
+
dotted-decimal form of IPv4 address literal, many implementations
|
2475
|
+
that process URIs make use of platform-dependent system routines,
|
2476
|
+
such as gethostbyname() and inet_aton(), to translate the string
|
2477
|
+
literal to an actual IP address. Unfortunately, such system routines
|
2478
|
+
often allow and process a much larger set of formats than those
|
2479
|
+
described in Section 3.2.2.
|
2480
|
+
|
2481
|
+
For example, many implementations allow dotted forms of three
|
2482
|
+
numbers, wherein the last part is interpreted as a 16-bit quantity
|
2483
|
+
and placed in the right-most two bytes of the network address (e.g.,
|
2484
|
+
a Class B network). Likewise, a dotted form of two numbers means
|
2485
|
+
that the last part is interpreted as a 24-bit quantity and placed in
|
2486
|
+
the right-most three bytes of the network address (Class A), and a
|
2487
|
+
single number (without dots) is interpreted as a 32-bit quantity and
|
2488
|
+
stored directly in the network address. Adding further to the
|
2489
|
+
confusion, some implementations allow each dotted part to be
|
2490
|
+
interpreted as decimal, octal, or hexadecimal, as specified in the C
|
2491
|
+
language (i.e., a leading 0x or 0X implies hexadecimal; a leading 0
|
2492
|
+
implies octal; otherwise, the number is interpreted as decimal).
|
2493
|
+
|
2494
|
+
These additional IP address formats are not allowed in the URI syntax
|
2495
|
+
due to differences between platform implementations. However, they
|
2496
|
+
can become a security concern if an application attempts to filter
|
2497
|
+
access to resources based on the IP address in string literal format.
|
2498
|
+
If this filtering is performed, literals should be converted to
|
2499
|
+
numeric form and filtered based on the numeric value, and not on a
|
2500
|
+
prefix or suffix of the string form.
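
As an illustration of filtering on the numeric value, the Python sketch below includes a hypothetical helper, parse_legacy_ipv4, that mimics the lenient forms described above (it is not a wrapper around any system routine) and then tests membership in a network using the standard ipaddress module. The blocked network is an arbitrary example.

```python
import ipaddress


def parse_legacy_ipv4(text: str) -> ipaddress.IPv4Address:
    """Parse 1 to 4 dotted parts, each decimal, octal (0...), or hex (0x...)."""
    def number(part: str) -> int:
        if part[:2].lower() == "0x":
            return int(part[2:], 16)
        if len(part) > 1 and part.startswith("0"):
            return int(part, 8)
        return int(part, 10)

    parts = [number(p) for p in text.split(".")]
    if not 1 <= len(parts) <= 4:
        raise ValueError("expected one to four dotted parts")
    *leading, last = parts
    # The last part fills all remaining bytes of the 32-bit address.
    if any(not 0 <= p <= 255 for p in leading) or not 0 <= last < 256 ** (4 - len(leading)):
        raise ValueError("part out of range")
    value = last
    for index, p in enumerate(leading):
        value += p << (8 * (3 - index))
    return ipaddress.IPv4Address(value)


BLOCKED = ipaddress.ip_network("127.0.0.0/8")


def is_blocked(host_literal: str) -> bool:
    return parse_legacy_ipv4(host_literal) in BLOCKED


print(is_blocked("0x7f.0.0.1"), "0x7f.0.0.1".startswith("127."))
```

A filter that looked only for the string prefix "127." would miss every one of the alternative spellings.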
|
2501
|
+
|
2502
|
+
7.5. Sensitive Information
|
2503
|
+
|
2504
|
+
URI producers should not provide a URI that contains a username or
|
2505
|
+
password that is intended to be secret. URIs are frequently
|
2506
|
+
displayed by browsers, stored in clear text bookmarks, and logged by
|
2507
|
+
user agent history and intermediary applications (proxies). A
|
2508
|
+
password appearing within the userinfo component is deprecated and
|
2509
|
+
should be considered an error (or simply ignored) except in those
|
2510
|
+
rare cases where the 'password' parameter is intended to be public.
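One way an application might avoid leaking userinfo into logs or bookmarks is to drop it before the URI is stored. The following is a sketch only, using Python's standard urllib.parse module; the function name strip_userinfo is hypothetical, and round-tripping through urlsplit and urlunsplit may normalize some edge cases (such as an empty query).

```python
from urllib.parse import urlsplit, urlunsplit


def strip_userinfo(uri: str) -> str:
    """Return the URI with any userinfo subcomponent removed from the authority."""
    parts = urlsplit(uri)
    # Everything after the last "@" in the authority is host[:port].
    host_and_port = parts.netloc.rpartition("@")[2]
    return urlunsplit(
        (parts.scheme, host_and_port, parts.path, parts.query, parts.fragment)
    )
```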
|
2511
|
+
|
2512
|
+
7.6. Semantic Attacks
|
2513
|
+
|
2514
|
+
Because the userinfo subcomponent is rarely used and appears before
|
2515
|
+
the host in the authority component, it can be used to construct a
|
2516
|
+
URI intended to mislead a human user by appearing to identify one
|
2517
|
+
(trusted) naming authority while actually identifying a different
|
2518
|
+
authority hidden behind the noise. For example
|
2519
|
+
|
2520
|
+
|
2521
|
+
|
2522
|
+
Berners-Lee, et al. Standards Track [Page 45]
|
2523
|
+
|
2524
|
+
RFC 3986 URI Generic Syntax January 2005
|
2525
|
+
|
2526
|
+
|
2527
|
+
ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm
|
2528
|
+
|
2529
|
+
might lead a human user to assume that the host is 'cnn.example.com',
|
2530
|
+
whereas it is actually '10.0.0.1'. Note that a misleading userinfo
|
2531
|
+
subcomponent could be much longer than the example above.
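A parser that follows the generic syntax is not misled by the example above. The sketch below uses Python's standard urllib.parse.urlsplit to separate the userinfo from the host; the helper name describe_authority is hypothetical and is shown only to make the two parts visible.

```python
from urllib.parse import urlsplit


def describe_authority(uri: str) -> dict:
    """Split the authority into the part a human may misread and the real host."""
    parts = urlsplit(uri)
    userinfo = parts.netloc.rpartition("@")[0] or None
    return {"userinfo": userinfo, "host": parts.hostname}


MISLEADING = "ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm"
```

A user agent could render the two values returned for MISLEADING differently, so that the host stands out from the userinfo.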
|
2532
|
+
|
2533
|
+
A misleading URI, such as that above, is an attack on the user's
|
2534
|
+
preconceived notions about the meaning of a URI rather than an attack
|
2535
|
+
on the software itself. User agents may be able to reduce the impact
|
2536
|
+
of such attacks by distinguishing the various components of the URI
|
2537
|
+
when they are rendered, such as by using a different color or tone to
|
2538
|
+
render userinfo if any is present, though there is no panacea. More
|
2539
|
+
information on URI-based semantic attacks can be found in [Siedzik].
|
2540
|
+
|
2541
|
+
8. IANA Considerations
|
2542
|
+
|
2543
|
+
URI scheme names, as defined by <scheme> in Section 3.1, form a
|
2544
|
+
registered namespace that is managed by IANA according to the
|
2545
|
+
procedures defined in [BCP35]. No IANA actions are required by this
|
2546
|
+
document.
|
2547
|
+
|
2548
|
+
9. Acknowledgements
|
2549
|
+
|
2550
|
+
This specification is derived from RFC 2396 [RFC2396], RFC 1808
|
2551
|
+
[RFC1808], and RFC 1738 [RFC1738]; the acknowledgements in those
|
2552
|
+
documents still apply. It also incorporates the update (with
|
2553
|
+
corrections) for IPv6 literals in the host syntax, as defined by
|
2554
|
+
Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in
|
2555
|
+
[RFC2732]. In addition, contributions by Gisle Aas, Reese Anschultz,
|
2556
|
+
Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll,
|
2557
|
+
Dan Connolly, Adam M. Costello, John Cowan, Jason Diamond, Martin
|
2558
|
+
Duerst, Stefan Eissing, Clive D.W. Feather, Al Gilman, Tony Hammond,
|
2559
|
+
Elliotte Harold, Pat Hayes, Henry Holtzman, Ian B. Jacobs, Michael
|
2560
|
+
Kay, John C. Klensin, Graham Klyne, Dan Kohn, Bruce Lilly, Andrew
|
2561
|
+
Main, Dave McAlpin, Ira McDonald, Michael Mealling, Ray Merkert,
|
2562
|
+
Stephen Pollei, Julian Reschke, Tomas Rokicki, Miles Sabin, Kai
|
2563
|
+
Schaetzl, Mark Thomson, Ronald Tschalaer, Norm Walsh, Marc Warne,
|
2564
|
+
Stuart Williams, and Henry Zongaro are gratefully acknowledged.
|
2565
|
+
|
2566
|
+
10. References
|
2567
|
+
|
2568
|
+
10.1. Normative References
|
2569
|
+
|
2570
|
+
[ASCII] American National Standards Institute, "Coded Character
|
2571
|
+
Set -- 7-bit American Standard Code for Information
|
2572
|
+
Interchange", ANSI X3.4, 1986.
|
2573
|
+
|
2583
|
+
[RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
|
2584
|
+
Specifications: ABNF", RFC 2234, November 1997.
|
2585
|
+
|
2586
|
+
[STD63] Yergeau, F., "UTF-8, a transformation format of
|
2587
|
+
ISO 10646", STD 63, RFC 3629, November 2003.
|
2588
|
+
|
2589
|
+
[UCS] International Organization for Standardization,
|
2590
|
+
"Information Technology - Universal Multiple-Octet Coded
|
2591
|
+
Character Set (UCS)", ISO/IEC 10646:2003, December 2003.
|
2592
|
+
|
2593
|
+
10.2. Informative References
|
2594
|
+
|
2595
|
+
[BCP19] Freed, N. and J. Postel, "IANA Charset Registration
|
2596
|
+
Procedures", BCP 19, RFC 2978, October 2000.
|
2597
|
+
|
2598
|
+
[BCP35] Petke, R. and I. King, "Registration Procedures for URL
|
2599
|
+
Scheme Names", BCP 35, RFC 2717, November 1999.
|
2600
|
+
|
2601
|
+
[RFC0952] Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
|
2602
|
+
host table specification", RFC 952, October 1985.
|
2603
|
+
|
2604
|
+
[RFC1034] Mockapetris, P., "Domain names - concepts and facilities",
|
2605
|
+
STD 13, RFC 1034, November 1987.
|
2606
|
+
|
2607
|
+
[RFC1123] Braden, R., "Requirements for Internet Hosts - Application
|
2608
|
+
and Support", STD 3, RFC 1123, October 1989.
|
2609
|
+
|
2610
|
+
[RFC1535] Gavron, E., "A Security Problem and Proposed Correction
|
2611
|
+
With Widely Deployed DNS Software", RFC 1535,
|
2612
|
+
October 1993.
|
2613
|
+
|
2614
|
+
[RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A
|
2615
|
+
Unifying Syntax for the Expression of Names and Addresses
|
2616
|
+
of Objects on the Network as used in the World-Wide Web",
|
2617
|
+
RFC 1630, June 1994.
|
2618
|
+
|
2619
|
+
[RFC1736] Kunze, J., "Functional Recommendations for Internet
|
2620
|
+
Resource Locators", RFC 1736, February 1995.
|
2621
|
+
|
2622
|
+
[RFC1737] Sollins, K. and L. Masinter, "Functional Requirements for
|
2623
|
+
Uniform Resource Names", RFC 1737, December 1994.
|
2624
|
+
|
2625
|
+
[RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
|
2626
|
+
Resource Locators (URL)", RFC 1738, December 1994.
|
2627
|
+
|
2628
|
+
[RFC1808] Fielding, R., "Relative Uniform Resource Locators",
|
2629
|
+
RFC 1808, June 1995.
|
2630
|
+
|
2639
|
+
[RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
|
2640
|
+
Extensions (MIME) Part Two: Media Types", RFC 2046,
|
2641
|
+
November 1996.
|
2642
|
+
|
2643
|
+
[RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.
|
2644
|
+
|
2645
|
+
[RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
|
2646
|
+
Resource Identifiers (URI): Generic Syntax", RFC 2396,
|
2647
|
+
August 1998.
|
2648
|
+
|
2649
|
+
[RFC2518] Goland, Y., Whitehead, E., Faizi, A., Carter, S., and D.
|
2650
|
+
Jensen, "HTTP Extensions for Distributed Authoring --
|
2651
|
+
WEBDAV", RFC 2518, February 1999.
|
2652
|
+
|
2653
|
+
[RFC2557] Palme, J., Hopmann, A., and N. Shelness, "MIME
|
2654
|
+
Encapsulation of Aggregate Documents, such as HTML
|
2655
|
+
(MHTML)", RFC 2557, March 1999.
|
2656
|
+
|
2657
|
+
[RFC2718] Masinter, L., Alvestrand, H., Zigmond, D., and R. Petke,
|
2658
|
+
"Guidelines for new URL Schemes", RFC 2718, November 1999.
|
2659
|
+
|
2660
|
+
[RFC2732] Hinden, R., Carpenter, B., and L. Masinter, "Format for
|
2661
|
+
Literal IPv6 Addresses in URL's", RFC 2732, December 1999.
|
2662
|
+
|
2663
|
+
[RFC3305] Mealling, M. and R. Denenberg, "Report from the Joint
|
2664
|
+
W3C/IETF URI Planning Interest Group: Uniform Resource
|
2665
|
+
Identifiers (URIs), URLs, and Uniform Resource Names
|
2666
|
+
(URNs): Clarifications and Recommendations", RFC 3305,
|
2667
|
+
August 2002.
|
2668
|
+
|
2669
|
+
[RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
|
2670
|
+
"Internationalizing Domain Names in Applications (IDNA)",
|
2671
|
+
RFC 3490, March 2003.
|
2672
|
+
|
2673
|
+
[RFC3513] Hinden, R. and S. Deering, "Internet Protocol Version 6
|
2674
|
+
(IPv6) Addressing Architecture", RFC 3513, April 2003.
|
2675
|
+
|
2676
|
+
[Siedzik] Siedzik, R., "Semantic Attacks: What's in a URL?",
|
2677
|
+
April 2001, <http://www.giac.org/practical/gsec/
|
2678
|
+
Richard_Siedzik_GSEC.pdf>.
|
2679
|
+
|
2695
|
+
Appendix A. Collected ABNF for URI
|
2696
|
+
|
2697
|
+
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
|
2698
|
+
|
2699
|
+
hier-part = "//" authority path-abempty
|
2700
|
+
/ path-absolute
|
2701
|
+
/ path-rootless
|
2702
|
+
/ path-empty
|
2703
|
+
|
2704
|
+
URI-reference = URI / relative-ref
|
2705
|
+
|
2706
|
+
absolute-URI = scheme ":" hier-part [ "?" query ]
|
2707
|
+
|
2708
|
+
relative-ref = relative-part [ "?" query ] [ "#" fragment ]
|
2709
|
+
|
2710
|
+
relative-part = "//" authority path-abempty
|
2711
|
+
/ path-absolute
|
2712
|
+
/ path-noscheme
|
2713
|
+
/ path-empty
|
2714
|
+
|
2715
|
+
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
|
2716
|
+
|
2717
|
+
authority = [ userinfo "@" ] host [ ":" port ]
|
2718
|
+
userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
|
2719
|
+
host = IP-literal / IPv4address / reg-name
|
2720
|
+
port = *DIGIT
|
2721
|
+
|
2722
|
+
IP-literal = "[" ( IPv6address / IPvFuture ) "]"
|
2723
|
+
|
2724
|
+
IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
|
2725
|
+
|
2726
|
+
IPv6address = 6( h16 ":" ) ls32
|
2727
|
+
/ "::" 5( h16 ":" ) ls32
|
2728
|
+
/ [ h16 ] "::" 4( h16 ":" ) ls32
|
2729
|
+
/ [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
|
2730
|
+
/ [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
|
2731
|
+
/ [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
|
2732
|
+
/ [ *4( h16 ":" ) h16 ] "::" ls32
|
2733
|
+
/ [ *5( h16 ":" ) h16 ] "::" h16
|
2734
|
+
/ [ *6( h16 ":" ) h16 ] "::"
|
2735
|
+
|
2736
|
+
h16 = 1*4HEXDIG
|
2737
|
+
ls32 = ( h16 ":" h16 ) / IPv4address
|
2738
|
+
IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
|
2739
|
+
|
2751
|
+
dec-octet = DIGIT ; 0-9
|
2752
|
+
/ %x31-39 DIGIT ; 10-99
|
2753
|
+
/ "1" 2DIGIT ; 100-199
|
2754
|
+
/ "2" %x30-34 DIGIT ; 200-249
|
2755
|
+
/ "25" %x30-35 ; 250-255
|
2756
|
+
|
2757
|
+
reg-name = *( unreserved / pct-encoded / sub-delims )
|
2758
|
+
|
2759
|
+
path = path-abempty ; begins with "/" or is empty
|
2760
|
+
/ path-absolute ; begins with "/" but not "//"
|
2761
|
+
/ path-noscheme ; begins with a non-colon segment
|
2762
|
+
/ path-rootless ; begins with a segment
|
2763
|
+
/ path-empty ; zero characters
|
2764
|
+
|
2765
|
+
path-abempty = *( "/" segment )
|
2766
|
+
path-absolute = "/" [ segment-nz *( "/" segment ) ]
|
2767
|
+
path-noscheme = segment-nz-nc *( "/" segment )
|
2768
|
+
path-rootless = segment-nz *( "/" segment )
|
2769
|
+
path-empty = 0<pchar>
|
2770
|
+
|
2771
|
+
segment = *pchar
|
2772
|
+
segment-nz = 1*pchar
|
2773
|
+
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
|
2774
|
+
; non-zero-length segment without any colon ":"
|
2775
|
+
|
2776
|
+
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
|
2777
|
+
|
2778
|
+
query = *( pchar / "/" / "?" )
|
2779
|
+
|
2780
|
+
fragment = *( pchar / "/" / "?" )
|
2781
|
+
|
2782
|
+
pct-encoded = "%" HEXDIG HEXDIG
|
2783
|
+
|
2784
|
+
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
|
2785
|
+
reserved = gen-delims / sub-delims
|
2786
|
+
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
|
2787
|
+
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
|
2788
|
+
/ "*" / "+" / "," / ";" / "="
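Individual rules of the collected ABNF translate fairly directly into regular expressions. The following sketch covers only the scheme, dec-octet, and IPv4address rules; the constant and function names are hypothetical, and the translation is an illustration rather than a complete validator for the grammar.

```python
import re

# dec-octet: 0-9 / 10-99 / 100-199 / 200-249 / 250-255 (no leading zeros)
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*")


def is_ipv4address(text: str) -> bool:
    """True if the whole string matches the IPv4address rule."""
    return IPV4ADDRESS.fullmatch(text) is not None


def is_scheme(text: str) -> bool:
    """True if the whole string matches the scheme rule."""
    return SCHEME.fullmatch(text) is not None
```

Note that the rare forms discussed in the security considerations, such as "127.1" or octets with leading zeros, do not match IPv4address under this translation.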
|
2789
|
+
|
2790
|
+
Appendix B. Parsing a URI Reference with a Regular Expression
|
2791
|
+
|
2792
|
+
As the "first-match-wins" algorithm is identical to the "greedy"
|
2793
|
+
disambiguation method used by POSIX regular expressions, it is
|
2794
|
+
natural and commonplace to use a regular expression for parsing the
|
2795
|
+
potential five components of a URI reference.
|
2796
|
+
|
2797
|
+
The following line is the regular expression for breaking down a
|
2798
|
+
well-formed URI reference into its components.
|
2799
|
+
|
2807
|
+
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
|
2808
|
+
 12            3  4          5       6  7        8 9
|
2809
|
+
|
2810
|
+
The numbers in the second line above are only to assist readability;
|
2811
|
+
they indicate the reference points for each subexpression (i.e., each
|
2812
|
+
paired parenthesis). We refer to the value matched for subexpression
|
2813
|
+
<n> as $<n>. For example, matching the above expression to
|
2814
|
+
|
2815
|
+
http://www.ics.uci.edu/pub/ietf/uri/#Related
|
2816
|
+
|
2817
|
+
results in the following subexpression matches:
|
2818
|
+
|
2819
|
+
$1 = http:
|
2820
|
+
$2 = http
|
2821
|
+
$3 = //www.ics.uci.edu
|
2822
|
+
$4 = www.ics.uci.edu
|
2823
|
+
$5 = /pub/ietf/uri/
|
2824
|
+
$6 = <undefined>
|
2825
|
+
$7 = <undefined>
|
2826
|
+
$8 = #Related
|
2827
|
+
$9 = Related
|
2828
|
+
|
2829
|
+
where <undefined> indicates that the component is not present, as is
|
2830
|
+
the case for the query component in the above example. Therefore, we
|
2831
|
+
can determine the value of the five components as
|
2832
|
+
|
2833
|
+
scheme = $2
|
2834
|
+
authority = $4
|
2835
|
+
path = $5
|
2836
|
+
query = $7
|
2837
|
+
fragment = $9
|
2838
|
+
|
2839
|
+
Going in the opposite direction, we can recreate a URI reference from
|
2840
|
+
its components by using the algorithm of Section 5.3.
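The expression above can be used as written with most regular expression libraries. The sketch below applies it with Python's re module and maps the subexpressions to the five components; the names Components, split_uri_reference, and recompose are hypothetical. The recompose function follows the recomposition steps as commonly summarized from Section 5.3 (emit a delimiter only when its component is defined), which is an assumption stated here rather than a quotation of that section.

```python
import re
from typing import NamedTuple, Optional

URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


class Components(NamedTuple):
    scheme: Optional[str]
    authority: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]


def split_uri_reference(reference: str) -> Components:
    """Split a URI reference into scheme, authority, path, query, fragment."""
    match = URI_REFERENCE.match(reference)
    assert match is not None  # every part is optional, so the pattern always matches
    # Undefined components come back as None, mirroring <undefined> in the text.
    return Components(
        match.group(2), match.group(4), match.group(5),
        match.group(7), match.group(9),
    )


def recompose(parts: Components) -> str:
    """Rebuild the reference, adding delimiters only for defined components."""
    result = ""
    if parts.scheme is not None:
        result += parts.scheme + ":"
    if parts.authority is not None:
        result += "//" + parts.authority
    result += parts.path
    if parts.query is not None:
        result += "?" + parts.query
    if parts.fragment is not None:
        result += "#" + parts.fragment
    return result
```

Keeping undefined (None) distinct from empty ("") is what allows a reference such as "//example.com?q#" to survive a split and recompose unchanged.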
|
2841
|
+
|
2842
|
+
Appendix C. Delimiting a URI in Context
|
2843
|
+
|
2844
|
+
URIs are often transmitted through formats that do not provide a
|
2845
|
+
clear context for their interpretation. For example, there are many
|
2846
|
+
occasions when a URI is included in plain text; examples include text
|
2847
|
+
sent in email, USENET news, and on printed paper. In such cases, it
|
2848
|
+
is important to be able to delimit the URI from the rest of the text,
|
2849
|
+
and in particular from punctuation marks that might be mistaken for
|
2850
|
+
part of the URI.
|
2851
|
+
|
2852
|
+
In practice, URIs are delimited in a variety of ways, but usually
|
2853
|
+
within double-quotes "http://example.com/", angle brackets
|
2854
|
+
<http://example.com/>, or just by using whitespace:
|
2855
|
+
|
2863
|
+
http://example.com/
|
2864
|
+
|
2865
|
+
These wrappers do not form part of the URI.
|
2866
|
+
|
2867
|
+
In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may
|
2868
|
+
have to be added to break a long URI across lines. The whitespace
|
2869
|
+
should be ignored when the URI is extracted.
|
2870
|
+
|
2871
|
+
No whitespace should be introduced after a hyphen ("-") character.
|
2872
|
+
Because some typesetters and printers may (erroneously) introduce a
|
2873
|
+
hyphen at the end of line when breaking it, the interpreter of a URI
|
2874
|
+
containing a line break immediately after a hyphen should ignore all
|
2875
|
+
whitespace around the line break and should be aware that the hyphen
|
2876
|
+
may or may not actually be part of the URI.
|
2877
|
+
|
2878
|
+
Using <> angle brackets around each URI is especially recommended as
|
2879
|
+
a delimiting style for a reference that contains embedded whitespace.
|
2880
|
+
|
2881
|
+
The prefix "URL:" (with or without a trailing space) was formerly
|
2882
|
+
recommended as a way to help distinguish a URI from other bracketed
|
2883
|
+
designators, though it is not commonly used in practice and is no
|
2884
|
+
longer recommended.
|
2885
|
+
|
2886
|
+
For robustness, software that accepts user-typed URIs should attempt
|
2887
|
+
to recognize and strip both delimiters and embedded whitespace.
|
2888
|
+
|
2889
|
+
For example, the text
|
2890
|
+
|
2891
|
+
Yes, Jim, I found it under "http://www.w3.org/Addressing/",
|
2892
|
+
but you can probably pick it up from <ftp://foo.example.
|
2893
|
+
com/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/
|
2894
|
+
ietf/uri/historical.html#WARNING>.
|
2895
|
+
|
2896
|
+
contains the URI references
|
2897
|
+
|
2898
|
+
http://www.w3.org/Addressing/
|
2899
|
+
ftp://foo.example.com/rfc/
|
2900
|
+
http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING
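The extraction described in this appendix can be sketched as follows. This is an illustration under simplifying assumptions: the function name is hypothetical, only double-quote and angle-bracket delimiters are handled, every delimited span is treated as a candidate URI reference without further validation, and the hyphen-at-line-break ambiguity is not addressed.

```python
import re

# A span wrapped in <...> or "..."; both character classes also match line breaks.
_DELIMITED = re.compile(r'<([^<>]*)>|"([^"]*)"')

SAMPLE = (
    'Yes, Jim, I found it under "http://www.w3.org/Addressing/",\n'
    "but you can probably pick it up from <ftp://foo.example.\n"
    "com/rfc/>.  Note the warning in <http://www.ics.uci.edu/pub/\n"
    "ietf/uri/historical.html#WARNING>.\n"
)


def extract_delimited_uris(text: str) -> list:
    """Return candidate URI references with wrappers and whitespace stripped."""
    found = []
    for match in _DELIMITED.finditer(text):
        candidate = match.group(1) if match.group(1) is not None else match.group(2)
        # Whitespace added to break a long URI across lines is ignored.
        candidate = re.sub(r"\s+", "", candidate)
        # The formerly recommended "URL:" prefix is not part of the URI.
        if candidate.startswith("URL:"):
            candidate = candidate[len("URL:"):]
        if candidate:
            found.append(candidate)
    return found
```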
|
2901
|
+
|
2919
|
+
Appendix D. Changes from RFC 2396
|
2920
|
+
|
2921
|
+
D.1. Additions
|
2922
|
+
|
2923
|
+
An ABNF rule for URI has been introduced to correspond to one common
|
2924
|
+
usage of the term: an absolute URI with optional fragment.
|
2925
|
+
|
2926
|
+
IPv6 (and later) literals have been added to the list of possible
|
2927
|
+
identifiers for the host portion of an authority component, as
|
2928
|
+
described by [RFC2732], with the addition of "[" and "]" to the
|
2929
|
+
reserved set and a version flag to anticipate future versions of IP
|
2930
|
+
literals. Square brackets are now specified as reserved within the
|
2931
|
+
authority component and are not allowed outside their use as
|
2932
|
+
delimiters for an IP literal within host. In order to make this
|
2933
|
+
change without changing the technical definition of the path, query,
|
2934
|
+
and fragment components, those rules were redefined to directly
|
2935
|
+
specify the characters allowed.
|
2936
|
+
|
2937
|
+
As [RFC2732] defers to [RFC3513] for definition of an IPv6 literal
|
2938
|
+
address, which, unfortunately, lacks an ABNF description of
|
2939
|
+
IPv6address, we created a new ABNF rule for IPv6address that matches
|
2940
|
+
the text representations defined by Section 2.2 of [RFC3513].
|
2941
|
+
Likewise, the definition of IPv4address has been improved in order to
|
2942
|
+
limit each decimal octet to the range 0-255.
|
2943
|
+
|
2944
|
+
Section 6, on URI normalization and comparison, has been completely
|
2945
|
+
rewritten and extended by using input from Tim Bray and discussion
|
2946
|
+
within the W3C Technical Architecture Group.
|
2947
|
+
|
2948
|
+
D.2. Modifications
|
2949
|
+
|
2950
|
+
The ad-hoc BNF syntax of RFC 2396 has been replaced with the ABNF of
|
2951
|
+
[RFC2234]. This change required all rule names that formerly
|
2952
|
+
included underscore characters to be renamed with a dash instead. In
|
2953
|
+
addition, a number of syntax rules have been eliminated or simplified
|
2954
|
+
to make the overall grammar more comprehensible. Specifications that
|
2955
|
+
refer to the obsolete grammar rules may be understood by replacing
|
2956
|
+
those rules according to the following table:
|
2957
|
+
|
2975
|
+
+----------------+--------------------------------------------------+
| obsolete rule  | translation                                      |
+----------------+--------------------------------------------------+
| absoluteURI    | absolute-URI                                     |
| relativeURI    | relative-part [ "?" query ]                      |
| hier_part      | ( "//" authority path-abempty /                  |
|                |     path-absolute ) [ "?" query ]                |
|                |                                                  |
| opaque_part    | path-rootless [ "?" query ]                      |
| net_path       | "//" authority path-abempty                      |
| abs_path       | path-absolute                                    |
| rel_path       | path-rootless                                    |
| rel_segment    | segment-nz-nc                                    |
| reg_name       | reg-name                                         |
| server         | authority                                        |
| hostport       | host [ ":" port ]                                |
| hostname       | reg-name                                         |
| path_segments  | path-abempty                                     |
| param          | *<pchar excluding ";">                           |
|                |                                                  |
| uric           | unreserved / pct-encoded / ";" / "?" / ":"       |
|                |  / "@" / "&" / "=" / "+" / "$" / "," / "/"       |
|                |                                                  |
| uric_no_slash  | unreserved / pct-encoded / ";" / "?" / ":"       |
|                |  / "@" / "&" / "=" / "+" / "$" / ","             |
|                |                                                  |
| mark           | "-" / "_" / "." / "!" / "~" / "*" / "'"          |
|                |  / "(" / ")"                                     |
|                |                                                  |
| escaped        | pct-encoded                                      |
| hex            | HEXDIG                                           |
| alphanum       | ALPHA / DIGIT                                    |
+----------------+--------------------------------------------------+
|
3008
|
+
|
3009
|
+
Use of the above obsolete rules for the definition of scheme-specific
|
3010
|
+
syntax is deprecated.
|
3011
|
+
|
3012
|
+
Section 2, on characters, has been rewritten to explain what
|
3013
|
+
characters are reserved, when they are reserved, and why they are
|
3014
|
+
reserved, even when they are not used as delimiters by the generic
|
3015
|
+
syntax. The mark characters that are typically unsafe to decode,
|
3016
|
+
including the exclamation mark ("!"), asterisk ("*"), single-quote
|
3017
|
+
("'"), and open and close parentheses ("(" and ")"), have been moved
|
3018
|
+
to the reserved set in order to clarify the distinction between
|
3019
|
+
reserved and unreserved and, hopefully, to answer the most common
|
3020
|
+
question of scheme designers. Likewise, the section on
|
3021
|
+
percent-encoded characters has been rewritten, and URI normalizers
|
3022
|
+
are now given license to decode any percent-encoded octets
|
3031
|
+
corresponding to unreserved characters. In general, the terms
|
3032
|
+
"escaped" and "unescaped" have been replaced with "percent-encoded"
|
3033
|
+
and "decoded", respectively, to reduce confusion with other forms of
|
3034
|
+
escape mechanisms.
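The license given to normalizers can be illustrated with a short sketch. The function name is hypothetical. Decoding is limited to octets that correspond to unreserved characters, as stated above; uppercasing the hexadecimal digits of the remaining percent-encoded octets is an additional normalization assumed here from the document's section on comparison, not something this paragraph requires.

```python
import re
import string

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")
_PCT_ENCODED = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(uri: str) -> str:
    """Decode percent-encoded unreserved characters; leave all others encoded."""

    def replace(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Reserved and other octets stay encoded; only the hex case changes.
        return "%" + match.group(1).upper()

    return _PCT_ENCODED.sub(replace, uri)
```

Reserved characters such as "/" are deliberately left encoded, since decoding them could change how the URI is parsed.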
|
3035
|
+
|
3036
|
+
The ABNF for URI and URI-reference has been redesigned to make them
|
3037
|
+
more friendly to LALR parsers and to reduce complexity. As a result,
|
3038
|
+
the layout form of syntax description has been removed, along with
|
3039
|
+
the uric, uric_no_slash, opaque_part, net_path, abs_path, rel_path,
|
3040
|
+
path_segments, rel_segment, and mark rules. All references to
|
3041
|
+
"opaque" URIs have been replaced with a better description of how the
|
3042
|
+
path component may be opaque to hierarchy. The relativeURI rule has
|
3043
|
+
been replaced with relative-ref to avoid unnecessary confusion over
|
3044
|
+
whether they are a subset of URI. The ambiguity regarding the
|
3045
|
+
parsing of URI-reference as a URI or a relative-ref with a colon in
|
3046
|
+
the first segment has been eliminated through the use of five
|
3047
|
+
separate path matching rules.
|
3048
|
+
|
3049
|
+
The fragment identifier has been moved back into the section on
|
3050
|
+
generic syntax components and within the URI and relative-ref rules,
|
3051
|
+
though it remains excluded from absolute-URI. The number sign ("#")
|
3052
|
+
character has been moved back to the reserved set as a result of
|
3053
|
+
reintegrating the fragment syntax.
|
3054
|
+
|
3055
|
+
The ABNF has been corrected to allow the path component to be empty.
|
3056
|
+
This also allows an absolute-URI to consist of nothing after the
|
3057
|
+
"scheme:", as is present in practice with the "dav:" namespace
|
3058
|
+
[RFC2518] and with the "about:" scheme used internally by many WWW
|
3059
|
+
browser implementations. The ambiguity regarding the boundary
|
3060
|
+
between authority and path has been eliminated through the use of
|
3061
|
+
five separate path matching rules.
|
3062
|
+
|
3063
|
+
Registry-based naming authorities that use the generic syntax are now
|
3064
|
+
defined within the host rule. This change allows current
|
3065
|
+
implementations, where whatever name provided is simply fed to the
|
3066
|
+
local name resolution mechanism, to be consistent with the
|
3067
|
+
specification. It also removes the need to re-specify DNS name
|
3068
|
+
formats here. Furthermore, it allows the host component to contain
|
3069
|
+
percent-encoded octets, which is necessary to enable
|
3070
|
+
internationalized domain names to be provided in URIs, processed in
|
3071
|
+
their native character encodings at the application layers above URI
|
3072
|
+
processing, and passed to an IDNA library as a registered name in the
|
3073
|
+
UTF-8 character encoding. The server, hostport, hostname,
|
3074
|
+
domainlabel, toplabel, and alphanum rules have been removed.
|
3075
|
+
|
3076
|
+
The resolving relative references algorithm of [RFC2396] has been
|
3077
|
+
rewritten with pseudocode for this revision to improve clarity and
|
3078
|
+
fix the following issues:
|
3079
|
+
|
3087
|
+
o [RFC2396] section 5.2, step 6a, failed to account for a base URI
|
3088
|
+
with no path.
|
3089
|
+
|
3090
|
+
o Restored the behavior of [RFC1808] where, if the reference
|
3091
|
+
contains an empty path and a defined query component, the target
|
3092
|
+
URI inherits the base URI's path component.
|
3093
|
+
|
3094
|
+
o The determination of whether a URI reference is a same-document
|
3095
|
+
reference has been decoupled from the URI parser, simplifying the
|
3096
|
+
URI processing interface within applications in a way consistent
|
3097
|
+
with the internal architecture of deployed URI processing
|
3098
|
+
implementations. The determination is now based on comparison to
|
3099
|
+
the base URI after transforming a reference to absolute form,
|
3100
|
+
rather than on the format of the reference itself. This change
|
3101
|
+
may result in more references being considered "same-document"
|
3102
|
+
under this specification than there would be under the rules given
|
3103
|
+
in RFC 2396, especially when normalization is used to reduce
|
3104
|
+
aliases. However, it does not change the status of existing
|
3105
|
+
same-document references.
|
3106
|
+
|
3107
|
+
o Separated the path merge routine into two routines: merge, for
|
3108
|
+
describing combination of the base URI path with a relative-path
|
3109
|
+
reference, and remove_dot_segments, for describing how to remove
|
3110
|
+
the special "." and ".." segments from a composed path. The
|
3111
|
+
remove_dot_segments algorithm is now applied to all URI reference
|
3112
|
+
paths in order to match common implementations and to improve the
|
3113
|
+
normalization of URIs in practice. This change only impacts the
|
3114
|
+
parsing of abnormal references and same-scheme references wherein
|
3115
|
+
the base URI has a non-hierarchical path.
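The two routines named in the last item can be sketched as follows. This is an illustrative reading of the merge and remove_dot_segments routines as they are commonly summarized from the reference resolution section, not the document's own pseudocode; the parameter names, including base_has_authority, are assumptions made for this example.

```python
def merge(base_path: str, reference_path: str, base_has_authority: bool) -> str:
    """Combine the base URI path with a relative-path reference."""
    if base_has_authority and base_path == "":
        return "/" + reference_path
    # Keep the base path up to and including its right-most "/", if any.
    return base_path[: base_path.rfind("/") + 1] + reference_path


def remove_dot_segments(path: str) -> str:
    """Remove the special "." and ".." segments from a composed path."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../" becomes "/"
            if output:
                output.pop()         # drop the previous segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            end = path.find("/", 1)
            if end == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:end])
                path = path[end:]
    return "".join(output)
```

For instance, merging the base path "/b/c/d;p" with the reference "../../g" and then removing dot segments yields "/g" under this sketch.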
|
3116
|
+
|
3117
|
+
Index
|
3118
|
+
|
3119
|
+
A
|
3120
|
+
ABNF 11
|
3121
|
+
absolute 27
|
3122
|
+
absolute-path 26
|
3123
|
+
absolute-URI 27
|
3124
|
+
access 9
|
3125
|
+
authority 17, 18
|
3126
|
+
|
3127
|
+
B
|
3128
|
+
base URI 28
|
3129
|
+
|
3130
|
+
C
|
3131
|
+
character encoding 4
|
3132
|
+
character 4
|
3133
|
+
characters 8, 11
|
3134
|
+
coded character set 4
|
3135
|
+
|
3143
|
+
D
|
3144
|
+
dec-octet 20
|
3145
|
+
dereference 9
|
3146
|
+
dot-segments 23
|
3147
|
+
|
3148
|
+
F
|
3149
|
+
fragment 16, 24
|
3150
|
+
|
3151
|
+
G
|
3152
|
+
gen-delims 13
|
3153
|
+
generic syntax 6
|
3154
|
+
|
3155
|
+
H
|
3156
|
+
h16 20
|
3157
|
+
hier-part 16
|
3158
|
+
hierarchical 10
|
3159
|
+
host 18
|
3160
|
+
|
3161
|
+
I
|
3162
|
+
identifier 5
|
3163
|
+
IP-literal 19
|
3164
|
+
IPv4 20
|
3165
|
+
IPv4address 19, 20
|
3166
|
+
IPv6 19
|
3167
|
+
IPv6address 19, 20
|
3168
|
+
IPvFuture 19
|
3169
|
+
|
3170
|
+
L
|
3171
|
+
locator 7
|
3172
|
+
ls32 20
|
3173
|
+
|
3174
|
+
M
|
3175
|
+
merge 32
|
3176
|
+
|
3177
|
+
N
|
3178
|
+
name 7
|
3179
|
+
network-path 26
|
3180
|
+
|
3181
|
+
P
|
3182
|
+
path 16, 22, 26
|
3183
|
+
path-abempty 22
|
3184
|
+
path-absolute 22
|
3185
|
+
path-empty 22
|
3186
|
+
path-noscheme 22
|
3187
|
+
path-rootless 22
|
3188
|
+
path-abempty 16, 22, 26
|
3189
|
+
path-absolute 16, 22, 26
|
3190
|
+
path-empty 16, 22, 26
|
3199
|
+
path-rootless 16, 22
|
3200
|
+
pchar 23
|
3201
|
+
pct-encoded 12
|
3202
|
+
percent-encoding 12
|
3203
|
+
port 22
|
3204
|
+
|
3205
|
+
Q
|
3206
|
+
query 16, 23
|
3207
|
+
|
3208
|
+
R
|
3209
|
+
reg-name 21
|
3210
|
+
registered name 20
|
3211
|
+
relative 10, 28
|
3212
|
+
relative-path 26
|
3213
|
+
relative-ref 26
|
3214
|
+
remove_dot_segments 33
|
3215
|
+
representation 9
|
3216
|
+
reserved 12
|
3217
|
+
resolution 9, 28
|
3218
|
+
resource 5
|
3219
|
+
retrieval 9
|
3220
|
+
|
3221
|
+
S
|
3222
|
+
same-document 27
|
3223
|
+
sameness 9
|
3224
|
+
scheme 16, 17
|
3225
|
+
segment 22, 23
|
3226
|
+
segment-nz 23
|
3227
|
+
segment-nz-nc 23
|
3228
|
+
sub-delims 13
|
3229
|
+
suffix 27
|
3230
|
+
|
3231
|
+
T
|
3232
|
+
transcription 8
|
3233
|
+
|
3234
|
+
U
|
3235
|
+
uniform 4
|
3236
|
+
unreserved 13
|
3237
|
+
URI grammar
|
3238
|
+
absolute-URI 27
|
3239
|
+
ALPHA 11
|
3240
|
+
authority 18
|
3241
|
+
CR 11
|
3242
|
+
dec-octet 20
|
3243
|
+
DIGIT 11
|
3244
|
+
DQUOTE 11
|
3245
|
+
fragment 24
|
3246
|
+
gen-delims 13
|
3255
|
+
h16 20
|
3256
|
+
HEXDIG 11
|
3257
|
+
hier-part 16
|
3258
|
+
host 19
|
3259
|
+
IP-literal 19
|
3260
|
+
IPv4address 20
|
3261
|
+
IPv6address 20
|
3262
|
+
IPvFuture 19
|
3263
|
+
LF 11
|
3264
|
+
ls32 20
|
3265
|
+
OCTET 11
|
3266
|
+
path 22
|
3267
|
+
path-abempty 22
|
3268
|
+
path-absolute 22
|
3269
|
+
path-empty 22
|
3270
|
+
path-noscheme 22
|
3271
|
+
path-rootless 22
|
3272
|
+
pchar 23
|
3273
|
+
pct-encoded 12
|
3274
|
+
port 22
|
3275
|
+
query 24
|
3276
|
+
reg-name 21
|
3277
|
+
relative-ref 26
|
3278
|
+
reserved 13
|
3279
|
+
scheme 17
|
3280
|
+
segment 23
|
3281
|
+
segment-nz 23
|
3282
|
+
segment-nz-nc 23
|
3283
|
+
SP 11
|
3284
|
+
sub-delims 13
|
3285
|
+
unreserved 13
|
3286
|
+
URI 16
|
3287
|
+
URI-reference 25
|
3288
|
+
userinfo 18
|
3289
|
+
URI 16
|
3290
|
+
URI-reference 25
|
3291
|
+
URL 7
|
3292
|
+
URN 7
|
3293
|
+
userinfo 18
|
3294
|
+
|
3311
|
+
Authors' Addresses
|
3312
|
+
|
3313
|
+
Tim Berners-Lee
|
3314
|
+
World Wide Web Consortium
|
3315
|
+
Massachusetts Institute of Technology
|
3316
|
+
77 Massachusetts Avenue
|
3317
|
+
Cambridge, MA 02139
|
3318
|
+
USA
|
3319
|
+
|
3320
|
+
Phone: +1-617-253-5702
|
3321
|
+
Fax: +1-617-258-5999
|
3322
|
+
EMail: timbl@w3.org
|
3323
|
+
URI: http://www.w3.org/People/Berners-Lee/
|
3324
|
+
|
3325
|
+
|
3326
|
+
Roy T. Fielding
|
3327
|
+
Day Software
|
3328
|
+
5251 California Ave., Suite 110
|
3329
|
+
Irvine, CA 92617
|
3330
|
+
USA
|
3331
|
+
|
3332
|
+
Phone: +1-949-679-2960
|
3333
|
+
Fax: +1-949-679-2972
|
3334
|
+
EMail: fielding@gbiv.com
|
3335
|
+
URI: http://roy.gbiv.com/
|
3336
|
+
|
3337
|
+
|
3338
|
+
Larry Masinter
|
3339
|
+
Adobe Systems Incorporated
|
3340
|
+
345 Park Ave
|
3341
|
+
San Jose, CA 95110
|
3342
|
+
USA
|
3343
|
+
|
3344
|
+
Phone: +1-408-536-3024
|
3345
|
+
EMail: LMM@acm.org
|
3346
|
+
URI: http://larry.masinter.net/
|
3366
|
+
|
3367
|
+
Full Copyright Statement
|
3368
|
+
|
3369
|
+
Copyright (C) The Internet Society (2005).
|
3370
|
+
|
3371
|
+
This document is subject to the rights, licenses and restrictions
|
3372
|
+
contained in BCP 78, and except as set forth therein, the authors
|
3373
|
+
retain all their rights.
|
3374
|
+
|
3375
|
+
This document and the information contained herein are provided on an
|
3376
|
+
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
|
3377
|
+
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
|
3378
|
+
ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
|
3379
|
+
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
|
3380
|
+
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
|
3381
|
+
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
|
3382
|
+
|
3383
|
+
Intellectual Property
|
3384
|
+
|
3385
|
+
The IETF takes no position regarding the validity or scope of any
|
3386
|
+
Intellectual Property Rights or other rights that might be claimed to
|
3387
|
+
pertain to the implementation or use of the technology described in
|
3388
|
+
this document or the extent to which any license under such rights
|
3389
|
+
might or might not be available; nor does it represent that it has
|
3390
|
+
made any independent effort to identify any such rights. Information
|
3391
|
+
on the IETF's procedures with respect to rights in IETF Documents can
|
3392
|
+
be found in BCP 78 and BCP 79.
|
3393
|
+
|
3394
|
+
Copies of IPR disclosures made to the IETF Secretariat and any
|
3395
|
+
assurances of licenses to be made available, or the result of an
|
3396
|
+
attempt made to obtain a general license or permission for the use of
|
3397
|
+
such proprietary rights by implementers or users of this
|
3398
|
+
specification can be obtained from the IETF on-line IPR repository at
|
3399
|
+
http://www.ietf.org/ipr.
|
3400
|
+
|
3401
|
+
The IETF invites any interested party to bring to its attention any
|
3402
|
+
copyrights, patents or patent applications, or other proprietary
|
3403
|
+
rights that may cover technology that may be required to implement
|
3404
|
+
this standard. Please address the information to the IETF at
|
3405
|
+
ietf-ipr@ietf.org.
|
3406
|
+
|
3407
|
+
|
3408
|
+
Acknowledgement
|
3409
|
+
|
3410
|
+
Funding for the RFC Editor function is currently provided by the
|
3411
|
+
Internet Society.
|