For the syntax itself there is little choice except for the order
and punctuation of the elements, and the acceptable characters and
escaping rules.
The extensibility requirement is met by allowing an arbitrary (but
registered) string to be used as a prefix. A prefix, rather than a
suffix, was chosen because left-to-right parsing is more common
than right-to-left. The choice of a colon to separate the prefix
from the rest of the URI was arbitrary.
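As an illustration of the prefix convention, the following is a minimal
sketch in Python. The function name split_prefix is hypothetical and
the error handling is an assumption. It only separates the prefix from
the rest of the string at the first colon, reading left to right.

```python
from typing import Tuple


def split_prefix(uri: str) -> Tuple[str, str]:
    """Split a URI at the first colon into (prefix, rest).

    Hypothetical helper for illustration only; it does not check
    whether the prefix is actually registered.
    """
    prefix, colon, rest = uri.partition(":")
    if not colon or not prefix:
        raise ValueError(f"no scheme prefix found in {uri!r}")
    return prefix, rest
```

Only the first colon is significant in this sketch, so any later colons
stay in the part that the scheme itself must decode.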
The decoding of the rest of the string is defined as a function of
the prefix. New prefixes are introduced for new schemes as
necessary, in agreement with the registration authority. The
registration of a new scheme clearly requires the definition of
the decoding of the URI into a given name space, and a definition
of the properties and, where applicable, resolution protocols, for
the name space.
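The idea that decoding is a function of the prefix can be sketched as a
lookup table from prefix to decoding function. Everything named here
(DECODERS, register, decode, decode_mailto) is hypothetical, and the
"mailto" decoder is a deliberately simplified assumption rather than a
definition of that scheme.

```python
from typing import Callable, Dict

# Hypothetical registry: scheme prefix -> decoder for the rest of the URI.
DECODERS: Dict[str, Callable[[str], dict]] = {}


def register(prefix: str, decoder: Callable[[str], dict]) -> None:
    """Record the decoder that a newly registered scheme defines."""
    DECODERS[prefix] = decoder


def decode(uri: str) -> dict:
    """Pick the decoder from the prefix and apply it to the rest."""
    prefix, colon, rest = uri.partition(":")
    if not colon:
        raise ValueError(f"no scheme prefix found in {uri!r}")
    if prefix not in DECODERS:
        raise ValueError(f"unregistered scheme {prefix!r}")
    return DECODERS[prefix](rest)


def decode_mailto(rest: str) -> dict:
    """Simplified example decoder: split a mailbox at the '@' sign."""
    local, _, domain = rest.partition("@")
    return {"local": local, "domain": domain}


register("mailto", decode_mailto)
```

The generic code knows nothing about the structure after the colon; a
new scheme is supported by registering one more decoder.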
The completeness requirement is easily met by allowing
particularly strange or plain binary names to be encoded in base
16 or 64 using the acceptable characters.
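A minimal sketch of such an encoding, using Python's standard base64
module. The function names are hypothetical. The base 64 variant shown
uses the "-" and "_" alphabet on the assumption that "+" and "/" are
not acceptable in the scheme in question; a real scheme would have to
state its own alphabet and its treatment of "=" padding.

```python
import base64


def encode_name_hex(name: bytes) -> str:
    """Encode an arbitrary binary name in base 16 (digits and A-F)."""
    return base64.b16encode(name).decode("ascii")


def decode_name_hex(text: str) -> bytes:
    """Recover the binary name from its base 16 form."""
    return base64.b16decode(text)


def encode_name_b64(name: bytes) -> str:
    """Encode a binary name in base 64 using the '-' and '_' alphabet."""
    return base64.urlsafe_b64encode(name).decode("ascii")


def decode_name_b64(text: str) -> bytes:
    """Recover the binary name from its base 64 form."""
    return base64.urlsafe_b64decode(text)
```

For example, encode_name_hex(b"\x00\xffA") gives "00FF41".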
The printability requirement could have been met by requiring all
schemes to encode characters not part of a basic set. This led to
many discussions of what the basic set should be. A difficult
case, for example, is an ISO Latin-1 string appearing in a URL:
within an application with ISO Latin-1 capability it can be
handled intact, but for transport in general the non-ASCII
characters need to be escaped.
The solution to this was to specify a safe set of characters, and
a general escaping scheme which may be used for encoding "unsafe"
characters. This "safe" set is suitable, for example, for use in
electronic mail. A URI written using only the safe set, with all
other characters escaped, is in canonical form.
The choice of escape character for introducing representations of
non-allowed characters also tends to be a matter of taste. The
ANSI standard for the C language uses the backslash character "\".
The use of this character on Unix command lines, however, can be a
problem, as it is interpreted by many shell programs and would
itself have to be escaped. It is also a character which is not
available on certain keyboards. The equals sign is commonly used
in the encoding of names having attribute=value pairs. The percent
sign was eventually chosen as a suitable escape character.
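The escaping scheme can be sketched as follows. The safe set used here
(ASCII letters, digits, and "-", "_", ".") is an assumption made for the
example and is not the set defined by this document. Treating the input
as ISO Latin-1 is likewise an assumption, and the names SAFE, escape and
unescape are hypothetical.

```python
import string

# Assumed safe set for illustration only.
SAFE = frozenset(string.ascii_letters + string.digits + "-_.")


def escape(text: str, safe=SAFE) -> str:
    """Replace every character outside `safe` by '%' and two hex digits.

    The text is encoded as ISO Latin-1 first, so characters outside
    Latin-1 raise UnicodeEncodeError in this sketch.
    """
    out = []
    for byte in text.encode("latin-1"):
        ch = chr(byte)
        out.append(ch if ch in safe else f"%{byte:02X}")
    return "".join(out)


def unescape(text: str) -> str:
    """Reverse escape(): turn each '%XX' back into one Latin-1 character."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "%":
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(text[i]))
            i += 1
    return out.decode("latin-1")
```

Because the percent sign is not in the safe set, it is itself escaped
as "%25", which keeps the encoding reversible.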
There is a conflict between the need to be able to represent many
characters including spaces within a URI directly, and the need to
be able to use a URI in environments which have limited character
sets or in which certain characters are prone to corruption. This
conflict has been resolved by use of a hexadecimal escaping
method which may be applied to any characters forbidden in a given
context. When URLs are moved between contexts, the set of
characters escaped may be enlarged or reduced unambiguously.
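Moving a string between contexts can be sketched by decoding all
escapes and then re-escaping with the set of characters that the new
context permits. This sketch uses Python's urllib.parse; the function
name re_escape is hypothetical, and the set of characters the library
never escapes (letters, digits, "_", ".", "-", "~") is the library's
own choice, not one taken from this document.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def re_escape(part: str, safe: str) -> str:
    """Re-express an escaped string for a context allowing `safe`.

    All '%XX' escapes are decoded to raw bytes, then every byte that
    is neither in `safe` nor in the library's always-safe set is
    escaped again. The percent sign is re-escaped unless listed.
    """
    raw = unquote_to_bytes(part)
    return quote_from_bytes(raw, safe=safe)
```

For example, re_escape("a%20b", safe=" ") reduces the escaped set and
gives "a b", while re_escape("a b", safe="") enlarges it and gives
"a%20b". The sketch is intended for a single component of a URI; if
`safe` contains a character that a scheme uses as a delimiter, an
escaped occurrence of it would lose its escaping, so a real
implementation would need to keep such characters escaped.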
The use of white space characters is risky in URIs to be printed
or sent by electronic mail, and the use of multiple white space
characters is very risky. This is because of the frequent
introduction of extraneous white space when lines are wrapped by
systems such as mail, or by the sheer necessity of a narrow column
width, and because of the inter-conversion of various forms of white
space which occurs during character code conversion and the
transfer of text between applications. This is why the canonical
form for URIs has all white space encoded.