7.1.1. The charset parameter
RFC 1521
A critical parameter that may be specified in the Content-Type field
for text/plain data is the character set. This is specified with a
"charset" parameter, as in:
    Content-type: text/plain; charset=us-ascii
Unlike some other parameter values, the values of the charset
parameter are NOT case sensitive. The default character set, which
must be assumed in the absence of a charset parameter, is US-ASCII.
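
As an illustration of the two rules above (case-insensitive values and the US-ASCII default), the following is a minimal sketch that assumes Python's standard `email` package as the parser. The helper name `effective_charset` is hypothetical and is not part of this specification.

```python
from email import message_from_string

# Default mandated by this section when no charset parameter is present.
DEFAULT_CHARSET = "us-ascii"


def effective_charset(raw_message: str) -> str:
    """Return the charset of a message, normalized to lower case.

    Falls back to US-ASCII when the Content-Type field or its charset
    parameter is absent.
    """
    msg = message_from_string(raw_message)
    # get_content_charset() lower-cases the value, which matches the rule
    # that charset values are not case sensitive.
    return msg.get_content_charset(failobj=DEFAULT_CHARSET)
```

With this sketch, `charset=US-ASCII` and `charset=us-ascii` compare equal after normalization, and a message with no Content-Type field at all is treated as US-ASCII.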
The specification for any future subtypes of "text" must specify
whether or not they will also utilize a "charset" parameter, and may
possibly restrict its values as well. When used with a particular
body, the semantics of the "charset" parameter should be identical to
those specified here for "text/plain", i.e., the body consists
entirely of characters in the given charset. In particular, definers
of future text subtypes should pay close attention to the
implications of multibyte character sets for their subtype
definitions.
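
The requirement that a body consist entirely of characters in the given charset can be checked mechanically. The sketch below assumes that Python's codec registry is an acceptable stand-in for the MIME charset names (it is not the IANA registry), and the function name `body_conforms` is hypothetical.

```python
def body_conforms(body: bytes, charset: str) -> bool:
    """Return True if every byte sequence in body is valid in charset.

    Assumption: Python's codec of the same name implements the charset.
    An unknown charset name raises LookupError rather than returning False.
    """
    try:
        body.decode(charset, errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

For example, the octet 0xE9 is valid in ISO-8859-1 but not in US-ASCII, so the same body can conform to one label and not the other.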
This RFC specifies the definition of the charset parameter for the
purposes of MIME to be a unique mapping of a byte stream to glyphs, a
mapping which does not require external profiling information.
An initial list of predefined character set names can be found at the
end of this section. Additional character sets may be registered
with IANA, although the standardization of their use requires the
usual IESG [RFC-1340] review and approval. Note that if the
specified character set includes 8-bit data, a
Content-Transfer-Encoding header field and a corresponding encoding on
the data are
required in order to transmit the body via some mail transfer
protocols, such as SMTP.
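
A sender might decide whether a transfer encoding is needed along the following lines. This is a sketch only: it assumes quoted-printable is the encoding chosen for 8-bit text, uses Python's standard `quopri` module, and ignores other constraints such as line length. The function names are hypothetical.

```python
import quopri


def contains_8bit(body: bytes) -> bool:
    """Return True if any octet in body has its high bit set."""
    return any(octet > 0x7F for octet in body)


def encode_for_7bit_transport(body: bytes) -> tuple:
    """Return (Content-Transfer-Encoding value, encoded body).

    Bodies that are already 7-bit are passed through unchanged; bodies
    with 8-bit octets are quoted-printable encoded.
    """
    if contains_8bit(body):
        return "quoted-printable", quopri.encodestring(body)
    return "7bit", body
```

The returned label is what would go in the Content-Transfer-Encoding header field; the charset parameter still describes the characters of the decoded body.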
The default character set, US-ASCII, has been the subject of some
confusion and ambiguity in the past. Not only were there some
ambiguities in the definition, there have been wide variations in
practice. In order to eliminate such ambiguity and variations in the
future, it is strongly recommended that new user agents explicitly
specify a character set via the Content-Type header field.
"US-ASCII" does not indicate an arbitrary seven-bit character code; it
specifies that the body uses exactly the correspondence of codes to
characters specified in ASCII. National
use variations of ISO 646 [ISO-646] are NOT ASCII and their use in
Internet mail is explicitly discouraged. The omission of the ISO 646
character set is deliberate in this regard. The character set name
of "US-ASCII" explicitly refers to ANSI X3.4-1986 [US-ASCII] only.
The character set name "ASCII" is reserved and must not be used for
any purpose.
NOTE: RFC 821 explicitly specifies "ASCII", and references an
earlier version of the American Standard. Insofar as one of the
purposes of specifying a Content-Type and character set is to
permit the receiver to unambiguously determine how the sender
intended the coded message to be interpreted, assuming anything
other than "strict ASCII" as the default would risk unintentional
and incompatible changes to the semantics of messages now being
transmitted. This also implies that messages containing
characters coded according to national variations on ISO 646, or
using code-switching procedures (e.g., those of ISO 2022), as well
as 8-bit or multiple octet character encodings MUST use an
appropriate character set specification to be consistent with this
specification.
The complete US-ASCII character set is listed in [US-ASCII]. Note
that the control characters including DEL (0-31, 127) have no defined
meaning apart from the combination CRLF (ASCII values 13 and 10)
indicating a new line. Two of the characters have de facto meanings
in wide use: FF (12) often means "start subsequent text on the
beginning of a new page"; and TAB or HT (9) often (though not always)
means "move the cursor to the next available column after the current
position where the column number is a multiple of 8 (counting the
first column as column 0)." Apart from this, any use of the control
characters or DEL in a body must be part of a private agreement
between the sender and recipient. Such private agreements are
discouraged and should be replaced by the other capabilities of this
document.
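
The control-character rules above can be sketched as follows. The tab-stop arithmetic follows the de facto meaning described in the text (columns counted from 0, stops at multiples of 8). The helper names are hypothetical, and treating a bare CR or bare LF as undefined is this sketch's reading of the paragraph, not an additional rule.

```python
def next_tab_stop(column: int, width: int = 8) -> int:
    """Return the next column after `column` that is a multiple of width."""
    return (column // width + 1) * width


def undefined_controls(body: bytes) -> list:
    """Return offsets of control octets with no defined or de facto meaning.

    CRLF pairs (13, 10), TAB (9), and FF (12) are accepted; every other
    octet in 0-31, and DEL (127), is reported.
    """
    offsets = []
    i = 0
    while i < len(body):
        octet = body[i]
        if octet == 13 and i + 1 < len(body) and body[i + 1] == 10:
            i += 2  # CRLF taken together means a new line
            continue
        if (octet < 32 or octet == 127) and octet not in (9, 12):
            offsets.append(i)
        i += 1
    return offsets
```

A cursor at column 0 or column 7 moves to column 8 on TAB, and a cursor at column 8 moves to column 16.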
NOTE: Beyond US-ASCII, an enormous proliferation of character sets
is possible. It is the opinion of the IETF working group that a
large number of character sets is NOT a good thing. We would
prefer to specify a single character set that can be used
universally for representing all of the world's languages in
electronic mail. Unfortunately, existing practice in several
communities seems to point to the continued use of multiple
character sets in the near future. For this reason, we define
names for a small number of character sets for which a strong
constituent base exists.
The defined charset values are:

    US-ASCII -- as defined in [US-ASCII].

    ISO-8859-X -- where "X" is to be replaced, as necessary, for the
        parts of ISO-8859 [ISO-8859]. Note that the ISO 646
        character sets have deliberately been omitted in favor of
        their 8859 replacements, which are the designated character
        sets for Internet mail. As of the publication of this
        document, the legitimate values for "X" are the digits 1
        through 9.
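
The list above expands to ten concrete names. A minimal sketch of a membership test follows; the constant and function names are hypothetical, and the comparison is case-insensitive because charset values are not case sensitive.

```python
# US-ASCII plus ISO-8859-1 through ISO-8859-9, stored in lower case.
PREDEFINED_CHARSETS = frozenset(
    ["us-ascii"] + [f"iso-8859-{part}" for part in range(1, 10)]
)


def is_predefined_charset(name: str) -> bool:
    """Return True if name is one of the charset values defined here."""
    return name.strip().lower() in PREDEFINED_CHARSETS
```

Names outside this set, such as ISO-8859-10 or the reserved name "ASCII", are not accepted by this check.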
The character sets specified above are the ones that were relatively
uncontroversial during the drafting of MIME. This document does not
endorse the use of any particular character set other than US-ASCII,
and recognizes that the future evolution of world character sets
remains unclear. It is expected that in the future, additional
character sets will be registered for use in MIME.
Note that the character set used, if anything other than US-ASCII,
must always be explicitly specified in the Content-Type field.
No other character set name may be used in Internet mail without the
publication of a formal specification and its registration with IANA,
or by private agreement, in which case the character set name must
begin with "X-".
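
The naming rules in this section can be summarized as a small classifier. This is a sketch under stated assumptions: the category labels and the function name are invented for illustration, the "X-" prefix is matched case-insensitively because charset values are not case sensitive, and no lookup against the actual IANA registry is attempted.

```python
import re


def classify_charset_name(name: str) -> str:
    """Classify a charset name according to the rules of this section."""
    lowered = name.strip().lower()
    if lowered == "ascii":
        # "ASCII" is reserved and must not be used for any purpose.
        return "reserved"
    if lowered.startswith("x-"):
        # Private agreement between sender and recipient.
        return "private"
    if lowered == "us-ascii" or re.fullmatch(r"iso-8859-[1-9]", lowered):
        return "predefined"
    # Anything else needs a published specification and IANA registration.
    return "requires-registration"
```

A mail agent could use such a check to refuse to emit a name in the "reserved" category and to warn on names in the "requires-registration" category.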
Implementors are discouraged from defining new character sets for
mail use unless absolutely necessary.
The "charset" parameter has been defined primarily for the purpose of
textual data, and is described in this section for that reason.
However, it is conceivable that a charset value might also be
specified for non-textual data for some purpose, in which case the same
syntax and values should be used.
In general, mail-sending software must always use the "lowest common
denominator" character set possible. For example, if a body contains
only US-ASCII characters, it must be marked as being in the US-ASCII
character set, not ISO-8859-1, which, like all the ISO-8859 family of
character sets, is a superset of US-ASCII. More generally, if a
widely-used character set is a subset of another character set, and a
body contains only characters in the widely-used subset, it must be
labeled as being in that subset. This will increase the chances that
the recipient will be able to view the mail correctly.
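
The "lowest common denominator" rule can be sketched as trying candidate character sets from narrowest to widest and labeling the body with the first one that can represent it. The function name and the default candidate list are assumptions for illustration, and Python's codecs stand in for the charset definitions.

```python
def narrowest_charset(text: str, candidates=("us-ascii", "iso-8859-1")) -> str:
    """Return the first candidate charset able to encode all of text.

    Candidates must be ordered from subset to superset so that a body
    containing only US-ASCII characters is labeled US-ASCII.
    """
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue
        return name
    raise ValueError("no candidate charset can represent this text")
```

Under this sketch, a body containing only ASCII letters is labeled `us-ascii` even though `iso-8859-1` could also represent it, while a body containing "é" is labeled `iso-8859-1`.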