UCS Transformation Format 16 (UTF-16)

Preface
Foreword
Introduction
4 Definitions
5 General structure of the UCS
7 Special features of the UCS
8 The Basic Multilingual Plane
9 Other Planes

9.1 Planes reserved for future standardization
9.2 Planes accessible by UTF-16

11 Private Use groups and planes
14.1 Two-octet BMP form
Annex O

O.1 Outline of the algorithm
O.2 Notation
O.3 From UCS to UTF-16 format
O.4 From UTF-16 to UCS format
O.5 Identification of UTF-16
O.6 Unpaired RC-elements: Interpretation by receiving devices
Advisory Notes

Notes

Preface

The following text constitutes the contents of document ISO/IEC JTC1/SC2/WG2 N 1035, dated 1994-08-01, by WG2 Project Editor, Mark Davis, which specifies a normative Annex O to be added by the first proposed draft amendment (PDAM 1) to ISO/IEC 10646-1:1993.

As of this date (1994-12-01), the PDAM 1 is being balloted by ISO membership (in an extended balloting period). The PDAM 1 may or may not be approved and/or modified prior to becoming part of ISO/IEC 10646 (assuming it is approved).

The text below has been slightly edited to accommodate HTML format and limitations (e.g., footnotes have been placed at the end of the text).

Foreword

ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) form the specialised system for worldwide standardisation. National bodies that are members of ISO or IEC participate in the development of International Standards through technical committees established by the respective organisation to deal with particular fields of technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international organisations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the work.

In the field of information technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1. Draft International Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as an International Standard requires approval by at least 75% of the national bodies casting a vote.

Amendment 1 to International Standard ISO/IEC 10646-1 was prepared by Joint Technical Committee ISO/IEC JTC1, Information technology.

Annex O forms an integral part of this amendment.

Introduction

ISO/IEC 10646 specifies the Universal Multiple-Octet Coded Character Set (UCS). It is applicable to the representation, transmission, interchange, processing, storage, input and presentation of the written form of the languages (scripts) of the world as well as additional symbols.

This amendment to ISO/IEC 10646 specifies an additional transformation format, UTF-16. This format transforms the coded representation of graphic characters in this coded character set into a 2-octet form that permits the representation of over a million graphic characters of UCS-4.

4 Definitions

ISO/IEC 10646-1, section 4 applies with the following additions:

4.34 high-half zone: a set of cells reserved for use in UTF-16 (see Annex O); an RC-element corresponding to any of these cells may be used as the first of a pair of RC-elements which represents a character from a plane other than the BMP.
4.35 low-half zone: a set of cells reserved for use in UTF-16 (see Annex O); an RC-element corresponding to any of these cells may be used as the second of a pair of RC-elements which represents a character from a plane other than the BMP.
4.36 RC-element: a two-octet sequence comprising the R-octet and the C-octet (see 6.2) from the four octet sequence that corresponds to a cell in the coding space of this coded character set.
4.37 high-half RC-element: an RC-element from the high-half zone.
4.38 low-half RC-element: an RC-element from the low-half zone.
4.39 unpaired RC-element: an RC-element in a CC-data-element that is either:

5 General structure of the UCS

ISO/IEC 10646-1, section 5 applies with the text amended to read:

Strike out the following text:

The 32 planes with Plane-octet values E0 to FF of Group 00 are for Private Use. The 32 groups with Group-octet values 60 to 7F of this coded character set are also for Private Use.

Add the following text:

The planes that are reserved for Private Use are specified in Section 11.
A UCS Transformation Format (UTF-16) is specified in Annex O which can be used to represent characters from 16 planes, additional to the BMP, in a form that is compatible with the two-octet BMP form.

7 Special features of the UCS

ISO/IEC 10646-1, section 7 applies with the text in paragraph 2 amended to read:

2. Code positions to which a character is not allocated, except for the positions reserved for Private Use characters or for transformation formats, are reserved for future standardisation and shall not be used for any other purpose. Future editions of ISO/IEC 10646 will not allocate any characters to code positions reserved for Private Use characters or for transformation formats.

8 The Basic Multilingual Plane

ISO/IEC 10646-1, section 8 applies with the text amended to read:

Strike out the following text:

O-zone: code positions A000 to DFFF

Add the following text:

O-zone: code positions A000 to D7FF
S-zone: code positions D800 to DFFF

Strike out the following text:

The O-zone is reserved for future standardisation.

Add the following text:

The O-zone is reserved for future standardisation. The S-zone is reserved for the use of UTF-16 (see Annex O).

9 Other planes

ISO/IEC 10646-1, section 9 is amended to read:

9.1 Planes reserved for future standardization

Planes 11 to FF in Group 00 and planes 00 to FF in Groups 01 to 7F are reserved for future standardisation, and thus those code positions shall not be used for any other purpose.

9.2 Planes accessible by UTF-16

Code positions in planes 01 to 10 in group 00 may be transformed to the UTF-16 representation (see Annex O). This representation is compatible with the two-octet BMP form of UCS-2 (see 14.1).

Code positions in planes 11 to FF in group 00 or in other groups cannot be transformed to the UTF-16 representation.

11 Private Use groups and planes

ISO/IEC 10646-1, section 11 applies with the text amended to read:

Strike out the following text:

The code positions of 32 planes from Plane E0 to Plane FF of Group 00 shall be for Private Use. The code positions of the 32 groups from Group 60 to Group 7F shall be for Private Use.

Add the following text:

The code positions of 2 planes, Plane 0F and Plane 10 of Group 00, shall be for Private Use.

14.1 Two-octet BMP form

ISO/IEC 10646-1, section 14.1 applies with the text amended to add the following at the end of paragraph 2:

(i.e. its RC-element).

Annex O (Normative)

UCS transformation format 16 (UTF-16)

The following method transforms the coded representation of over a million graphic characters of UCS-4 into a form that is compatible with the two-octet BMP form of UCS-2 (section 14.1). This permits the coexistence of those UCS-4 characters within coded character data that is in accordance with UCS-2.

In UTF-16 each graphic character from the UCS-2 repertoire retains its UCS-2 coded representation. In addition, the coded representation of any character from a single contiguous block of 16 Planes in Group 00 (1,048,576 code positions) is transformed to pairs of two-octet sequences, where each sequence corresponds to a cell in a single contiguous block of 8 Rows in the BMP (2,048 code positions). These codes are reserved for the use of this transformation method, and shall not be allocated for any other purpose.

O.1 Outline of the algorithm

The algorithm can be summarized as follows:

The high-half zone shall be the 4 rows D8 to DB of the BMP, i.e., the 1,024 cells whose code positions are from D800 through DBFF.
The low-half zone shall be the 4 rows DC to DF of the BMP, i.e., the 1,024 cells whose code positions are from DC00 through DFFF.
All cells in the high-half zone and the low-half zone shall be permanently reserved for the use of this transformation method.
The two-octet sequence comprising the R-octet and the C-octet for any cell in the high-half zone or the low-half zone shall be known as an S-zone RC-element.
In UTF-16, any UCS character from the BMP shall be represented by its UCS-2 coded representation as specified by the body of this international standard.
In UTF-16, any UCS character whose UCS-4 coded representation is in the range 0001 0000 to 0010 FFFF shall be represented by a sequence of two S-zone RC-elements, of which the first is a high-half RC-element, and the second is a low-half RC-element.

The algorithm for mapping from UCS-4 to UTF-16 for these characters is given in O.3, and the algorithm for the reverse mapping is shown in O.4.
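The zone boundaries listed above can be checked with a short sketch. The following Python fragment is illustrative only and is not part of the amendment; the names classify_rc_element, HIGH_CELLS, LOW_CELLS and PAIR_COUNT are hypothetical. It assumes an RC-element is held as an integer in the range 0000 to FFFF.

```python
# Illustrative sketch only; all names here are hypothetical.
HIGH_HALF_FIRST, HIGH_HALF_LAST = 0xD800, 0xDBFF  # rows D8 to DB of the BMP
LOW_HALF_FIRST, LOW_HALF_LAST = 0xDC00, 0xDFFF    # rows DC to DF of the BMP


def classify_rc_element(unit: int) -> str:
    """Return 'high-half', 'low-half' or 'other' for one two-octet RC-element."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("an RC-element is a two-octet value")
    if HIGH_HALF_FIRST <= unit <= HIGH_HALF_LAST:
        return "high-half"
    if LOW_HALF_FIRST <= unit <= LOW_HALF_LAST:
        return "low-half"
    return "other"


# Each zone holds 400 (hex) cells, so the pairs address
# 400 * 400 = 10 0000 (hex) code positions, i.e. 16 planes.
HIGH_CELLS = HIGH_HALF_LAST - HIGH_HALF_FIRST + 1
LOW_CELLS = LOW_HALF_LAST - LOW_HALF_FIRST + 1
PAIR_COUNT = HIGH_CELLS * LOW_CELLS
```

Because the two zones do not overlap, the role of an RC-element follows from its value alone; no neighbouring element needs to be inspected.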

O.2 Notation

The notation is similar to the notation of Annex G.2.

All numbers are in hexadecimal notation.
Double-octet boundaries in the transformed text are indicated with semicolons.
The symbol "%" indicates the modulo operation, e.g.: x % y = x modulo y.
The symbol "/" indicates the integer division operation, e.g.: 7 / 3 = 2.
Superscripting (written here as "^") indicates the power-of operation, e.g.: 2^3 = 8.
Precedence is "^" > "/" > "%", e.g.: x / y^z % w = ((x / (y^z)) % w).

O.3 From UCS to UTF-16 format

UCS                 UTF-16
x =  0000 0000..    x;
     0000 FFFD (see Note 1)

x =  0001 0000..    y; z;
     0010 FFFF
                    where
                    y = ((x - 0001 0000) / 400) + D800
                    z = ((x - 0001 0000) % 400) + DC00

x >= 0011 0000   unmapped

Example:

The UCS-4 sequence

    [0000 0048] [0000 0069] [0001 0000] 
          [0000 0021] [0000 0021]

represents "Hi<0001 0000>!!". It is mapped to UTF-16 as:

    [0048] [0069] [D800] [DC00] [0021] [0021]

Notice that if interpreted as UCS-2, this sequence consists of

    "Hi<high zone element><low zone element>!!"
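The mapping of O.3 can be sketched in Python as follows. This is a minimal illustration, not a normative implementation; the function names ucs4_to_utf16 and encode_sequence are hypothetical. As a choice of this sketch, values that the table does not list (the S-zone values of Note 1, FFFE, FFFF, and anything from 0011 0000 upward) raise ValueError.

```python
from typing import Iterable, List


def ucs4_to_utf16(x: int) -> List[int]:
    """Map one UCS-4 value to one or two RC-elements, following the O.3 table."""
    if 0x0000 <= x <= 0xFFFD and not 0xD800 <= x <= 0xDFFF:
        return [x]  # BMP character: keeps its UCS-2 coded representation
    if 0x10000 <= x <= 0x10FFFF:
        offset = x - 0x10000
        y = offset // 0x400 + 0xD800  # high-half RC-element
        z = offset % 0x400 + 0xDC00   # low-half RC-element
        return [y, z]
    # Not listed in the table; rejecting it is an assumption of this sketch.
    raise ValueError("value %X is not mapped by UTF-16" % x)


def encode_sequence(values: Iterable[int]) -> List[int]:
    """Map a sequence of UCS-4 values to a flat list of RC-elements."""
    out: List[int] = []
    for x in values:
        out.extend(ucs4_to_utf16(x))
    return out
```

Applied to the example above, encode_sequence([0x48, 0x69, 0x10000, 0x21, 0x21]) yields [0x48, 0x69, 0xD800, 0xDC00, 0x21, 0x21].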

O.4 From UTF-16 to UCS format

UTF-16              UCS
x = 0000..D7FF;     x

x = D800..DBFF;     ((x - D800) * 400 + (y - DC00)) + 0001 0000
y = DC00..DFFF;

x = E000..FFFD;     x

Example:

The UTF-16 sequence

    [0048] [0069] [D800] [DC00] [0021] [0021]

is mapped back to UCS as the coded representation of "Hi<0001 0000>!!", i.e.:

    [0000 0048] [0000 0069] [0001 0000] 
          [0000 0021] [0000 0021]
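The reverse mapping of O.4 can be sketched in the same way. The function name utf16_to_ucs4 is hypothetical, and the decision to raise ValueError on any RC-element that the table does not cover (including an unpaired RC-element) is an assumption of this sketch rather than a requirement stated in this clause.

```python
from typing import List, Sequence


def utf16_to_ucs4(units: Sequence[int]) -> List[int]:
    """Map a sequence of RC-elements back to UCS-4 values, following the O.4 table."""
    out: List[int] = []
    i = 0
    while i < len(units):
        x = units[i]
        if 0xD800 <= x <= 0xDBFF:
            # A high-half RC-element must be followed by a low-half RC-element.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high-half RC-element at index %d" % i)
            y = units[i + 1]
            out.append((x - 0xD800) * 0x400 + (y - 0xDC00) + 0x10000)
            i += 2
        elif 0x0000 <= x <= 0xD7FF or 0xE000 <= x <= 0xFFFD:
            out.append(x)  # BMP character: value is unchanged
            i += 1
        else:
            # Lone low-half RC-element, FFFE, FFFF, or a value outside two octets.
            raise ValueError("RC-element %X at index %d is not mapped" % (x, i))
    return out
```

Applied to the example above, utf16_to_ucs4([0x48, 0x69, 0xD800, 0xDC00, 0x21, 0x21]) yields [0x48, 0x69, 0x10000, 0x21, 0x21].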

O.5 Identification of UTF-16

When the escape sequences from ISO 2022 are used, the identification of the UTF-16 can be given by a designation sequence: ESC 02/05 04/03 [Ed. Note: identify correct sequence].

If such an escape sequence appears within a CC-data-element conforming to ISO/IEC 10646, it shall be padded in accordance with clause 16. When the escape sequences from ISO 2022 are used, the identification of the return from UTF-16 to the coding system of ISO 2022 shall be by the escape sequence ESC 02/05 04/00.

O.6 Unpaired RC-elements: Interpretation by receiving devices

According to O.1 an unpaired RC-element is not in conformance with the requirements of UTF-16. It cannot be transmitted by an originating device that is in conformance with the requirements of UTF-16, unless that originating device is retransmitting an unpaired RC-element that it previously received.

If a receiving device receives an unpaired RC-element because of error conditions either:

in the originating device, or
in the interchange between the originating and the receiving device, or
in the receiving device itself,

then it shall interpret that unpaired RC-element in the same way that it interprets a character that is outside the adopted subset that has been identified for the device (see 2.3c).
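A receiving device first has to locate unpaired RC-elements before it can interpret them as described. The following Python sketch shows one way to do so. It assumes that an unpaired RC-element is a high-half RC-element not immediately followed by a low-half RC-element, or a low-half RC-element not immediately preceded by a high-half RC-element; that reading is inferred from the Advisory Notes and is an assumption here. The function name find_unpaired is hypothetical.

```python
from typing import List, Sequence


def find_unpaired(units: Sequence[int]) -> List[int]:
    """Return the indices of unpaired RC-elements (assumed definition, see above)."""
    unpaired: List[int] = []
    i = 0
    while i < len(units):
        x = units[i]
        if 0xD800 <= x <= 0xDBFF:
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair: skip both elements
                continue
            unpaired.append(i)  # high-half with no low-half after it
        elif 0xDC00 <= x <= 0xDFFF:
            unpaired.append(i)  # low-half with no high-half before it
        i += 1
    return unpaired
```

A device could then treat each reported position as it treats any character outside its adopted subset, for instance by displaying a box.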

Advisory Notes

The high-half zone and low-half zone are assigned to separate ranges so that the function of every two-octet unit is always immediately identifiable from its value, without regard to context. Since a high-half RC-element followed by a low-half RC-element constitutes an unambiguous pair, the only possible type of syntactically malformed sequence is an unpaired RC-element.

Example:

A receiving/originating device which only handles the Latin-1 repertoire and uses boxes to display missing glyphs would display:

    "The Greek letter <alpha> corresponds to <hieroglyphicHigh>."

as:

    "The Greek letter <box> corresponds to <box>."

UTF-16 is designed to be compatible with the UCS-2 two-octet BMP Form (section 14.1). For example, the same sequence [0048] [0069] [D800] [DC00] [0021] [0021] may still be interpreted in UCS-2 as the coded representation of "Hi<unrecognized><unrecognized>!!"

That is, interpreted in UCS-4 as:

    [0000 0048] [0000 0069] [0000 D800] 
    [0000 DC00] [0000 0021] [0000 0021]

This form of compatibility is possible because the S-zone RC-elements are either interpreted according to UTF-16 or as unrecognized characters. This allows originating devices to transmit UTF-16 data even if the receiver interprets that text as UCS-2 text.

Implementations may choose to use UTF-16 as an internal text format. There are two primary issues for such devices:

Does the implementation interpret (i.e., process according to the assigned semantics) some subset of the pairs of high-half and low-half RC-elements, e.g., render the pair as the intended single character?
Does the implementation guarantee the integrity of every pair of successive high and low-half RC-elements, e.g., never separate such pairs in operations such as string truncation, insertion, or other modifications of the coded character sequence?

These issues give rise to four degrees of implementation:

(U) UCS-2 implementations: Interpret no pairs. Do not guarantee integrity of pairs.
(W) Weak UTF-16 implementations: Interpret a non-null subset of pairs. Do not guarantee integrity of pairs.
(A) Aware UTF-16 implementations: Interpret no pairs. Guarantee integrity of pairs.
(S) Strong UTF-16 implementations: Interpret a non-null subset of pairs. Guarantee integrity of pairs.

Example:

The following sentence could be displayed in four different ways, assuming that both the weak and strong implementations have Phoenician fonts but no hieroglyphic fonts:

    "The Greek letter <alpha> corresponds to <hieroglyphicHigh><hieroglyphicLow>
    and to <phoenicianHigh><phoenicianLow>."

U: "The Greek letter <alpha> corresponds to <box><box> and to <box><box>."
W: "The Greek letter <alpha> corresponds to <box><box> and to <Phoenician>."
A: "The Greek letter <alpha> corresponds to <box> and to <box>."
S: "The Greek letter <alpha> corresponds to <box> and to <Phoenician>."
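Guaranteeing the integrity of pairs, as Aware and Strong implementations do, mainly affects operations that cut or splice the coded character sequence. The following Python sketch illustrates one such operation, string truncation. The function name truncate_preserving_pairs is hypothetical, and shortening the result by one RC-element when the cut would fall inside a pair is a design choice of this sketch, not a rule taken from this text.

```python
from typing import List, Sequence


def truncate_preserving_pairs(units: Sequence[int], max_units: int) -> List[int]:
    """Keep at most max_units RC-elements without separating a high/low pair."""
    if max_units < 0:
        raise ValueError("max_units must not be negative")
    cut = min(max_units, len(units))
    # If the cut falls between a high-half and its low-half RC-element,
    # move it back by one so that the whole pair is dropped.
    if (0 < cut < len(units)
            and 0xD800 <= units[cut - 1] <= 0xDBFF
            and 0xDC00 <= units[cut] <= 0xDFFF):
        cut -= 1
    return list(units[:cut])
```

For example, truncating [0x48, 0xD800, 0xDC00, 0x21] to two RC-elements yields [0x48] rather than a sequence ending in an unpaired high-half RC-element.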

Notes

Note 1

Notice that values from 0000 D800 to 0000 DFFF are reserved, and would never be mapped.

Original document is located at http://www.stonehand.com/unicode/standard/utf16.html