Extended UCS-2 Encoding Form (UTF-16)


The basic Unicode character repertoire and UCS-2 encoding form are based on the Basic Multilingual Plane (BMP) of ISO/IEC 10646. This plane comprises the first 65,536 code positions of ISO/IEC 10646's canonical code space (UCS-4, a 32-bit code space). Because the Unicode Consortium has decided to keep Unicode synchronized with ISO/IEC 10646, the Unicode Character Set may someday require access to planes of 10646 outside the BMP. To accommodate this eventuality, the Unicode Consortium proposed an extension technique for encoding non-BMP characters in a UCS-2 Unicode string. This proposal was entitled UCS-2E, for extended UCS-2. The technique is now referred to as UTF-16 (for UCS Transformation Format 16 Bit Form).

The current definition of UTF-16 is found in document ISO/IEC JTC1/SC2/WG2 N 1035 (1994-08-01), entitled "ISO/IEC 10646-1 Proposed Draft Amendment 1". This document contains the draft for "Amendment 1: UCS Transformation Format 16 (UTF-16)", an amendment proposed for ISO/IEC 10646-1:1993.

Basically, UTF-16 allows the inclusion of certain UCS-4 codes in a UCS-2 encoded string. It does this by reserving 1024 high-half zone codes from the BMP (D800 - DBFF) and 1024 low-half zone codes from the BMP (DC00 - DFFF). Using a contiguous pair of codes as:

<high-half zone code> <low-half zone code>

gives one the ability to encode 1024 * 1024 = 1,048,576 additional code points from UCS-4 in a manner compatible with UCS-2 encoded strings. These codes are then mapped to and from the 16 planes numbered 1-10 (hexadecimal) of group 0. Planes 1-E will be available for standard encodings, while planes F and 10 will be for private use. The remaining planes of group 0 and all planes of groups 1-7F will be reserved.
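The pair arithmetic can be sketched in C as follows. The type names UCS2 and UCS4 are borrowed from the examples in this document, but their definitions in terms of <stdint.h> types, and all function names, are assumptions made for illustration. The formula is the one consistent with the statement that the pairs cover planes 1-10: the high-half code supplies the upper 10 bits of a 20-bit offset from 0x10000, and the low-half code supplies the lower 10 bits.

```c
#include <stdint.h>

typedef uint16_t UCS2;  /* assumed definition: one 16-bit code element */
typedef uint32_t UCS4;  /* assumed definition: one 32-bit code value   */

/* Nonzero if c lies in the high-half zone D800-DBFF. */
int is_high_half(UCS2 c) { return c >= 0xD800 && c <= 0xDBFF; }

/* Nonzero if c lies in the low-half zone DC00-DFFF. */
int is_low_half(UCS2 c) { return c >= 0xDC00 && c <= 0xDFFF; }

/* Combine a <high-half, low-half> pair into a UCS-4 value.
   The caller is expected to have checked both zones first. */
UCS4 pair_to_ucs4(UCS2 hi, UCS2 lo)
{
    return (UCS4)0x10000
         + ((UCS4)(hi - 0xD800) << 10)
         + (UCS4)(lo - 0xDC00);
}

/* Split a UCS-4 value in the range 0x10000-0x10FFFF into its pair. */
void ucs4_to_pair(UCS4 c, UCS2 *hi, UCS2 *lo)
{
    UCS4 offset = c - 0x10000;              /* 20-bit offset */
    *hi = (UCS2)(0xD800 + (offset >> 10));  /* upper 10 bits */
    *lo = (UCS2)(0xDC00 + (offset & 0x3FF));/* lower 10 bits */
}
```

With these definitions, the pair <DBFF, DFFF> maps to 0x10FFFF, the last code position of plane 10.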

This means that there will be 14 * 65536 = 917504 code positions for new standardized characters and 2 * 65536 = 131072 additional private use code positions. Given these numbers, it is rather unlikely that any other portion of the UCS-4 encoding space will be employed.

Any Unicode system can be qualified as to its support for UTF-16 as follows:

UCS-2 only
    interpret no pairs, no pair integrity

weak UTF-16
    interpret pairs, no pair integrity

aware UTF-16
    interpret no pairs, pair integrity

strong UTF-16
    interpret pairs, pair integrity

The four degrees of support specified above are based on how UCS-2 systems/apps treat UTF-16 data, according to whether they preserve pair integrity (e.g., don't delete one element of a pair or insert something between pairs) and whether they interpret such pairs as UCS-4 elements.
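As an illustration of what "pair integrity" means in practice, the following C sketch checks whether a buffer of code elements contains only well-formed pairs. The function name and the UCS2 typedef are assumptions for this example. The check shown is one plausible reading of pair integrity (every high-half code is immediately followed by a low-half code, and no low-half code stands alone), not a definition taken from the amendment.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UCS2;  /* assumed definition: one 16-bit code element */

/* Returns 1 if every high-half code (D800-DBFF) in s[0..n-1] is
   immediately followed by a low-half code (DC00-DFFF) and no low-half
   code appears without a preceding high-half code; returns 0 otherwise. */
int has_pair_integrity(const UCS2 *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF) {
            /* A high-half code must be followed by a low-half code. */
            if (i + 1 >= n || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return 0;
            i += 2;
        } else if (s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            return 0;   /* low-half code with no high-half before it */
        } else {
            i++;        /* ordinary BMP code element */
        }
    }
    return 1;
}
```

A system that runs such a check after every edit, but never converts pairs to UCS-4 values, would fall under "aware UTF-16" in the classification above.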

Examples

The following example shows a sequence of Unicode code elements in their standard UCS-2 form which employs the UTF-16 extension technique.

UCS2    s1[] =
{
  0x0041,       // 'A'
  0x0020,       // ' '
  0xD800,       // high-half zone part
  0xDC00,       // low-half zone part
  0xD800,       // etc.
  0xDC01,
  ...
};

This example text contains 4 coded characters. The first two are BMP characters coded with a single UCS-2 code value; the last two are non-BMP characters coded with two UCS-2 code values each, a high-half code and a low-half code. Translating this to UCS-4 code values would produce the following:

UCS4    s2[] =
{
  0x00000041,   // 'A'
  0x00000020,   // ' '
  0x00010000,   // hieroglyph foo
  0x00010001,   // hieroglyph bar
  ...
};

Assuming for a moment that plane 1 began with an Egyptian Hieroglyph code block, the above UTF-16 string contains a transformed encoding of the first two characters in plane 1.
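The translation from the UCS-2 form to the UCS-4 form shown above can be sketched as a single pass over the code elements. This is a minimal illustration, not a reference implementation: the function name, the typedefs, and the decision to copy unpaired half-zone codes through unchanged are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UCS2;  /* assumed definition: one 16-bit code element */
typedef uint32_t UCS4;  /* assumed definition: one 32-bit code value   */

/* Converts the n code elements in src to UCS-4 values in dst, which
   must have room for at least n values. Returns the number of UCS-4
   values written. Unpaired half-zone codes are copied unchanged
   (an assumption; a stricter converter might reject them). */
size_t utf16_to_ucs4(const UCS2 *src, size_t n, UCS4 *dst)
{
    size_t i = 0, out = 0;
    while (i < n) {
        UCS2 c = src[i++];
        if (c >= 0xD800 && c <= 0xDBFF && i < n &&
            src[i] >= 0xDC00 && src[i] <= 0xDFFF) {
            UCS2 lo = src[i++];   /* consume the low-half code too */
            dst[out++] = (UCS4)0x10000
                       + ((UCS4)(c - 0xD800) << 10)
                       + (UCS4)(lo - 0xDC00);
        } else {
            dst[out++] = c;       /* BMP code element, widened */
        }
    }
    return out;
}
```

Applied to the six code elements listed in s1, this yields the four values listed in s2.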

Implementation Issues (Informational)

As the above example makes clear, UTF-16 is essentially a variable-length encoding technique that supports the characters in the BMP and the 16 planes immediately following the BMP. The introduction of a variable-length encoding raises a number of important issues for implementations. These issues are similar to those encountered in common double-byte character systems, in which single-byte and double-byte characters are mixed together.

The primary issue regarding the processing of such an encoding form is whether to use the variable-length form as an internal processing form. If it is used for internal processing, then care must be taken to ensure the integrity of variable-length sequences. For example, an application should not delete only one element of a sequence, nor should it allow the insertion of code elements into the middle of a sequence.
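One way an editing operation might preserve sequence integrity is to widen a deletion so that it always removes a whole pair. The C sketch below is only an illustration under that assumption; the function name, typedef, and in-place buffer layout are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t UCS2;  /* assumed definition: one 16-bit code element */

/* Deletes the coded character covering code element index pos in
   s[0..n-1]. If pos falls on either half of a pair, both halves are
   removed. Returns the new number of code elements. */
size_t delete_char_at(UCS2 *s, size_t n, size_t pos)
{
    size_t start = pos, end;
    if (pos >= n)
        return n;                       /* nothing to delete */
    /* If pos is the low half of a pair, back up to the high half. */
    if (pos > 0 &&
        s[pos] >= 0xDC00 && s[pos] <= 0xDFFF &&
        s[pos - 1] >= 0xD800 && s[pos - 1] <= 0xDBFF)
        start = pos - 1;
    end = start + 1;
    /* If start is the high half of a pair, include the low half. */
    if (s[start] >= 0xD800 && s[start] <= 0xDBFF && end < n &&
        s[end] >= 0xDC00 && s[end] <= 0xDFFF)
        end++;
    memmove(s + start, s + end, (n - end) * sizeof s[0]);
    return n - (end - start);
}
```

Deleting at either index 1 or index 2 of { 0x0041, 0xD800, 0xDC00, 0x0020 } leaves { 0x0041, 0x0020 } in both cases.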

In addition to requiring maintenance of variable-length sequence integrity, the variable-length code generally requires more complex string processing functions to be used internally. For example, in order to count the number of characters encoded in a UTF-16 string, it is necessary to scan the entire string. This is in contrast to the pure UCS-2 approach, where the number of characters is equal to the string's byte length divided by two.
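The difference between the two counting methods can be sketched in C. The function names and the UCS2 typedef are assumptions made for this example. The UTF-16 count skips any low-half code that directly follows a high-half code, since that character has already been counted.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UCS2;  /* assumed definition: one 16-bit code element */

/* Pure UCS-2: the character count follows directly from the length. */
size_t ucs2_char_count(size_t byte_length)
{
    return byte_length / 2;
}

/* UTF-16: every code element must be examined. A low-half code that
   directly follows a high-half code completes a character that was
   already counted, so it is skipped. */
size_t utf16_char_count(const UCS2 *s, size_t n)
{
    size_t i, count = 0;
    for (i = 0; i < n; i++) {
        if (i > 0 &&
            s[i] >= 0xDC00 && s[i] <= 0xDFFF &&
            s[i - 1] >= 0xD800 && s[i - 1] <= 0xDBFF)
            continue;
        count++;
    }
    return count;
}
```

For the six code elements 0041 0020 D800 DC00 D800 DC01 (12 bytes), the UCS-2 calculation gives 6 while the UTF-16 scan gives 4.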

Some of the options for a processing code for implementations that wish to support strong UTF-16 capabilities are:

    1. UTF-16
    2. UCS-4
    3. UCS-2 with dynamic private character assignment

The first choice requires using an inherently variable length encoding internally. This complicates software, reduces performance, and is subject to possible weakened data integrity. The UCS-4 option imposes a significant memory burden for the majority of text content, given that few non-BMP characters will generally be used in a given text (if used at all).

The third option shown above, UCS-2 with dynamic private character assignment, requires some additional investment in software to handle remapping at import and export time; however, it has the advantage of retaining a UCS-2 based internal processing code. This approach is limited by the size of the private use area (6,400 code positions). Depending on how this support is implemented, more than 6,400 non-BMP characters might require remapping, in which case the UCS-2 form could not be used. However, the likelihood of reaching this limit is, in general, quite small.
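A dynamic private character assignment scheme might look roughly like the following C sketch. Everything here is hypothetical: the PuaMap structure, the function names, the linear search, and the choice of 0xE000 as the first private use code position (the text above states only the size of the area, 6,400 positions). The sketch also assumes that imported text does not itself contain private use characters.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UCS2;  /* assumed definition: one 16-bit code element */
typedef uint32_t UCS4;  /* assumed definition: one 32-bit code value   */

#define PUA_FIRST 0xE000u /* assumed start of the private use area */
#define PUA_COUNT 6400u   /* size of the area, as stated in the text */

/* Hypothetical remapping table: assigned[k] holds the non-BMP value
   currently represented internally by the code PUA_FIRST + k. */
typedef struct {
    UCS4   assigned[PUA_COUNT];
    size_t used;
} PuaMap;

/* Import: returns the private use code standing in for the non-BMP
   value c, assigning a new one on first sight. Returns 0 when all
   6,400 positions are taken. */
UCS2 pua_import(PuaMap *m, UCS4 c)
{
    size_t k;
    for (k = 0; k < m->used; k++)
        if (m->assigned[k] == c)
            return (UCS2)(PUA_FIRST + k);
    if (m->used == PUA_COUNT)
        return 0;                        /* limit exceeded */
    m->assigned[m->used] = c;
    return (UCS2)(PUA_FIRST + m->used++);
}

/* Export: returns the non-BMP value for a remapped internal code,
   or the code itself if it was never assigned. */
UCS4 pua_export(const PuaMap *m, UCS2 c)
{
    if (c >= PUA_FIRST && (size_t)(c - PUA_FIRST) < m->used)
        return m->assigned[c - PUA_FIRST];
    return c;
}
```

The table must start with used set to 0. Internally every character then occupies exactly one UCS-2 code element, at the cost of a lookup at import and export time. A production version would likely replace the linear search with a hash table or sorted index.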


Copyright © 1994 Unicode, Inc.