13. MULTIPLE-OCTET CODE, ISO/IEC 10646

After all the attempts to force too many characters into too few bits (uniform 8-bit codes, shifts, mixed single/double-byte codes), the inescapable next step was to code every character uniformly with two octets. With the 65536 positions then available, everything should have been possible. A little counting, however, showed that the total number of Chinese and Japanese characters still exceeded this amount. To cope with the problem, the new ISO/IEC 10646, Universal Multiple-Octet Coded Character Set (UCS), is based on 4 octets per character. Because a large part of the world has no need for this generality, a Basic Multilingual Plane (BMP) has been defined, based on 2 octets, in which a limited set of Chinese characters, no (separate) Japanese, and all other scripts can be coded. This BMP has recently been approved and published as Part 1 (ISO/IEC 10646-1:1993). (Japanese will be served by a Part 2.)

In more detail, the coding in ISO/IEC 10646 is based on 4 octets (this is called the "canonical form", UCS-4), but a 2-octet form (UCS-2) is also provided. The first two octets indicate the "group" and the "plane" (of which 256*256=65536 are possible). They may be omitted where it is obvious that the 2-octet form is being used. Part 1 of ISO/IEC 10646 specifies only the Basic Multilingual Plane (BMP), for which octets 1 and 2 are assumed to have the values 00 00. The third octet (or the first of two) is called the "row" (256 in total). The fourth is called the "cell" (256 per row). For all cells of the BMP, ISO/IEC 10646-1 specifies code tables and a list of the character names, occupying a couple of pages for the first 128 cells of a row and another couple for the second 128.
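
The octet layout described above can be illustrated with a small sketch in Python. It is not taken from the standard; the function names split_ucs4 and to_ucs2 are hypothetical and chosen only for this illustration. It assumes that a UCS-4 value is held as a non-negative integer of at most four octets, most significant octet first.

```python
def split_ucs4(code):
    """Split a 4-octet value into (group, plane, row, cell).

    Hypothetical helper for illustration; octet 1 is the most significant.
    """
    if not 0 <= code <= 0xFFFFFFFF:
        raise ValueError("value does not fit in four octets")
    group = (code >> 24) & 0xFF  # octet 1
    plane = (code >> 16) & 0xFF  # octet 2
    row = (code >> 8) & 0xFF     # octet 3 (first octet of the 2-octet form)
    cell = code & 0xFF           # octet 4 (second octet of the 2-octet form)
    return group, plane, row, cell


def to_ucs2(code):
    """Return the 2-octet form (row, cell) of a character in the BMP.

    Raises ValueError when group or plane is not 00, because such a
    character lies outside the BMP and has no 2-octet form.
    """
    group, plane, row, cell = split_ucs4(code)
    if group != 0 or plane != 0:
        raise ValueError("character is outside the Basic Multilingual Plane")
    return bytes([row, cell])


# GREEK CAPITAL LETTER ALPHA: group 00, plane 00, row 03, cell 91 (hexadecimal)
print(split_ucs4(0x00000391))
print(to_ucs2(0x00000391).hex())
```

Dropping the two leading 00 octets is all that distinguishes the 2-octet form from the canonical form for characters of the BMP, which is why the sketch refuses any value whose group or plane octet is not zero.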

ISO/IEC 10646-1 consists of 754 pages and is unmanageable in this form. Two thirds of it are occupied by Chinese characters. For Europe a minimal subset would be desirable. The definition of such a subset is now under way, but a draft available in July 1993 failed to reach consensus on the details, and it is to be feared that, should agreement finally be reached, the resulting document will look very much like a typical committee product.

In any case, it is clear that this subset will include the repertoire of Latin letters from ISO 6937 (and little more), plus Greek and European Cyrillic. With this, a uniform 2-octet coding for the whole of Europe becomes available, without the disadvantages of mixed coding. A model of a practicable subset for both sides of the Ocean is given in the Annexes, including a complete list of the names and codes of the characters contained in the repertoire of ISO 6937.

The publications of the Unicode Consortium have been outdated by the approval of ISO/IEC 10646-1. New versions may appear from this industry group, but they will conform to the coding specified in ISO/IEC 10646-1. We now have to await the first implementations.