15. CHARACTER AS OBJECT, IDENTIFICATION METHODS, ISO 7350

Since the days of Gutenberg and Laurens Janszoon Coster, the idea has been accepted that a letter, just like a piece of wood or lead, can be manipulated in isolation. The old monks liked to connect letters to each other in their manuscripts (the so-called ligatures) to save space and to ease writing, or they put letters above or below others and applied abbreviations. Once a type case was ready at hand, the need for these tricks vanished. The endless variation of shapes, such as those for initial capitals, was also much reduced. The master printer's case contained only enough slots for a limited number of sizes. Type was ordered from a specialist, in one style or another, of which only a few were selected.

When the first mechanical administrations were set up, the variety had decreased still further: a single "font", in capitals only, and that was all. This was regretted, but people were inclined to accept the limitations as a sacrifice to progress. As soon as technical development allowed for more, every added character was applauded, but a Pandora's box was slowly being opened, and problems came out of it. An A had always been a Latin capital letter, but the Greek one looks just the same, and so does the one from the Cyrillic script. Which is which?
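The ambiguity can be made concrete with a small sketch. It assumes Python's standard unicodedata module as the character database (a tool not mentioned in this chapter); the names it reports follow ISO 10646. The three code points below are look-alike capitals, yet each is a separate character with its own name.

```python
import unicodedata

# Three capitals that look alike in most fonts but are distinct characters.
LOOKALIKES = ["\u0041", "\u0391", "\u0410"]


def character_names(chars):
    """Return the formal name of each character, in order."""
    return [unicodedata.name(ch) for ch in chars]


if __name__ == "__main__":
    for ch, name in zip(LOOKALIKES, character_names(LOOKALIKES)):
        print(f"U+{ord(ch):04X}  {name}")
```

The visual shape alone gives no answer; only the name, or the code position, tells the three apart.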

In computer processing, where everything has to be unambiguous and precise, an identification problem arose. A solution was found by identifying a character through its NAME, which is fully specified in the standard and is written in capital letters according to fixed rules. Unfortunately, the result is rather long-winded, as in CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I. We are accustomed to looking not at a character, which is an abstraction, but at its visual appearance, the "graphic symbol" (see chapter 3). We look for an A, not for the LATIN CAPITAL LETTER A. This situation is not uncommon. We identify our fellow citizens by their face, not by their name, except when we are in doubt. We also give a person more than one name: a short one for use at home and a long one for the public registers, with many variants in between. Thus an official long name and an informal short one were introduced for Latin characters. The short one is the SHORT IDENTIFIER (SID), as specified in an Annex of ISO 6937 and also used internally by IBM. The SID consists of two letters and two digits. IBM adds another four digits to it.

The first letter indicates a classification: L for Latin, N for digits, S for specials (and, with IBM, also G for Greek, A for Arabic, H for Hebrew, K for Cyrillic, J for Japanese katakana, O for Korean, B for Thai). The second letter is the letter as we know it; the two digits mark the presence of an accent or a special form. Odd numbers are for small letters; for the corresponding capital, 1 is added. (IBM uses the four other digits to distinguish variants of shape or position on the line.)

SMALL LETTERS

11 ACUTE       21 CARON         31 MACRON  41 CEDILLA            51 LIGATURE  61 STROKE
13 GRAVE       23 BREVE         33         43 OGONEK                          63 SPECIAL
15 CIRCUMFLEX  25 DOUBLE ACUTE  35         45 DESCENDER                       65 SPECIAL
17 DIAERESIS   27 RING ABOVE    37         47 CARON & DESCENDER               67 SPECIAL
19 TILDE       29 DOT ABOVE     39         49 HOOK                            69 SPECIAL

Examples:
 
LA01 LATIN SMALL LETTER A
LA02 LATIN CAPITAL LETTER A
GA02 GREEK CAPITAL LETTER A
KA02 CYRILLIC CAPITAL LETTER A
LA11 LATIN SMALL LETTER A WITH ACUTE
LA12 LATIN CAPITAL LETTER A WITH ACUTE
LA13 LATIN SMALL LETTER A WITH GRAVE
LA14 LATIN CAPITAL LETTER A WITH GRAVE
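The rule behind these examples can be sketched in a few lines of Python. This is an illustration only, not the procedure of ISO 6937 or of IBM: the function name describe_sid and the two lookup tables are hypothetical, only three script letters and the accents that fit the "WITH ..." pattern are covered, and the long names are built after the pattern of the examples above, which need not match the formal name in every standard.

```python
import re

# First letter of the SID: the classification (only three scripts, for brevity).
SCRIPTS = {"L": "LATIN", "G": "GREEK", "K": "CYRILLIC"}

# Odd numbers as listed for small letters; 1 means "no accent".
MARKS = {
    1: "",
    11: "ACUTE", 13: "GRAVE", 15: "CIRCUMFLEX", 17: "DIAERESIS", 19: "TILDE",
    21: "CARON", 23: "BREVE", 25: "DOUBLE ACUTE", 27: "RING ABOVE",
    29: "DOT ABOVE", 31: "MACRON", 41: "CEDILLA", 43: "OGONEK",
}

_SID_PATTERN = re.compile(r"([A-Z])([A-Z])([0-9]{2})")


def describe_sid(sid):
    """Expand a four-character SID into a long name of the form used above."""
    match = _SID_PATTERN.fullmatch(sid)
    if match is None:
        raise ValueError(f"not a two-letter, two-digit SID: {sid!r}")
    script_code, letter, digits = match.groups()
    if script_code not in SCRIPTS:
        raise ValueError(f"classification not covered by this sketch: {script_code!r}")

    number = int(digits)
    is_capital = number % 2 == 0                      # even number: capital
    small_number = number - 1 if is_capital else number  # the small letter's odd number
    if small_number not in MARKS:
        raise ValueError(f"number not covered by this sketch: {digits}")

    case = "CAPITAL" if is_capital else "SMALL"
    name = f"{SCRIPTS[script_code]} {case} LETTER {letter}"
    mark = MARKS[small_number]
    return f"{name} WITH {mark}" if mark else name
```

For instance, describe_sid("LA12") subtracts 1 to find the accent number 11 (ACUTE) and reports a capital, because 12 is even.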

Besides an identification rule for characters, one is also useful for repertoires and codetables. Unfortunately, no convention has yet been devised that is satisfactory in all situations. ISO 2375 identifies coded character sets by a final byte and a number. But for that method the codetable has to be based on a C0 G0 C1 G1 structure, and PC codes and EBCDIC are not of that kind.

ISO 7350 identifies repertoires, but only those derived from ISO 10367, not yet those from ISO 10646. The choice is not free. A repertoire submitted for registration has to consist of one or more complete tables from ISO 10367, or of a list of separate characters taken from ISO 6937, whether or not combined with one or more codetables from ISO 10367 for non-Latin scripts.

Various other ISO standards define an identification for a coded character set or a repertoire, such as those for networks (OSI) or for electronic data interchange (EDI). These have already been discussed in Handbook Part 1, 4.3.2.

The standards for networks refer to the notation system ASN.1 (Abstract Syntax Notation One), which is specified in ISO 8824. Because the "character string types" defined there are mostly too limited, and sometimes outdated, a revision of ISO 8824 has been started. The latest draft is now awaiting publication.

The EDIFACT standard ISO 9735 specifies "levels" of character sets. Levels A and B both contain the invariant subset of ISO 646; B uses control characters as separators and terminators, A uses graphic characters ('+:?). Levels C, D, E and F have recently been added. These correspond to the following parts of ISO 8859:


ISO 9735 level   ISO 8859 part
C                1
D                2
E                5 (Latin/Cyrillic)
F                7 (Latin/Greek)

Part 9, which is required for applying the Netherlands standard, is not (yet) included here.
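The correspondence between levels C to F and the parts of ISO 8859 can be put to work in a short sketch. It assumes Python's built-in codecs iso8859_1, iso8859_2, iso8859_5 and iso8859_7 as stand-ins for the respective parts; the names LEVEL_TO_CODEC, fits_level and levels_for are hypothetical. Levels A and B are left out, and the check covers only the character repertoire, not the syntax rules of ISO 9735.

```python
# EDIFACT level -> Python codec for the corresponding part of ISO 8859.
LEVEL_TO_CODEC = {
    "C": "iso8859_1",
    "D": "iso8859_2",
    "E": "iso8859_5",  # Latin/Cyrillic
    "F": "iso8859_7",  # Latin/Greek
}


def fits_level(text, level):
    """Return True if every character of text exists in the level's ISO 8859 part."""
    try:
        text.encode(LEVEL_TO_CODEC[level])
    except UnicodeEncodeError:
        return False
    return True


def levels_for(text):
    """List, in alphabetical order, the levels whose repertoire covers text."""
    return [level for level in sorted(LEVEL_TO_CODEC) if fits_level(text, level)]
```

A Polish place name such as "Łódź" fits only level D, while text restricted to unaccented Latin letters fits all four levels, so the level has to be chosen per interchange agreement rather than derived from the data alone.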

References for EDIFACT:

ISO 9735 Electronic data interchange for administration, commerce and transport (EDIFACT) - Application level syntax rules, 1988-07-15
ISO 9735 Electronic data interchange for administration, commerce and transport (EDIFACT) - Application level syntax rules, AMENDMENT 1, 1992-12-01