3. COMMUNICATION AND CODING, ASCII

Processing and transmission of texts by electronic means is only possible if their constituents have been transformed into appropriate units. We have seen that these units consist of bits and bytes that are recorded on data carriers. What is needed is a correspondence between the letters and tokens making up a text and the bytes that the electronic system (computer or communication line) can process. In mathematics this is called a mapping of one set to another. We will speak of "coding" a token with one or more bytes.

The sets we are considering cannot be chosen freely. The set of bytes is finite: 7-bit bytes give 128 septets and 8-bit bytes give 256 octets. If one wants to specify a one-to-one relation, then at most 128 or 256 tokens, respectively, can be coded. An element of a set thus restricted is called an "IT-character", or "character" for short. We can then speak of a "coded character set", which is specified in a standard. Such a standard enumerates all relations that hold between characters and bytes when it is applied.
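The counting argument can be checked with a few lines of Python. This is only a minimal sketch: the names combinations, toy_code and is_one_to_one are illustrative and do not come from any standard, and the three-character table is a made-up fragment whose values merely happen to agree with ASCII.

```python
def combinations(bits: int) -> int:
    """Number of distinct bit combinations of the given width."""
    return 2 ** bits


# A toy coded character set (hypothetical): each character gets one byte value.
toy_code = {"A": 65, "B": 66, "C": 67}


def is_one_to_one(code: dict) -> bool:
    """True if no two characters share the same byte value."""
    return len(set(code.values())) == len(code)


print(combinations(7))          # 128 septets
print(combinations(8))          # 256 octets
print(is_one_to_one(toy_code))  # True
```

A table in which two characters shared one byte value would fail the last check, and could therefore not serve as a coded character set in the sense used here.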

The official ISO definitions shared by all the standards for coded character sets are thus:

character: A member of a set of elements used for the organization, control or representation of data.
coded character set; code: A set of unambiguous rules that establishes a character set and the one-to-one relationship between the characters of the set and their coded representation by one or more bit combinations.

This implies much more than only a representation of tokens. Indeed, there are two kinds of characters:

graphic character: A character, other than a control function, that has a visual representation normally handwritten, printed or displayed, and that has a coded representation consisting of one or more bit combinations.
graphic symbol: A visual representation of a graphic character or of a control function.
control function: An action that affects the recording, processing, transmission or interpretation of data, and that has a coded representation consisting of one or more bit combinations.
control character: A control function the coded representation of which consists of a single bit combination.

It is accepted usage at present to speak of a "repertoire", if a set of characters is meant for which a coding is defined (without indicating which). The ISO-definition is:

repertoire: A specified set of characters that are represented by means of one or more bit combinations of a coded character set.

After all these explanations, it is time to demonstrate with an example what a coded character set looks like. We select ASCII, American Standard Code for Information Interchange, the simplest and one of the oldest.

ASCII is a 7-bit code. In the years after 1960 it was felt that six bits were too few for coding texts, since it was desirable to allow small letters as well. Most computers and printers provided facilities for capital letters only, a situation that lasted until around 1983. Taking seven bits would not increase hardware costs too much. For this purpose ANSI (American National Standards Institute) adopted the standard X3.4. (ANSI was formerly called ASA and, later, USASI.)

The structure of a coded character set becomes immediately clear when looking at a codetable. This is described with the following definitions:

code table: A table showing the character allocated to each bit combination in a code.
position: That part of a code table identified by its column and row co-ordinates.

The codetables in the ISO standards have a traditional form. Each byte is split into two parts. The "low order" bits label the rows and are placed vertically; the "high order" bits label the columns and are placed horizontally. This means that the numeric values increase from top to bottom within a column, and not from left to right.
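The split of a byte into a column part and a row part can be sketched as follows. The function names position, byte_value and notation are our own, chosen for illustration; the sketch assumes a 7-bit byte, with the three high-order bits selecting the column and the four low-order bits selecting the row.

```python
def position(byte: int) -> tuple:
    """Split a 7-bit byte into (column, row) of the codetable."""
    if not 0 <= byte <= 127:
        raise ValueError("not a 7-bit value")
    column = byte >> 4    # high-order bits b7 b6 b5
    row = byte & 0x0F     # low-order bits b4 b3 b2 b1
    return column, row


def byte_value(column: int, row: int) -> int:
    """Recombine a codetable position into the numeric value of the byte."""
    return 16 * column + row


def notation(byte: int) -> str:
    """Write the position in the form column/row."""
    column, row = position(byte)
    return f"{column}/{row}"


print(notation(ord("A")))  # 4/1
print(notation(ord("z")))  # 7/10
print(byte_value(4, 1))    # 65
```

Moving one row down adds 1 to the value, while moving one column to the right adds 16, which is why the values run from top to bottom.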

In the codetable for ASCII (TABLE 3) one can see that every slot ("code position") corresponds to a byte and shows a character: graphic characters in columns 2-7, control characters in columns 0 and 1, indicated with a two- or three-letter combination (explained in Chapter 7). The space is indicated with SP. DEL means DELETE, a special control character that was useful in the days of paper tape and is now somewhat superfluous. With the codetable goes a list of characters with their official names, their normal visualization ("default graphic symbol") and the corresponding byte in ISO notation x/y. Included are 26 small and 26 capital letters, the digits, and a collection of punctuation marks, supplemented by symbols from commerce or mathematics. Some of the latter, such as $ and @, betray the American origin of ASCII. In fact, this selection is the final phase of a long development. A good survey of the history of coding and the adoption of characters can be found in the book by C E Mackenzie, which describes what happened up to around 1970.
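The division of the codetable into control and graphic characters can be expressed as a small test on the byte value. The sketch below is illustrative only: the function name kind is ours, and counting SP among the graphic characters while treating DEL as a control character is an assumption made for this example.

```python
def kind(byte: int) -> str:
    """Classify a 7-bit byte as 'control' or 'graphic' (illustrative only)."""
    if not 0 <= byte <= 127:
        raise ValueError("not a 7-bit value")
    if byte < 32 or byte == 127:   # columns 0 and 1, plus DEL at 7/15
        return "control"
    return "graphic"               # assumption: SP (2/0) counted as graphic


graphic = [chr(b) for b in range(128) if kind(b) == "graphic"]
small = [c for c in graphic if "a" <= c <= "z"]
capital = [c for c in graphic if "A" <= c <= "Z"]
digits = [c for c in graphic if "0" <= c <= "9"]

print(len(graphic))                           # 95
print(len(small), len(capital), len(digits))  # 26 26 10
```

Under these assumptions 33 positions remain for the space, the punctuation marks and the other symbols.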

Because the coding specifies a numerical value for every character (that of the byte, sometimes called the "ASCII value"), a numerical ordering is established between the characters, sometimes called the "collation sequence". This order sometimes serves as the basis for sorting. But it is completely arbitrary, in particular for the special characters. With this method digits precede letters, and capital letters precede small letters. A good sorting order, by contrast, is based on real users' requirements and respects what they find convenient, not what is easiest for a software writer to produce, or what looks nice to a committee first confronted with the problem.
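The effect of sorting by code value is easy to demonstrate. In the sketch below the word list is invented for the purpose, and the second ordering (ignore case first, then fall back on the code values) is only one possible approximation of what a user might expect, not a recommended rule.

```python
words = ["apple", "Zebra", "42nd", "Apple", "zebra"]

# Sorting by code value: digits, then capital letters, then small letters.
by_code = sorted(words)
print(by_code)      # ['42nd', 'Apple', 'Zebra', 'apple', 'zebra']

# One possible user-oriented order (an assumption): compare without regard
# to case, and use the code values only to break ties.
by_letter = sorted(words, key=lambda w: (w.lower(), w))
print(by_letter)    # ['42nd', 'Apple', 'apple', 'Zebra', 'zebra']
```

In the first result "Zebra" precedes "apple" merely because Z has the value 90 and a the value 97.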

The table also contains some "free-standing" accents. These are rarely used for the purpose for which they were created. Often the circumflex is called "control" and used in combination with a letter to produce a special effect. Other signs like $ % # @ frequently play a role in a piece of software that is quite different from what their name would lead one to expect. Finally, A B C D E F sometimes have the meaning of hexadecimal digits. Speaking of a classification of graphic characters thus only makes sense with respect to a specific application, and not in general. A character has no universal meaning. Semantics is a matter of how it is used. Moreover, the term ASCII is frequently misused in a term like "ASCII-file", which only indicates that the text does not contain codes for page layout, and not that the file is coded according to the ASCII codetable.
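That the meaning of a character depends on its use can be illustrated with three small functions. All names below are our own. The "control" reading of the circumflex is assumed here to follow the common convention in which ^C stands for the control character 64 positions before the letter C, and fits_ascii only tests whether every byte is a 7-bit value, which is a necessary condition for a file to be coded according to the ASCII codetable.

```python
def hex_digit_value(ch: str) -> int:
    """Value of a character when it is read as a hexadecimal digit."""
    return int(ch, 16)


def caret_to_control(letter: str) -> int:
    """Byte value meant by circumflex + letter (assumed convention)."""
    if not ("A" <= letter.upper() <= "Z"):
        raise ValueError("expected a letter A-Z")
    return ord(letter.upper()) - 64


def fits_ascii(data: bytes) -> bool:
    """True if every byte has a value below 128."""
    return all(b < 128 for b in data)


print(ord("F"), hex_digit_value("F"))       # 70 15
print(caret_to_control("C"))                # 3, the position of ETX
print(fits_ascii(b"plain text"))            # True
print(fits_ascii("é".encode("latin-1")))    # False
```

The same letter F thus stands for the value 70 as a coded character and for 15 as a hexadecimal digit; which of the two applies is decided by the application, not by the codetable.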

TABLE  3

ASCII CODETABLE: RELATION BETWEEN BYTES AND CHARACTERS

        +---+---+---+---+---+---+---+---+
      b7| 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
      b6| 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
      b5| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
 bbbb   +---+---+---+---+---+---+---+---+
 4321   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+----+--+---+---+---+---+---+---+---+---+
|0000| 0|NUL|DLE| SP| 0 | @ | P | ` | p |
|0001| 1|SOH|DC1| ! | 1 | A | Q | a | q |
|0010| 2|STX|DC2| " | 2 | B | R | b | r |
|0011| 3|ETX|DC3| # | 3 | C | S | c | s |
|0100| 4|EOT|DC4| $ | 4 | D | T | d | t |
|0101| 5|ENQ|NAK| % | 5 | E | U | e | u |
|0110| 6|ACK|SYN| & | 6 | F | V | f | v |
|0111| 7|BEL|ETB| ' | 7 | G | W | g | w |
|1000| 8| BS|CAN| ( | 8 | H | X | h | x |
|1001| 9| HT| EM| ) | 9 | I | Y | i | y |
|1010|10| LF|SUB| * | : | J | Z | j | z |
|1011|11| VT|ESC| + | ; | K | [ | k | { |
|1100|12| FF| FS| , | < | L | \ | l | | |
|1101|13| CR| GS| - | = | M | ] | m | } |
|1110|14| SO| RS| . | > | N | ^ | n | ~ |
|1111|15| SI| US| / | ? | O | _ | o |DEL|
+----+--+---+---+---+---+---+---+---+---+

CORRESPONDING DECIMAL VALUES

        +---+---+---+---+---+---+---+---+
      b7| 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
      b6| 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
      b5| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
 bbbb   +---+---+---+---+---+---+---+---+
 4321   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+----+--+---+---+---+---+---+---+---+---+
|0000| 0|  0| 16| 32| 48| 64| 80| 96|112|
|0001| 1|  1| 17| 33| 49| 65| 81| 97|113|
|0010| 2|  2| 18| 34| 50| 66| 82| 98|114|
|0011| 3|  3| 19| 35| 51| 67| 83| 99|115|
|0100| 4|  4| 20| 36| 52| 68| 84|100|116|
|0101| 5|  5| 21| 37| 53| 69| 85|101|117|
|0110| 6|  6| 22| 38| 54| 70| 86|102|118|
|0111| 7|  7| 23| 39| 55| 71| 87|103|119|
|1000| 8|  8| 24| 40| 56| 72| 88|104|120|
|1001| 9|  9| 25| 41| 57| 73| 89|105|121|
|1010|10| 10| 26| 42| 58| 74| 90|106|122|
|1011|11| 11| 27| 43| 59| 75| 91|107|123|
|1100|12| 12| 28| 44| 60| 76| 92|108|124|
|1101|13| 13| 29| 45| 61| 77| 93|109|125|
|1110|14| 14| 30| 46| 62| 78| 94|110|126|
|1111|15| 15| 31| 47| 63| 79| 95|111|127|
+----+--+---+---+---+---+---+---+---+---+
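The decimal values in the second table follow from the position alone: the value is 16 times the column number plus the row number. A minimal sketch that prints the sixteen rows is given below; the function name decimal_row and the exact layout of the printed lines are our own choices.

```python
def decimal_row(row: int) -> list:
    """Decimal values of the eight code positions in one row of the table."""
    return [16 * column + row for column in range(8)]


for row in range(16):
    cells = "|".join(f"{value:3d}" for value in decimal_row(row))
    print(f"|{row:2d}|{cells}|")

print(decimal_row(9))   # [9, 25, 41, 57, 73, 89, 105, 121]
```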