The GEDCOM Standard Release 5.5

Chapter 3

Using Character Sets in GEDCOM

Introduction

GEDCOM needs to accommodate different character sets to facilitate the sharing of genealogical data in different languages. To minimize the number of differing standards, we have chosen to have each system convert its usage to ANSEL, and eventually to UNICODE.

In January 1991, the Unicode Consortium was founded to promote the use of the Unicode standard, which accommodates almost all characters in one character set. (See the section "UNICODE (ISO 10646)".) The Unicode Consortium has agreed to merge its standard with ISO 10646, so that Unicode will be a subset of the ISO 10646 international character encoding standard.

Currently, it is difficult to handle two- and four-byte code sequences (wide characters). Therefore, until multi-byte handling becomes more common, ANSEL will be used to represent Latin-based characters.

The GEDCOM Standard does not address the implementation methods for multilingual processing, such as keyboard arrangements, sorting sequences, or character and graphic representations (font styles, proportional spacing, and so forth) on the CRT or printers. However, the Unicode standard has defined formatting characters that indicate the direction of text presentation, along with other text-formatting character codes.

Systems using code pages to support diacritical characters must convert all characters with codes above 127 to their ANSEL representations for that code page.

Most of the genealogy systems developed so far use ASCII, ANSEL, or both. ANSEL accommodates the set of Latin-based languages, as explained below.

8-Bit ANSEL

The 8-Bit ANSEL (American National Standard for Extended Latin Alphabet Coded Character Set for Bibliographic Use, Z39.47-1985 copyright) is the preferred character set for GEDCOM. It is used for all transmissions of information unless another character set is specified.

Using this character set standard makes it possible to preserve the full integrity of the language by supplementing the standard ASCII character set with both non-spacing character modifiers (diacritics) and spacing special characters.

Note: Non-spacing means that the diacritic is printed without advancing the device's print position. The character being modified is then printed in the same position, resulting in a combined image of both the character and the diacritic(s).

ANSEL text is stored with the non-spacing graphic character(s) immediately preceding the ASCII character that they modify. The ANSEL standard specifies an extended 8-bit configuration (codes 128 and above) to represent the spacing and non-spacing graphic characters needed by most of the Latin-based languages. ANSEL is a superset of ASCII: the standard ASCII characters, including the control characters, are preserved.
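
The storage order can be illustrated with a short Python sketch that reads ANSEL bytes and attaches each non-spacing character to the ASCII character that follows it. The five ANSEL code values in the table are an assumption taken from commonly published ANSEL tables (this chapter does not reproduce the code table), and the function name ansel_to_text is hypothetical.

```python
import unicodedata

# ASSUMPTION: excerpt of ANSEL non-spacing codes and their Unicode
# combining equivalents. Verify against the ANSEL code table.
ANSEL_NON_SPACING = {
    0xE1: "\u0300",  # grave accent
    0xE2: "\u0301",  # acute accent
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # umlaut (diaeresis)
}


def ansel_to_text(data: bytes) -> str:
    """Decode ANSEL bytes in which diacritics precede their base character."""
    out = []
    pending = []  # diacritics waiting for the base character they modify
    for byte in data:
        if byte in ANSEL_NON_SPACING:
            pending.append(ANSEL_NON_SPACING[byte])
        elif byte < 0x80:
            out.append(chr(byte))  # ASCII base character
            out.extend(pending)    # its diacritics follow it in Unicode
            pending.clear()
        else:
            raise ValueError(f"code 0x{byte:02X} is not handled in this sketch")
    if pending:
        raise ValueError("diacritic at end of data with no base character")
    # Compose base + combining marks into single characters where possible.
    return unicodedata.normalize("NFC", "".join(out))


print(ansel_to_text(b"Ren\xe2e"))  # acute accent stored before the final "e"
```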

ANSEL is known by two other names:

ANSI Z39.47-1985
American Library Association character set, used in library systems worldwide, including the MARC (Machine-Readable Catalog) format.

A description of the codes for the ANSEL character set has been reproduced with permission and is included with the printed version of The GEDCOM Standard. The description of ANSEL codes is not included in the electronic version. This description may be purchased from:

American National Standards Institute
1430 Broadway
New York, N.Y. 10018

The description of the ANSEL character set standard includes the following:

An 8-Bit Code Table showing the ASCII and extended ANSEL codes
An explanation or legend of these codes
A chart that identifies the ANSEL Non-spacing Graphic Characters
A chart that identifies the ASCII Control Characters
A chart that identifies the ASCII Graphic Characters

Character set codes 0 through 127 are the same for 8-Bit ANSEL and 8-Bit ASCII (USA version, ANSI 8-Bit). Character set codes 128 through 255 are unique to the ANSEL character set.

ASCII (USA Version)

When a language does not need diacritics or other special characters, and you are not transmitting binary data, you will find it convenient to use ASCII (8-bit USA version) if your computer already supports it. This is a standard of the American National Standards Institute (ANSI). Most of the basic printable characters of ANSEL and ASCII (USA version, ANSI 8-Bit) are identical.
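
As a rough illustration of this choice, the Python sketch below suggests a character set name for a piece of text: ASCII when every character has a code of 127 or below, and ANSEL otherwise. The function name suggest_charset is hypothetical, and the returned strings are assumed to be the values a system would write on its character set line.

```python
def suggest_charset(text: str) -> str:
    """Suggest a character set name for the given text (a sketch only)."""
    # str.isascii() is true when every character code is 127 or below,
    # the range that ASCII and ANSEL share.
    return "ASCII" if text.isascii() else "ANSEL"


print(suggest_charset("John Smith"))   # no diacritics needed
print(suggest_charset("Ren\u00e9"))    # contains an accented letter
```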

UNICODE (ISO 10646)

The Unicode standard is a new character code designed to encode text for storage in computer files. It is a subset of the upcoming ISO 10646 standard. The design of the Unicode standard is based on the simplicity and consistency of today's prevalent character code set, the extended ASCII code set, but goes far beyond ASCII's limited ability to encode only the Latin alphabet: the Unicode encoding provides the capacity to encode almost all of the characters used for written languages throughout the world.

To accommodate the many thousands of characters used in international text, the Unicode standard uses a 16-bit code set instead of extended ASCII's 8-bit code set. This expansion provides codes for approximately 65,000 characters. The Unicode standard assigns each character a unique 16-bit value and does not use complex modes or escape codes to specify modified characters or special cases. UNICODE may later adopt a 32-bit code, which should allow for all character representations. Unicode 16-bit values are written in text in the form U+0041, which is the value assigned to the letter A (65 decimal).

The Unicode standard includes the Latin alphabet used for English, the Cyrillic alphabet used for Russian, and the Greek, Hebrew, and Arabic alphabets. Other alphabets used in countries across Europe, Africa, the Indian subcontinent, and Asia, such as Japanese Kana, Korean Hangul, and Chinese Bopomofo, are also included. The largest part of the Unicode standard is devoted to thousands of unified character codes for Chinese, Japanese, and Korean ideographs. (See "The Unicode standard", vol. 1 and 2, published by Addison-Wesley Publishing, for character code standards.)
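
The U+ notation and the size of the 16-bit code space can be checked with a few lines of Python. This is only an illustrative sketch; the function names u_notation and fits_in_16_bits are hypothetical.

```python
def u_notation(ch: str) -> str:
    """Return the U+ notation (four or more hex digits) for one character."""
    return f"U+{ord(ch):04X}"


def fits_in_16_bits(ch: str) -> bool:
    """Report whether the character's code fits in a single 16-bit value."""
    return ord(ch) <= 0xFFFF


print(u_notation("A"), ord("A"))  # U+0041 65
print(2 ** 16)                    # 65536 distinct 16-bit values
```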

The Unicode character set environment should eventually contain a set of characters for all languages. If the Unicode environment is used to produce a GEDCOM transmission, the header record will also be in Unicode, so receiving systems must determine whether the transmission is Unicode or ASCII before they can interpret the GEDCOM header. This is done by reading the first two bytes of the transmission. If the first two bytes are 0x30 and 0x20, the transmission is in either ASCII or ANSEL, as determined by the header record. If the first two bytes are 0x30 and 0x00, the transmission should be processed as a Unicode transmission. (Different platforms may reverse the position of the null byte, in which case the test would be for 0x00 and 0x30.)
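
The two-byte test described above can be sketched in Python as follows. The function name detect_transmission_type and the returned labels are hypothetical; only the byte values come from the text.

```python
def detect_transmission_type(first_two: bytes) -> str:
    """Classify a GEDCOM transmission from its first two bytes."""
    if first_two == b"\x30\x20":
        # "0" followed by a space: one byte per character.
        return "ASCII or ANSEL (see header record)"
    if first_two == b"\x30\x00":
        # "0" followed by a null byte: 16-bit Unicode, low byte first.
        return "Unicode"
    if first_two == b"\x00\x30":
        # Null byte first: 16-bit Unicode with the byte order reversed.
        return "Unicode (reversed byte order)"
    return "unknown"


print(detect_transmission_type(b"0 HEAD"[:2]))
print(detect_transmission_type("0 HEAD".encode("utf-16-le")[:2]))
```

A receiving system would apply this test before attempting to parse the header record, and then read the character set line only in the one-byte case.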

How to Change Character Sets

The character set for an entire transmission is specified in the character set line of the header record.

The example below shows the specification in the header record:

Lvl Tag Value

  0 HEAD
    1 SOUR PAF
      2 VERS 2.1
    1 DEST ANSTFILE
    1 CHAR ANSEL

The character set change remains in effect until the TRLR record is encountered at the end of the transmission.
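
A receiving system might locate the character set line with a sketch like the following Python function. It assumes each line has the form level, tag, optional value, separated by spaces, as in the example above, and it falls back to ANSEL when no character set line is found, since ANSEL is used unless another character set is specified. The function name header_charset is hypothetical.

```python
def header_charset(lines, default="ANSEL"):
    """Return the value of the level-1 CHAR line inside the HEAD record."""
    in_head = False
    for raw in lines:
        parts = raw.strip().split(None, 2)  # level, tag, optional value
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines in this sketch
        level, tag = int(parts[0]), parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == 0:
            if in_head:
                break  # the header record has ended
            in_head = tag == "HEAD"
        elif in_head and level == 1 and tag == "CHAR":
            return value
    return default  # no CHAR line: ANSEL is assumed


example = [
    "0 HEAD",
    "  1 SOUR PAF",
    "    2 VERS 2.1",
    "  1 DEST ANSTFILE",
    "  1 CHAR ANSEL",
]
print(header_charset(example))  # ANSEL
```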

The UNICODE character set should be used for multi-language support as soon as operating systems begin providing adequate storage and display support.

For more information about character sets, see the following:

"Extended Latin Alphabet Coded Character Set for Bibliographic Use." American National Standards Institute (ANSI), Z39.47-1985.
"8-Bit ASCII: Structure and Rules." American National Standards Institute (ANSI), X3.134.1-198x.
"7-Bit and 8-Bit ASCII Supplemental Multilingual Graphic Character Set (ASCII Multilingual Set)" (manuscript). American National Standards Institute (ANSI), X3.134.2-198x.
"The Unicode standard", vol. 1 and 2, published by Addison-Wesley Publishing.