Character encoding
Transmission of short messages between the SMSC and the handset is done using the Mobile Application Part (MAP) of the SS7 protocol. Messages are sent with the MAP mo- and mt-ForwardSM operations, whose payload length is limited by the constraints of the signaling protocol to precisely 140 octets (140 octets = 140 * 8 bits = 1120 bits). Short messages can be encoded using a variety of alphabets: the default GSM 7-bit alphabet, the 8-bit data alphabet, and the 16-bit UTF-16/UCS-2 alphabet. Depending on which alphabet the subscriber has configured in the handset, this leads to maximum individual Short Message sizes of 160 7-bit characters, 140 8-bit characters, or 70 16-bit characters. Support of the GSM 7-bit alphabet is mandatory for GSM handsets and network elements,[18] but characters in languages such as Arabic, Chinese, Korean, Japanese or Cyrillic alphabet languages (e.g. Russian) must be encoded using the 16-bit UCS-2 character encoding (see Unicode). Routing data and other metadata are additional to the payload size.
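The three message sizes follow directly from dividing the 1120-bit payload by the width of one character. A minimal sketch of that arithmetic in Python (the names PAYLOAD_OCTETS and max_characters are made up for this illustration):

```python
# Payload of one short message, as stated above: 140 octets.
PAYLOAD_OCTETS = 140
PAYLOAD_BITS = PAYLOAD_OCTETS * 8  # 1120 bits


def max_characters(bits_per_character):
    """Return how many whole characters of the given width fit in the payload."""
    return PAYLOAD_BITS // bits_per_character


for name, width in (("GSM 7-bit", 7), ("8-bit data", 8), ("UCS-2", 16)):
    print(f"{name}: {max_characters(width)} characters")
```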
ISO 8859
The ISO 8859 standard is designed for reliable information exchange, not typography; the standard omits symbols needed for high-quality typography, such as optional ligatures, curly quotation marks, dashes, etc. As a result, high-quality typesetting systems often use proprietary or idiosyncratic extensions on top of the ASCII and ISO 8859 standards, or use Unicode instead.
As a rule of thumb, if a character or symbol was not already part of a widely used data-processing character set and was also not usually provided on typewriter keyboards for a national language, it didn't get in. Hence the directional double quotation marks « and » used for some European languages were included, but not the directional double quotation marks “ and ” used for English and some other languages. French didn't get its œ and Œ ligatures because they could be typed as 'oe'. Ÿ, needed for all-caps text, was left out as well. These characters were, however, included later with ISO 8859-15, which also introduced the new euro sign character €. Likewise Dutch did not get the 'ij' and 'IJ' letters, because Dutch speakers had gotten used to typing these as two letters instead. Romanian did not initially get its 'Ș/ș' and 'Ț/ț' (with comma) letters, because these letters were initially unified with 'Ş/ş' and 'Ţ/ţ' (with cedilla) by the Unicode Consortium, considering the shapes with comma beneath to be glyph variants of the shapes with cedilla. However, the letters with explicit comma below were later added to the Unicode standard and are also in ISO 8859-16.
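The coverage differences described above can be checked with the ISO 8859 codecs that ship with Python. This is only a sketch: the helper name can_encode is invented here, and the codec names are the aliases Python's standard library accepts.

```python
def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


CODECS = ("iso-8859-1", "iso-8859-15", "iso-8859-16")

# Guillemet, French ligature, capital Y with diaeresis, euro sign,
# and Romanian s with comma below.
for ch in ("«", "œ", "Ÿ", "€", "ș"):
    support = {codec: can_encode(ch, codec) for codec in CODECS}
    print(ch, support)
```

With these codecs, « is available in all three parts, œ, Ÿ and € appear only from ISO 8859-15 onwards, and ș is only in ISO 8859-16.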
Most of the ISO 8859 encodings provide diacritic marks required for various European languages using the Latin script. Others provide non-Latin alphabets: Greek, Cyrillic, Hebrew, Arabic and Thai. Most of the encodings contain only spacing characters although the Thai, Hebrew, and Arabic ones do also contain combining characters. However, the standard makes no provision for the scripts of East Asian languages (CJK), as their ideographic writing systems require many thousands of code points. Although it uses Latin based characters, Vietnamese does not fit into 96 positions (without using combining diacritics) either. Each Japanese syllabic alphabet (hiragana or katakana, see Kana) would fit, but like several other alphabets of the world they aren't encoded in the ISO 8859 system.
GSM 7-bit encoding (IA5)
IA5 is a seven-bit encoding of letters that roughly matches ASCII. The lower 32 positions in the IA5 table are used for country-specific characters (and don't expect all countries to use the same tables!). Let's just make it simple and say that '0' to '9', 'A' to 'Z' and 'a' to 'z' are placed exactly where the ASCII characters are placed. This results in '0' being equal to '30 hex', 'A' being equal to '41 hex' and finally 'a' being equal to '61 hex'.
When encoding these characters in IA5, the 3 most significant bits (MSB) are taken and written as one ASCII hex digit. For '0' the 3 MSB of the 7-bit value give '3'. The rest, the 4 least significant bits (LSB), give '0'. This gives the output '30' in ASCII characters.
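The split into 3 MSB and 4 LSB can be sketched in a few lines of Python. The function name to_hex_pair is made up for this example, and it assumes the character is one of those that sit at the same position as in ASCII (digits and unaccented letters).

```python
def to_hex_pair(ch):
    """Encode one 7-bit character as two ASCII hex digits."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit character")
    high = code >> 4    # the 3 most significant bits
    low = code & 0x0F   # the 4 least significant bits
    return f"{high:X}{low:X}"


print(to_hex_pair("0"))  # 30
print(to_hex_pair("A"))  # 41
print(to_hex_pair("a"))  # 61
```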
If we look at the message type '30' we were sending above, the message string was '48656C6C6F20576F726C64'. Let's decode that one.
The first two characters are '4' and '8'. This gives the hex value '48'. If 'A' is '41', '48' will be the eighth letter ... 'H'. The next two characters are '65', which turns out to be the fifth lower-case letter, 'e'. The complete string can be decoded in this fashion, and much to our surprise the string becomes 'Hello World'.
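The same decoding can be done by a short Python loop. This is only a sketch: decode_hex_message is a hypothetical helper, and it relies on the simplification made above that letters, digits and the space sit at their ASCII positions.

```python
def decode_hex_message(hex_string):
    """Turn pairs of hex digits back into characters, two digits at a time."""
    chars = []
    for i in range(0, len(hex_string), 2):
        value = int(hex_string[i:i + 2], 16)  # e.g. '48' -> 0x48
        chars.append(chr(value))
    return "".join(chars)


print(decode_hex_message("48656C6C6F20576F726C64"))  # Hello World
```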
This code is an escape to an extension of the GSM 7 bit default alphabet table. A receiving entity which does not understand the meaning of this escape mechanism shall display it as a space character.
In the event that an MS receives a code where a symbol is not represented in the above table then the MS shall display the character shown in the main GSM 7 bit default alphabet table.
- This code value is reserved for the extension to another extension table. On receipt of this code, a receiving entity shall display a space until another extension table is defined. It is not intended that this extension mechanism should be used as an alternative to UCS2 to enhance the 7bit default alphabet character repertoire for national specific character sets.
- This code represents the EURO currency symbol. The code value is that used for the character ‘e’. Therefore a receiving entity which is incapable of displaying the EURO currency symbol will display the character ‘e’ instead.
- This code is defined as a Page Break character and may be used for example in compressed CBS messages. Any mobile station which does not understand the GSM 7 bit default alphabet table extension mechanism will treat this character as Line Feed.
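The fallback rules in these notes can be illustrated with a small decoder. This is a sketch under loud assumptions: the escape is taken to be code 0x1B, the extension table below holds only the two entries discussed in the notes (euro at the code for 'e', page break at the code for Line Feed), and main_table is a stand-in that is only right for characters placed as in ASCII. All names are invented for the example.

```python
ESC = 0x1B  # assumed code value of the escape to the extension table

# Partial extension table: only the entries discussed in the notes above.
EXTENSION = {
    0x65: "\u20ac",  # euro sign, same code value as 'e'
    0x0A: "\f",      # page break, same code value as Line Feed
}


def main_table(code):
    """Stand-in for the main table; valid only where it matches ASCII."""
    return chr(code)


def decode_septets(septets, supports_extension=True):
    out = []
    i = 0
    while i < len(septets):
        code = septets[i]
        if code != ESC:
            out.append(main_table(code))
            i += 1
        elif not supports_extension or i + 1 >= len(septets):
            # Escape not understood (or nothing follows): show a space.
            out.append(" ")
            i += 1
        else:
            following = septets[i + 1]
            if following == ESC:
                # Reserved for a further extension table: show a space.
                out.append(" ")
            else:
                # Unknown extension codes fall back to the main table.
                out.append(EXTENSION.get(following, main_table(following)))
            i += 2
    return "".join(out)


print(decode_septets([0x35, ESC, 0x65]))                            # 5€
print(decode_septets([0x35, ESC, 0x65], supports_extension=False))  # 5 e
```

A handset that does not know the escape mechanism thus shows a space followed by 'e' where the euro sign was meant.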
GSM 3.38 specifications - download
Unicode
In computing, Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems. Developed in tandem with the Universal Character Set standard and published in book form as The Unicode Standard, Unicode consists of a repertoire of about 100,000 characters, a set of code charts for visual reference, an encoding methodology and set of standard character encodings, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and a number of related items, such as character properties, rules for text normalization, decomposition, collation, rendering and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).
The Unicode Consortium, the non-profit organization that coordinates Unicode's development, has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with multilingual environments.
Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including XML, the Java programming language, the Microsoft .NET Framework and modern operating systems.
Character conversion from binary to UTF-16 text and vice versa can be tried with this binary content:
06270646062700200627062D06280640064006400640064006400640064006400640064006400640064006400643
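The hexadecimal sample above can also be converted in Python. The sketch below assumes the content is big-endian UTF-16 with no byte order mark (each group of four hex digits is one 16-bit code unit); the helper names are made up.

```python
HEX_TEXT = (
    "06270646062700200627062D0628"
    "0640064006400640064006400640064006400640064006400640064006400643"
)


def hex_to_text(hex_string):
    """Decode a hex dump of big-endian UTF-16 code units into text."""
    return bytes.fromhex(hex_string).decode("utf-16-be")


def text_to_hex(text):
    """Encode text as big-endian UTF-16 and return it as upper-case hex."""
    return text.encode("utf-16-be").hex().upper()


text = hex_to_text(HEX_TEXT)
print(text)                           # Arabic text
print(text_to_hex(text) == HEX_TEXT)  # True
```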
The Unicode Standard has seven character encoding schemes: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
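To see how the seven schemes differ, one character can be encoded with each of them. The sketch below uses Python's built-in codec names for the schemes and the euro sign (U+20AC) as the sample. Note that Python's plain "utf-16" and "utf-32" codecs prepend a byte order mark and use the machine's native byte order, so their exact bytes depend on the platform.

```python
SCHEMES = (
    "utf-8",
    "utf-16", "utf-16-be", "utf-16-le",
    "utf-32", "utf-32-be", "utf-32-le",
)


def encode_with_all(text):
    """Return a mapping from scheme name to the encoded bytes of text."""
    return {scheme: text.encode(scheme) for scheme in SCHEMES}


for scheme, data in encode_with_all("\u20ac").items():
    print(f"{scheme:10} {len(data)} bytes  {data.hex().upper()}")
```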
For more information about Unicode, visit the Unicode home page.