Radified Community Forums
Rad Community Technical Discussion Boards (Computer Hardware + PC Software) >> PC Hardware + Software (except Cloning programs) >> UTF-16 character encoding

Message started by Rad on Sep 10th, 2011 at 12:48pm

Title: UTF-16 character encoding
Post by Rad on Sep 10th, 2011 at 12:48pm
Trying to get a handle on UTF-16 character encoding, which is used by JavaScript (ECMAScript).


From the 'Definitive Guide':

"A string is an immutable ordered sequence of 16-bit values, each of which typically represents a Unicode character."

I know about 'surrogate pairs' and how they work. That's not the problem.

From the Wiki link above:

The code points in each plane have the hexadecimal values xx0000 to xxFFFF, where xx is a hex value from 00 to 10, signifying which plane the values belong to.


The first plane (code points U+0000 to U+FFFF) contains the most frequently used characters and is called the Basic Multilingual Plane or BMP. UTF-16 encodes valid code points in this range as single 16-bit code units that are numerically equal to the corresponding code points.

My question is > where are the 16 bits? (Or maybe > ARE THERE 16 bits?)

When I think of a 'bit', I think of "0 or 1" (zero or one):



So, when I think of 16 bits, I think of something like (for example) > 0110100111001001 (.. should be 16 characters there).

But maybe my thinking is incorrect.

For example, these UTF-16 codepoints have 4 hexadecimal characters. So .. 2 to the 4th power = 16 (2x2x2x2). Is *that* what they mean by "16-bits"?

It would seem that way, but the book says this (regarding an explanation of surrogate pairs):

uh, I can't find the character, but the text references the mathematical constant 'e'. (It's sort of a forward-leaning e.) Anyway, it says:

e is 1 character with 17-bit codepoint 0x1d452

I realize this is a codepoint and not the UTF-16 encoding for this particular codepoint. But the two seem to parallel each other in other ways.

Now, if my 2-to-the-4th-power theory were correct, the above (I believe) would be 32-bit .. no? (i.e. 2 to the 5th power).

By way of comparison, a regular 16-bit codepoint (for the Greek letter pi) is > 0x03c0 (one less digit/character than the one given for e above).

Anyway, I don't need to know *everything* about UTF-16 character encoding, but I would like to know > where are the 16 bits they are talking about in the values for UTF-16?

Title: Re: UTF-16 character encoding
Post by MudCrab on Sep 12th, 2011 at 8:59pm
I'm not an expert on bits and encoding, but I think the 16 bits is just the base unit -- each code point in the BMP uses 16 bits. The surrogates use a total of 32 bits.

2 bytes (16-bits) = 1 BMP code point
4 bytes (32-bits) = non-BMP code point

The 17-bit code point is possibly referring to 17 bits being required to represent the value, not 17 bits of storage.
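A quick way to see MudCrab's point in JavaScript itself (a sketch, assuming a modern engine with String.prototype.codePointAt):

```javascript
// "\u{1D452}" is MATHEMATICAL ITALIC SMALL E, code point U+1D452 --
// the forward-leaning 'e' from the book's example.
const e = "\u{1D452}";

// JavaScript strings count 16-bit code units, so this one character
// reports a length of 2: it is stored as a surrogate pair.
console.log(e.length);                      // 2

// The two 16-bit storage units (the surrogate pair):
console.log(e.charCodeAt(0).toString(16));  // d835 (high surrogate)
console.log(e.charCodeAt(1).toString(16));  // dc52 (low surrogate)

// The single code point those two units encode together:
console.log(e.codePointAt(0).toString(16)); // 1d452

// A BMP character like pi (U+03C0) fits in one 16-bit unit:
console.log("\u03c0".length);               // 1
```

So the "17-bit" value 0x1D452 never appears directly in storage; it is carried by two 16-bit units.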


Title: Re: UTF-16 character encoding
Post by Rad on Sep 15th, 2011 at 1:16pm
yeah, i got that much. Here's some further info:

Character encodings are one of the many things I've studied a lot over the years, because writing code used in Asian countries is an area where understanding them has been of real practical value.

However, there's an awful lot of computing history tied up in and expressed in character encodings; there are a lot of nuances to character encodings which come from accidents of history, and so to really understand how things are now you have to walk quite a long road from computing's early days.

In particular, there's an oft-assumed correspondence between the notion of an 8-bit "byte" as the unit of character encoding, and as the fundamental addressable unit in computer memory. However, these aren't quite the same thing at all. IBM mainframes such as the IBM 360 happened to be 32-bit machines, and IBM developed a character encoding to fit within the 32-bit machine word by packing four 8-bit characters together, leading to the EBCDIC encoding (which had things like upper and lower case).

Later on, DEC had machines which happened to have a 36-bit machine word, and used a character encoding of just six bits per character to be able to pack six characters per machine word and CPU register. Later on again the DEC PDP-11 appeared - a 16-bit machine, but one in which memory could be addressed in a "half" unit of 8 bits, and for which a programming language - C - was developed that mapped the notion of character to that. For these machines, the SIXBIT encoding of the earlier machine words was expanded to a 7-bit encoding (with one bit left spare) that formed the basis of first the US ASCII standard and then a host of global standards (under ISO 646) derived from it.

Almost all the early Internet protocols and standards were written with an eye to the difference between characters and bytes, because almost all machines at the time had different ideas about these things; so, textual data would be sent between machines that used radically different encodings and numbers of bits per character. It so happened that the microprocessor era - arriving after the PDP-11 - fixed on 8 bits as the fundamental unit of everything, and so over the last 30 years that has become universal, but it wasn't always so.

So, some background. Throughout the 1980's, especially as UNIX and C on microprocessors spread into Asian countries and most notably Japan, the problem of dealing with very large character sets became of practical importance. However, there were 3 fronts to this:

a) The problem of data processing within systems; the languages and tools of the microprocessor era, especially C, did not make any distinction between a character and a byte. The ISO 646 national standards proliferated by mostly switching between incompatible single-byte encodings, but for Asian languages there was no choice but to encode characters as variable-length sequences of bytes; this broke the correspondence between byte = character fundamentally and meant that many programs did not work on non-US characters at all.

b) The problem of data interchange between systems; to even move data in single-byte encodings from machine to machine, you needed to know which of the national variant formats it was encoded in so it could be successfully interpreted.

c) The problem of meaningfully displaying text; bitmapped displays were on the rise, so that more and more different display forms of characters were becoming possible, but in many Asian and Arabic scripts there is not a direct correspondence between how things display (their visual form) and the actual language form or encoding. This goes beyond differences in reading direction (some languages are right-to-left or bottom-to-top): in written Thai, vowels are represented as marks above and below consonants, while Arabic uses a relatively small alphabet but is rendered using an incredible variety of ligatured forms, so that character display is context-sensitive.

So, in the 80's several things happened; a) and b) in combination led to the pursuit of a single 16-bit "universal" character encoding which could represent basically every active natural language which had been through the ISO 646 process, so that the interchange and processing problems could be solved by transcoding from separate national encodings (some of which, e.g. Japanese, required multi-byte sequences) through the wider 16-bit "universal" encoding. An industry consortium, called the Unicode Consortium, was formed to create the encoding.

At the same time, the ISO bodies wanted to standardize computer languages, and so a huge part of the ISO effort for C which culminated in the 1989 standard was to add machinery to the language to require support for multi-byte character encodings so that programs could be reliably written for the existing ISO 646 encodings, and also to add support for things like the 16-bit "Universal" encoding of Unicode, although they hedged their bets about what the exact universal "wide" encoding would be.

As for c), one of the things that Unicode tried to deal with, and force people to face, was the difference between the notion of an encoded character (which was important to standardize for data interchange) and the notion of what a character looks like (which we call a glyph). Programmers of all stripes tended to assume that because western languages had a one-to-one correspondence between these things, they were the same concept, whereas they really are different things in other languages. So, Unicode carefully specified the encodings, but made a big deal of the fact that glyphs were a different thing altogether and that people should stop assuming they were.

Title: Re: UTF-16 character encoding
Post by Rad on Sep 15th, 2011 at 1:19pm
cont'd from above:

Now, on to the questions:
>> When I think of a 'bit', I think of "0 or 1" (zero or one)

Right. And those bits in 1980's machines and encodings and computer languages were arranged in 8-bit groups. Bigger data was always then formed out of multiples of those units, so to go any bigger you needed two 8-bit groups, thus 16 bits.
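To make those 16 bits visible, here's a one-liner in JavaScript (a sketch, using the pi code point from earlier in the thread):

```javascript
// The code point for the Greek letter pi is U+03C0. Printing it in
// base 2, padded to 16 digits, shows the sixteen actual 0/1 bits of
// the single UTF-16 code unit that encodes it.
const pi = 0x03c0;
console.log(pi.toString(2).padStart(16, "0")); // 0000001111000000
```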

>> For example, these UTF-16 codepoints have 4 hexadecimal characters. So .. 2 to the 4th power = 16 (2x2x2x2). Is *that* what they mean by "16-bits"

Well, the hexadecimal character thing comes from the same root in 8-bit bytes; it's handy to represent those 8-bit bytes out of two symmetric halves of 4 bits, instead of the 1960's convention of using octal (base 8) for representing numbers in written form. Octal uses 3-bit groups and the boundaries between units don't line up any more in terms of 8-bit bytes because 3 into 8 doesn't work cleanly. Octal was the dominant form in olden times before computers used 8-bit bytes, but not so much since.

So, both things work out. Hexadecimal is a tradition of the 8-bit byte world because it composes well when joining bytes together into longer units - the hexadecimal representation of two bytes (which together have 16 bits) is just the hex representations of the component bytes written side by side.
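The alignment between hex digits and 4-bit groups can be demonstrated directly (a sketch in JavaScript, slicing the pi code unit from the question):

```javascript
// Each hex digit names exactly one 4-bit group (a "nibble"), so the
// four hex digits of a 16-bit code unit line up with four 4-bit slices:
//   0x03C0  ->  0000 0011 1100 0000
const unit = 0x03c0;
const slices = [];
for (let shift = 12; shift >= 0; shift -= 4) {
  const nibble = (unit >> shift) & 0xf; // extract one 4-bit group
  slices.push(nibble.toString(16) + "=" + nibble.toString(2).padStart(4, "0"));
}
console.log(slices.join(" ")); // 0=0000 3=0011 c=1100 0=0000
```

So 4 hex digits mean 4 x 4 = 16 bits - it's digits times 4 bits each, not 2 to the power of the digit count.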

>> I realize this is a codepoint and not the UTF-16 encoding for this particular codepoint. But the two seem to parallel each other in other ways.

Yes, because of historical accident; ISO 10646 started out as an evolution of the 16-bit Unicode specification. The Unicode people anticipated that 16 bits would not be enough, while at the same time struggling with the fact that even transitioning THE ENTIRE WORLDWIDE COMPUTER INDUSTRY to 16 bits was going to be awfully hard (32 bits would have been politically and technically infeasible at the time they started), and so they left some unused space for future extension to an even bigger encoding space.

Computer environments such as Windows NT, which started out at the same time as Unicode was settled, used it as a universal 16-bit encoding (which at the time was called either UCS-2 or Unicode). It wasn't until ***much later on*** that the ISO 10646 process reached consensus on a universal encoding requiring four 8-bit units to represent any possible character.

Once ISO 10646 became settled, what happened was that the notion of UTF-16 (with its surrogate pairs) was set up as a way of retrofitting the newer standard with its even bigger characters into systems which had already been using Unicode for many, many years.

So, ISO 10646 and UCS-4 started out with Unicode and its notion of UCS-2 as its basis, and in particular took every existing code point in Unicode and carried it forward into ISO 10646. So, systems which only understood the Unicode specification would still work reasonably well with the newer, expanded specification, with surrogate pairs allowing a swathe of the expanded UCS-4 encoding to be used in older systems which had committed fully to the concept of characters as 16 bits wide, just as in the 1980's support for variable-length encodings was used at first to retrofit support for non-English languages into the 8-bit systems.
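The surrogate-pair retrofit is just a little arithmetic; here's a sketch in JavaScript (the function name `toSurrogatePair` is mine, not from the thread or any standard library):

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into the pair
// of 16-bit surrogate units defined by UTF-16.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;     // at most 20 bits remain
  const high = 0xd800 + (offset >> 10);   // top 10 bits -> high surrogate
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits -> low surrogate
  return [high, low];
}

// The italic-e code point U+1D452 from the book's example:
const [hi, lo] = toSurrogatePair(0x1d452);
console.log(hi.toString(16), lo.toString(16));            // d835 dc52
console.log(String.fromCharCode(hi, lo) === "\u{1D452}"); // true
```

Because the surrogate ranges were carved out of otherwise-unused BMP space, older 16-bit systems could pass these pairs through without understanding them.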

The big difference between the two transitions is that in UCS-2, the assumption of a correspondence between character and glyph was already broken, so virtually everything that understood Unicode handled the UTF-16 extensions sensibly; and the UTF-16 surrogates were always just pairs, whereas in the 1980's, characters which needed 3- or 4-byte sequences broke almost every existing program.

So although a few people make a big deal about the difference between UTF-16 and classic Unicode, the reality is that classic Unicode did meet its goal of representing the entire world's communication needs as they were in the 1980's, and the conceptual changes involved in the expansion to the larger ISO 10646 encoding broke only a tiny few existing programs (the ones that didn't take the distinction between character and glyph seriously) anyway.

Radified Community Forums » Powered by YaBB 2.4!
YaBB © 2000-2009. All Rights Reserved.