Strings and encodings


Page 1: Strings and encodings

We’re all familiar with how strings are represented in our programming languages.

But computers only use binary. How is this represented in memory, in files on disk, etc.?

We need to store characters as numbers, e.g., A = 1, B = 2, C = 3, etc.

One of the most popular encoding standards has been ASCII—the American Standard Code for Information Interchange.

It defined a 7-bit code (that is, the values from 0 to 127) that gave letters, digits, and common punctuation a unique integer code.
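
To make that concrete, here's a quick sketch in Python (my choice of language for these examples; the slides themselves don't show code). ord() gives the integer behind each character, and encoding to ASCII produces one byte per character, all of them below 128.

    for ch in "A", "a", "0", "!":
        print(ch, ord(ch))           # A 65, a 97, 0 48, ! 33

    print("Hello!".encode("ascii"))  # b'Hello!' (the bytes 72, 101, 108, 108, 111, 33)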

Here's the mapping that was standardised in ASCII. As I mentioned, it was a 7-bit code. Back in those days, every single bit was precious and they decided that since they only needed seven bits to represent the alphabet and some useful punctuation, that's all they would use.

The astute members of the audience may have noticed a slight problem here. What if you had the gall to be Spanish, French, German, or even Canadian? How would you encode your funny little letters using this standard?

By this time, the eight-bit byte was pretty common (yes, kids, there used to be different numbers of bits in a byte), and people noticed that that eighth bit wasn't doing anything useful. By using it, we could get another 128 characters. Thus was born "extended ASCII". Here's an example, later standardised as the ISO-8859-1 code page.
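
As a small illustration (again a Python sketch, not from the slides): 'é' has no 7-bit ASCII code, but ISO-8859-1 gives it the single byte value 0xE9.

    word = "café"
    # word.encode("ascii")          # this would raise UnicodeEncodeError
    data = word.encode("iso-8859-1")
    print(list(data))               # [99, 97, 102, 233]; 'é' became the byte 0xE9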

So now everyone's happy, right?

Well, except the Russians. At this point, we've run out of bits. So we've got no choice but to reuse those 128 integers for Cyrillic letters.

Similarly, for Greek, we can redefine all those positions to mean something else. Let's just hope the Greeks never need to send the Russians an email.

Hopefully you started getting nervous when I said "we can redefine all those positions" on the previous slide. After all, if we redefine what the numbers mean, how can we interpret arbitrary data that we find on a disk or on the web?

This slide shows the various ways these four bytes can be decoded using various ISO-8859 code pages. I don't read Russian, but even I can tell that that string on the left is garbage.
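
The slide's exact bytes aren't reproduced in this transcript, but the effect is easy to recreate in Python with any short piece of Cyrillic text:

    data = "Привет".encode("iso-8859-5")   # Russian text, Cyrillic code page
    for codec in ("iso-8859-5", "iso-8859-1", "iso-8859-7"):
        print(codec, data.decode(codec, errors="replace"))
    # only the ISO-8859-5 decode round-trips; the other code pages map the same
    # byte values to completely different letters (or to no letter at all)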

There's also the problem of "how can I use Cyrillic and Greek in the same document?" And what about the Koreans and Chinese, who have thousands—not just 128—additional characters to encode? Even English users might like to use hundreds of technical and punctuation symbols, dingbats, and emoji. Let's solve that first.

Unicode was designed to solve the problems of multilingual documents. It assigns each character a unique integer and defines standard encoding formats for saving strings as byte sequences.

Since its definition, it's grown from a few thousand characters to over 100,000, and more characters are being proposed.
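
Here's a quick Python sketch of the two halves of that design: every character gets one code point, and the standard encoding forms turn those code points into bytes.

    for ch in "A", "é", "Ж", "Ω", "한", "😀":
        print(ch, hex(ord(ch)))        # 0x41, 0xe9, 0x416, 0x3a9, 0xd55c, 0x1f600

    print("Aé😀".encode("utf-8"))      # 7 bytes: 1 + 2 + 4
    print("Aé😀".encode("utf-16-le"))  # 8 bytes: 2 + 2 + 4 (a surrogate pair for the emoji)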

Page 2: Strings and encodings

Unicode allows us to write a document containing thousands of different characters, but still doesn't solve interoperability problems.

When opening an arbitrary file, you still have to choose an encoding to interpret the bytes. Say our string from before was encoded in UTF-8 (you'll notice it's now twice as long); if we incorrectly read that with the ISO-8859-1 encoding, we get the corrupted output shown below.
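
The exact string from the slide isn't in this transcript, but any Russian word shows the same thing in Python:

    text = "Привет"
    raw = text.encode("utf-8")
    print(len(text), len(raw))       # 6 characters, 12 bytes: twice as long, because
                                     # each Cyrillic letter takes two bytes in UTF-8
    print(raw.decode("iso-8859-1"))  # misread as ISO-8859-1, each Cyrillic letter
                                     # turns into two Latin-1 characters of garbage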

As I said before, bytes by themselves do not say what encoding they are. That information has to come from somewhere else.

There are three common ways the encoding is specified.

1. In a standard that defines the format. For example, JSON is defined as being encoded in UTF-8.

2. As metadata that's transmitted with the content. For example, an HTTP response header specifies the charset of the HTML document that is the body of the response (see the sketch after this list).

3. In the content itself. This is more complicated, because you have to read the content in order to determine how to read the content.
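
Here's a rough sketch of option 2 in Python; the header value and the body bytes are invented for the example.

    header = "text/html; charset=windows-1251"      # from the HTTP response

    charset = "utf-8"                               # you still need a fallback
    for part in header.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset":
            charset = value.strip('"')

    body = "Привет".encode("windows-1251")          # pretend this came over the wire
    print(body.decode(charset))                     # Привет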

Finally, if none of those works, you can try content sniffing. Each encoding has certain combinations of bytes that are more or less likely than others. Based on the frequency distribution you can try to guess the best encoding. Mozilla has an open source charset detection library.
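
In Python, the chardet package (a port of Mozilla's detector) does this kind of sniffing; the sample text here is made up.

    import chardet                   # pip install chardet

    data = "Привет, мир! Это пример текста на русском языке.".encode("windows-1251")
    print(chardet.detect(data))
    # typically something like {'encoding': 'windows-1251', 'confidence': 0.9, ...};
    # it is a statistical guess with a confidence score, not a guarantee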

On Windows, the two most common encodings are Windows-1252 (which is very similar to ISO-8859-1, shown earlier) and UTF-8. Using the wrong encoding to read bytes will give you the strings shown above.

Look carefully at and memorise this sample output: I guarantee that you will see it in your career. The first thing you should think when you see this pattern is: "encoding error".
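
You can reproduce the pattern in a line of Python:

    print("café".encode("utf-8").decode("windows-1252"))   # cafÃ©
    # the reverse mistake (Windows-1252 bytes read as UTF-8) fails outright here:
    # "café".encode("windows-1252").decode("utf-8")        # raises UnicodeDecodeError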

Just remember: "there ain't no such thing as plain text". You always need an encoding to read and write text data. When writing, if in doubt, use UTF-8. When reading, there's no good default.
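
In Python that means always passing the encoding explicitly (the file name here is made up):

    with open("notes.txt", "w", encoding="utf-8") as f:    # writing: just use UTF-8
        f.write("Привет, café, 한국어\n")

    with open("notes.txt", encoding="utf-8") as f:         # reading: state what you believe
        print(f.read(), end="")                            # the bytes are; don't rely on
                                                           # the platform default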

Questions?