What follows are notes on Joel Spolsky's blog post on character encodings.

An encoded string only makes sense when we know what encoding it uses; otherwise, we cannot interpret it correctly. A string sits in memory as zeros and ones. That's what makes abstraction useful: it hides the context-specific details so we need not worry about them. When we encode or decode a string, we surface those underlying details: running encode gets us the zeros and ones that represent the string in memory, and running decode interprets those zeros and ones back into characters.
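A short Python sketch of surfacing and re-hiding those details with the built-in str.encode and bytes.decode:

```python
text = "A"

# encode surfaces the underlying bytes for a chosen encoding
encoded = text.encode("utf-8")
print(encoded)                    # b'A'

# show the actual zeros and ones of that byte
print(format(encoded[0], "08b"))  # 01000001

# decode re-interprets the bytes as a string, using the same encoding
print(encoded.decode("utf-8"))    # A
```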
ASCII represents every printable English character using a number between 32 and 127; codes 0 to 31 are control characters, and the remaining codes, 128 to 255, were left over for special characters that varied from machine to machine.
The ANSI standard fixed the characters assigned to the numbers 0 to 127 and created "code pages" that specified different ways to handle the numbers from 128 to 255.
At the point in history when ASCII ruled, a character was generally considered to be a byte. In other words, each ASCII character maps to some 8-bit number.
character -> number less than 256 -> 8 bits in memory e.g. A -> 65 -> 0100 0001
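In Python, ord and chr expose this character-to-number mapping directly:

```python
# The ASCII-era mapping: character -> number -> bits
print(ord("A"))                 # 65
print(format(ord("A"), "08b"))  # 01000001
print(chr(65))                  # A
```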
Enter Unicode where things go like this instead:
character -> code point -> some bits in memory e.g. A -> U+0041 -> ...
Those "some bits" in Unicode depend on the Unicode encoding. There are hundreds of Unicode encodings. Here are three:
- UTF-8
- UCS-2 / UTF-16 big-endian
- UCS-2 / UTF-16 little-endian
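A sketch of how those encodings produce different bytes for the same code point; Python's codec names utf-16-be and utf-16-le select the byte order explicitly:

```python
ch = "A"  # code point U+0041

print(ch.encode("utf-8").hex())      # 41
print(ch.encode("utf-16-be").hex())  # 0041 -- big-endian: high byte first
print(ch.encode("utf-16-le").hex())  # 4100 -- little-endian: low byte first
```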
Those encodings and others like them can store any Unicode code point. When some other encoding (e.g. ASCII) cannot correctly represent a Unicode code point, we instead see question marks or boxes (often the replacement character, �).
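In Python, trying this raises a UnicodeEncodeError by default; passing errors="replace" substitutes question marks, which is one way those ? characters come about:

```python
word = "héllo"  # é is U+00E9, outside ASCII's range

# ASCII cannot represent é; errors="replace" swaps in a question mark
print(word.encode("ascii", errors="replace"))  # b'h?llo'
```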
How does UTF-8 encode more than 256 characters when a byte only holds 8 bits? Well, UTF-8 uses more than one byte. Joel writes: "In UTF-8, every code point from 0-127 is stored in a single byte [and] code points 128 and above are stored using 2, 3, in fact, up to 6 bytes." (Modern UTF-8, per RFC 3629, is limited to 4 bytes.) That allows UTF-8 to be backward compatible with the first 128 ASCII characters.
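A sketch of those variable byte lengths, using a few characters whose UTF-8 sizes span 1 to 4 bytes:

```python
# UTF-8 byte counts grow with the code point
for ch in ["A", "é", "€", "😀"]:
    print(ch, hex(ord(ch)), len(ch.encode("utf-8")))
# A 0x41 1
# é 0xe9 2
# € 0x20ac 3
# 😀 0x1f600 4
```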
Some Other Useful References