
Unicode

Getting Unicode characters into your HTML is not such a big deal : you can use the shortcut `&euro;`. However, not all characters have such a neat notation, so you can also use the decimal value `&#8364;` or the hexadecimal equivalent `&#x20AC;`. By the way, to NOT get the euro sign rendered in the last sentence, I used the `&amp;` code to escape the ampersand.

But with the Markdown code you might still end up with the rendered character :

euro code
€

So instead use the pre tags, and you will get :

<pre>
&euro;
</pre>
Unicode, also called the Universal Character Set (UCS), is a standard for encoding much larger character sets. The most commonly used encodings are UTF-16 and UTF-8, UTF meaning : Unicode Transformation Format, and the number representing the base number of bits per character. Base, so it can be more than 8 bits for UTF-8 and more than 16 for UTF-16.

Unicode characters are always the same and have a stable code called a codepoint, which is written like U+X, where X is the codepoint in hexadecimal. As a westerner you will probably only ever use 4-digit codepoints, but even that is not enough, so codepoints can also have more digits. An example, for the euro sign the codepoint is :

€ : U+20AC
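A quick way to inspect codepoints is Python's built-in ord() and chr() functions (a minimal Python 3 sketch, added here for illustration) :

```python
# ord() gives the codepoint of a character, chr() goes the other way.
euro = "\u20ac"            # the euro sign, written by its codepoint
print(euro)                # €
print(hex(ord(euro)))      # 0x20ac
print(chr(0x20AC))         # €
```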

Let's use this as a base file :

<pre>
I have 20 €
</pre>

Examining this file would give :

<pre>
echo "I have 20 €" > f
file f
f: UTF-8 Unicode text
xxd f
0000000: 4920 6861 7665 2032 3020 e282 ac0a       I have 20 ....
</pre>
For contrast, without 'weird' characters like that :

<pre>
echo "I have 20 euro" > f
file f
f: ASCII text
xxd f
0000000: 4920 6861 7665 2032 3020 6575 726f 0a    I have 20 euro.
</pre>

Actually this second file is also valid UTF-8 Unicode, because all ASCII characters are valid UTF-8 characters; file only finds a multi-byte character in the first file, and reports the smallest encoding that fits.
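That ASCII is a strict subset of UTF-8 is easy to check (a small Python 3 sketch of the file example above) :

```python
# Pure-ASCII text encodes to identical bytes under ASCII and UTF-8.
s = "I have 20 euro"
assert s.encode("ascii") == s.encode("utf-8")

# Once a non-ASCII character appears, ASCII can no longer encode it,
# and in UTF-8 it becomes a multi-byte sequence.
t = "I have 20 €"
print(t.encode("utf-8"))   # b'I have 20 \xe2\x82\xac'
```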

To make it UTF-16 you need to convert the file. There are several ways of doing that, but iconv is a simple one :

iconv
iconv -f UTF-8 -t UTF-16 f > ff
  • -f is the from encoding
  • -t is the to encoding
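The same conversion can be done in a few lines of Python (a sketch, added here; the file names f and ff match the shell example) :

```python
# Create the UTF-8 file from the earlier example, then re-encode it as
# UTF-16, which is roughly what iconv -f UTF-8 -t UTF-16 f > ff does.
with open("f", "wb") as src:
    src.write("I have 20 €\n".encode("utf-8"))

with open("f", "rb") as src:
    text = src.read().decode("utf-8")     # bytes -> unicode string

with open("ff", "wb") as dst:
    dst.write(text.encode("utf-16"))      # unicode -> UTF-16 bytes, BOM included
```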

So now we have a new file called ff :

file
<pre>
file ff
ff: Little-endian UTF-16 Unicode text, with no line terminators
xxd ff
0000000: fffe 4900 2000 6800 6100 7600 6500 2000  ..I. .h.a.v.e. .
0000010: 3200 3000 2000 ac20 0a00                 2.0. .. ..
</pre>

Simply put, all characters now occupy 16 bits. Note also that the file starts with a magic fffe code (the byte-order mark, BOM), which specifies both that it is UTF-16 and the endianness (little-endian in this case).

So this does take up more space, but you can imagine that string calculations will be much easier and faster now, because each character is 2 bytes. Note also that the euro sign in UTF-16 (0x20AC) is actually smaller than in UTF-8 (0xe282ac).
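The BOM and the per-character sizes can be checked from Python as well (a Python 3 sketch; the byte values are the ones from the xxd dumps above) :

```python
s = "I have 20 €"

utf8  = s.encode("utf-8")      # variable width: 1 byte per ASCII char, 3 for €
utf16 = s.encode("utf-16")     # 2 bytes per char, plus a 2-byte BOM up front

print(utf16[:2])               # b'\xff\xfe' on a little-endian machine: the BOM
print(len(utf8), len(utf16))   # 13 vs 24 bytes for the same 11 characters

# The euro sign itself: 3 bytes in UTF-8, 2 in UTF-16 (little-endian, no BOM).
print("€".encode("utf-8"))     # b'\xe2\x82\xac'
print("€".encode("utf-16-le")) # the bytes 0xac 0x20
```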

encoding

How do you encode a codepoint into a UTF-8 character? UTF-16 has a similar though slightly different way of coding, so I will not get into that.


Because it is UTF-8 encoded, 1-byte Unicode characters match the pattern 0aaaaaaa : any character 0-127 is just its ASCII value, so all characters in our "I have 20 €" file are left alone up to e2, which is the first byte with the high bit set.

2-byte patterns match this : 110bbbaa 10aaaaaa, and map onto the UTF-16
pattern 0000 0bbb aaaa aaaa. However, the code for the euro sign is :

e282 = 
1110 0010 1000 0010, it does not fit the 2-byte pattern :
110b bbaa 10aa aaaa

So we need more characters :
3 byte patterns match this : 1110bbbb 10bbbbaa 10aaaaaa and map onto UTF-16
pattern bbbb bbbb aaaa aaaa. So 

e282ac = 
1110 0010 1000 0010 1010 1100, that fits onto : 
1110 bbbb 10bb bbaa 10aa aaaa 
     0010   00 0010   10 1100 , rearranged : 
0010 0000 1010 1100           , that's 20AC 

codepoint U+20AC is indeed the euro sign.
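The bit-shuffling above can be written out in Python to check it (a sketch; the masks follow the 3-byte pattern described in the text) :

```python
# Encode a codepoint in the 3-byte UTF-8 range by hand, following the
# 1110bbbb 10bbbbaa 10aaaaaa pattern from the text.
def utf8_3byte(cp):
    b1 = 0b11100000 | (cp >> 12)          # marker 1110 + top 4 bits
    b2 = 0b10000000 | ((cp >> 6) & 0x3F)  # marker 10 + middle 6 bits
    b3 = 0b10000000 | (cp & 0x3F)         # marker 10 + low 6 bits
    return bytes([b1, b2, b3])

print(utf8_3byte(0x20AC))            # b'\xe2\x82\xac'
print(chr(0x20AC).encode("utf-8"))   # same thing, via the built-in codec
```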

in java

Java source files are meant to be Unicode files, though plain ASCII is mostly used. A file like this could be created :

<pre>
class Test
{
    public static void main(String args[])
    {
        double π = 3.1415;

        System.out.println(π);
    }
}
</pre>
And it compiles OK; running it gives :

output
3.1415

Although discouraged, you could write programs like that.

scons

You will probably run into trouble if you want to compile the code above with scons :

scons
<pre>
import glob
env = Environment()
files = glob.glob("*.java")
env.Java(".", files)
</pre>

Try to run this and you get errors like :

output
warning: unmappable character for encoding ASCII

This is because scons compiles in plain ASCII mode by default; you will have to actively change that by adding a LANG line :

ENV
<pre>
import glob

env = Environment()
env['ENV']['LANG'] = 'en_US.UTF-8'

files = glob.glob("*.java")

env.Java(".", files)
</pre>

python

Python (version 2 here) internally has strings that look a lot like C strings, but it always wants to default to ASCII.

python
[  a ][  b ][  c ] = "abc"
[ 97 ][ 98 ][ 99 ] = "abc"

All good : this is below 127 and thus usable as both ASCII and Unicode, which share characters 0-127. BUT now add a character above that range :

or
[ 97 ] [ 98 ] [ 99 ] [ 150 ] = "abc–"
repr
x = "abc" + chr(150)
print x
print repr(x)

The print will just not show the fourth character; repr() will give a 'printable representation'. Still no error, but if you want to mix it with a unicode string it will go horribly wrong :

mixing fails
u"Hello" + x

This will give an error :

output
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 3: ordinal not in range(128)
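The same failure is easy to reproduce in modern Python, where the bytes/text split is explicit (a Python 3 sketch of the Python 2 situation above) :

```python
# In Python 3 the byte string is explicit, but decoding it as ASCII
# fails for the same reason: 0x96 is not in range(128).
x = b"abc" + bytes([150])
try:
    x.decode("ascii")
except UnicodeDecodeError as e:
    print(e)   # 'ascii' codec can't decode byte 0x96 in position 3: ...
```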

You are now mixing unicode (u"") and ascii, and that fails. The whole string is now seen as unicode, and...