Unicode
First of all : view 'visit it's even a nice read. Second of all, getting unicode characters into your html is not such a big deal: you can use the named shortcut `&euro;`. However, not all characters have such a neat notation, so you can also use the decimal value `&#8364;` or the hexadecimal equivalent `&#x20AC;`. By the way, to NOT get the euro sign rendered in the last sentence, I use the `&amp;` code.
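As a quick check of the three notations, here is a minimal Python 3 sketch. It assumes the standard-library `html` module as a stand-in for what a browser does with entities; the variable names are only illustrative.

```python
import html

# The named, decimal and hexadecimal notations all decode to the euro sign.
named = html.unescape("&euro;")
decimal = html.unescape("&#8364;")
hexadecimal = html.unescape("&#x20AC;")
print(named, decimal, hexadecimal)

# Escaping the ampersand keeps the entity from being interpreted,
# which is the trick used to show the code itself instead of the sign.
literal = html.escape("&euro;")
print(literal)
```

The last line prints the escaped form, which a browser would then display as the literal entity text rather than the euro sign.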
But with the markdown code you might still end up with :
| euro code | |
|---|---|
So instead use the pre tags, you will get :
Unicode, also called the Universal Character Set (UCS), is a standard for encoding large character sets. The most commonly used encodings are UTF-16 and UTF-8. UTF stands for Unicode Transformation Format, and the number is the base number of bits per character. Base, so a character can take more than 8 bits in UTF-8 and more than 16 bits in UTF-16.

Unicode characters are always the same and have a stable code called a codepoint, written as U+X, where X is the codepoint in hexadecimal. As a westerner you will probably only ever use 4-digit codepoints, but even that is not enough, so larger numbers can be present in the codepoint. For example, the codepoint of the euro sign is :
€ : U+20AC
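To look up a codepoint yourself, a small Python 3 sketch like the following can help. It uses only the built-ins `ord` and `chr` plus the standard `unicodedata` module; the variable names are made up for this example.

```python
import unicodedata

euro = "€"

# ord() gives the codepoint as an integer; format it in the U+XXXX style.
codepoint = ord(euro)
label = f"U+{codepoint:04X}"
print(label)

# The Unicode database also knows the official character name.
name = unicodedata.name(euro)
print(name)

# chr() goes the other way, from codepoint back to character.
round_trip = chr(0x20AC)
print(round_trip == euro)
```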
Let's use this as a base file :
```html
I have 20 €
```
Examining this file would give :
<pre>
echo "I have 20 €" > f
file f
f: UTF-8 Unicode text
xxd f
0000000: 4920 6861 7665 2032 3020 e282 ac0a I have 20 ....
</pre>
A plain ASCII file is actually also valid UTF-8, because all ASCII characters are valid UTF-8 characters. file only reports UTF-8 when it detects a multi-byte character, as it does here; otherwise it displays the smallest fitting file format.
To make it UTF-16, you need to convert the file. There are several ways of doing that, but iconv is a simple one :
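The same bytes can be reproduced without xxd. This is a minimal Python 3 sketch, assuming the file content is exactly what the echo command wrote (the text plus a trailing newline); the variable names are illustrative.

```python
text = "I have 20 €\n"  # same content that echo wrote to the file f

# Encode to UTF-8 and show the bytes in hex, comparable to the xxd dump.
utf8_hex = text.encode("utf-8").hex()
print(utf8_hex)

# A pure ASCII string encodes to identical bytes in ASCII and in UTF-8.
ascii_text = "I have 20\n"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)

# The euro sign is what pushes the file out of plain ASCII.
try:
    text.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)
```

Only the three bytes e2 82 ac differ from what a plain ASCII encoding could express.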
| iconv | |
|---|---|
- -f is the from encoding
- -t is the to encoding
So now we have a new file called ff :
| file | |
|---|---|
Simply put: all characters now occupy 16 bits. Also note that the file starts with a magic fffe code (the byte order mark), which both specifies that it is UTF-16 and states the endianness (little-endian in this case).
So this does take up more space, but you can imagine that string calculation will be easier and faster now, because each of these characters is exactly 2 bytes (characters outside the 16-bit range need 4 bytes). Note also that the euro sign in UTF-16 (0x20ac) is actually smaller than in UTF-8 (0xe282ac).
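The size trade-off can be checked with a short Python 3 sketch. It assumes little-endian UTF-16 with an explicit byte order mark, built from `codecs.BOM_UTF16_LE`, so that the result does not depend on the machine it runs on; the variable names are illustrative.

```python
import codecs

text = "I have 20 €\n"

utf8 = text.encode("utf-8")
# Prepend the byte order mark by hand; "utf-16-le" itself writes no BOM.
utf16 = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

utf8_size = len(utf8)
utf16_size = len(utf16)
print(utf8_size, utf16_size)

# The first two bytes are the BOM, the rest is 2 bytes per character here.
bom_hex = utf16[:2].hex()
print(bom_hex)

# The euro sign: 2 bytes in UTF-16 (stored low byte first), 3 in UTF-8.
euro_utf16_hex = "€".encode("utf-16-le").hex()
euro_utf8_hex = "€".encode("utf-8").hex()
print(euro_utf16_hex, euro_utf8_hex)

# A character beyond U+FFFF needs a surrogate pair, so 4 bytes in UTF-16.
emoji_utf16_size = len("\U0001F600".encode("utf-16-le"))
print(emoji_utf16_size)
```

The last value shows why "2 bytes per character" only holds for codepoints that fit in 16 bits.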
encoding
How to encode a codepoint into a UTF-8 character. UTF-16 has a similar though slightly different way of coding, so I will not get into that.
reference : visit
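Because it is UTF-8 encoded, 1-byte unicode chars match the pattern 0aaaaaaa, so any character 0-127 is just its ASCII value. All characters in our "I have 20 €" file are therefore left alone up to e2, which is the first byte with the first bit set. That byte, 11100010, matches the 3-byte lead pattern 1110aaaa and is followed by two continuation bytes of the form 10bbbbbb: 82 (10000010) and ac (10101100). Joining the payload bits 0010 000010 101100 gives 0x20ac.
Codepoint U+20AC is indeed the euro sign.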
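The bit patterns can be sketched in Python 3 as follows. The helper names `utf8_encode` and `utf8_decode3` are made up for this example, and the code is a simplified illustration: it skips the validation a real codec does (surrogates, overlong forms, range checks).

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one codepoint by hand, using the UTF-8 bit patterns."""
    if codepoint < 0x80:        # 0aaaaaaa
        return bytes([codepoint])
    if codepoint < 0x800:       # 110aaaaa 10bbbbbb
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:     # 1110aaaa 10bbbbbb 10cccccc
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 11110aaa 10bbbbbb 10cccccc 10dddddd
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])


def utf8_decode3(data: bytes) -> int:
    """Decode a 3-byte sequence by stripping the marker bits."""
    lead, second, third = data
    return ((lead & 0x0F) << 12) | ((second & 0x3F) << 6) | (third & 0x3F)


euro_bytes = utf8_encode(0x20AC)
print(euro_bytes.hex())

euro_codepoint = utf8_decode3(bytes.fromhex("e282ac"))
print(f"U+{euro_codepoint:04X}")
```

Comparing `utf8_encode(ord(c))` with `c.encode("utf-8")` for a few characters is an easy way to convince yourself the patterns are right.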
in java
Java source files are meant to be unicode files, though simple ASCII is mostly used. A file like this could be created :
<pre>
class Test
{
    public static void main(String args[])
    {
        double π = 3.1415;
        System.out.println(π);
    }
}
</pre>
| output | |
|---|---|
Although discouraged, you could write programs like that.
scons
You will probably run into trouble if you want to compile the code above with scons :
Try to run this and you get errors like :
| output | |
|---|---|
This is because scons compiles in simple ASCII mode by default. You will have to actively change that by adding this line :
| ENV | |
|---|---|
python
Python 2 internally has strings that look a lot like C strings, and it always wants to default to ASCII.
All good, this is below 127 and thus usable as both ASCII and unicode, which share characters 0-127. BUT now add a character above that range :
| or | |
|---|---|
The print will just not show the fourth character; repr() will give a 'printable representation'. Still no error, but if you want to mix it with a unicode string it will go horribly wrong :
| mixing fails | |
|---|---|
This will give an error :
| output | |
|---|---|
You are now mixing unicode (u"") and ascii, and that fails. The whole string is now seen as unicode, and...