You evidently know next to NOTHING about this. See:
http://www.unicode.org/reports/tr17/tr17-4.html
Specifically:
--------------------------------------------------------------------
Examples of variable-width encoding forms:
  UTF-8
    used only with Unicode/10646: a mix of one to four 8-bit code units
    in Unicode and one to six code units in 10646
  UTF-16
    used only with Unicode/10646: a mix of one to two 16 bit code units
--------------------------------------------------------------------
So in theory it is not even 2-3 but 2-4 bytes for Unicode characters with a code > 127.
However, characters that have to be encoded in as many as 4 bytes hardly ever occur
in a Czech context. That is why I wrote 2-3 bytes.
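
To make the byte counts concrete, here is a minimal sketch in Pascal. The function name Utf8Length is made up for illustration (it is not from any library), and it assumes the input is a valid code point in the range 0..$10FFFF.

```pascal
program Utf8LengthDemo;

// Hypothetical helper: number of bytes UTF-8 needs for one code point.
// Assumption: CodePoint is in the range 0..$10FFFF.
function Utf8Length(CodePoint: LongWord): Integer;
begin
  if CodePoint < 128 then
    Utf8Length := 1        // plain ASCII
  else if CodePoint < 2048 then
    Utf8Length := 2        // covers the Czech accented letters
  else if CodePoint < 65536 then
    Utf8Length := 3        // rest of the 16-bit range
  else
    Utf8Length := 4;       // above $FFFF
end;

begin
  WriteLn(Utf8Length($0041));   // 'A'                      -> 1
  WriteLn(Utf8Length($010D));   // c with caron             -> 2
  WriteLn(Utf8Length($20AC));   // euro sign                -> 3
  WriteLn(Utf8Length($10000));  // first code above $FFFF   -> 4
end.
```

The Czech accented letters all have codes below 2048, so they take 2 bytes each. Symbols such as the euro sign take 3.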
Unicode 3.0 has not only 2-byte characters but also 4-byte characters, because 2 bytes
were no longer enough for the new version to encode all the characters needed. So, to
keep the arithmetic simple, it went straight to 4 bytes. UTF-8 needs up to 3 bytes to
encode a 16-bit Unicode value and 4 bytes for values above 65535 (up to U+10FFFF).
The conversion 16-bit Unicode ==> UTF-8 is shown by the following piece of Pascal code:
case (UnicodeCharWord) of
  0..127: begin
    // ASCII: one byte, copied unchanged
    // (presumably UnicodeCharChar overlays UnicodeCharWord, low byte first)
    AddString(pOutput, UnicodeCharChar[0]);
  end; // 0..127
  128..2047: begin
    // two bytes: 110xxxxx 10xxxxxx
    TempChar := Char(192 + (UnicodeCharWord div 64));
    AddString(pOutput, TempChar);
    TempChar := Char(128 + (UnicodeCharWord mod 64));
    AddString(pOutput, TempChar);
  end; // 128..2047
  2048..65535: begin
    // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    // (surrogate values 55296..57343 are not treated specially here)
    TempChar := Char(224 + (UnicodeCharWord div 4096));
    AddString(pOutput, TempChar);
    TempChar := Char(128 + ((UnicodeCharWord div 64) mod 64));
    AddString(pOutput, TempChar);
    TempChar := Char(128 + (UnicodeCharWord mod 64));
    AddString(pOutput, TempChar);
  end; // 2048..65535
end; // case
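
The snippet above stops at 65535. As a rough sketch of how the same div/mod arithmetic extends to the 4-byte range, something like the following could be used. The names EncodeUtf8 and TUtf8Bytes are invented for this example and are not part of the code above. Rejecting surrogate values and anything above $10FFFF is my assumption about the desired behaviour.

```pascal
program Utf8EncodeDemo;

type
  TUtf8Bytes = array[0..3] of Byte;

// Hypothetical helper: encodes one code point into Buf and returns the
// number of bytes written. Returns 0 for surrogate values and for
// anything above $10FFFF (an assumption made for this sketch).
function EncodeUtf8(CodePoint: LongWord; var Buf: TUtf8Bytes): Integer;
begin
  if CodePoint < 128 then
  begin
    Buf[0] := CodePoint;
    EncodeUtf8 := 1;
  end
  else if CodePoint < 2048 then
  begin
    Buf[0] := 192 + (CodePoint div 64);
    Buf[1] := 128 + (CodePoint mod 64);
    EncodeUtf8 := 2;
  end
  else if (CodePoint >= $D800) and (CodePoint <= $DFFF) then
    EncodeUtf8 := 0  // surrogate halves are not characters on their own
  else if CodePoint < 65536 then
  begin
    Buf[0] := 224 + (CodePoint div 4096);
    Buf[1] := 128 + ((CodePoint div 64) mod 64);
    Buf[2] := 128 + (CodePoint mod 64);
    EncodeUtf8 := 3;
  end
  else if CodePoint <= $10FFFF then
  begin
    // four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    Buf[0] := 240 + (CodePoint div 262144);
    Buf[1] := 128 + ((CodePoint div 4096) mod 64);
    Buf[2] := 128 + ((CodePoint div 64) mod 64);
    Buf[3] := 128 + (CodePoint mod 64);
    EncodeUtf8 := 4;
  end
  else
    EncodeUtf8 := 0; // outside the Unicode range
end;

var
  Buf: TUtf8Bytes;
  Len, I: Integer;
begin
  Len := EncodeUtf8($10348, Buf);  // a code point above $FFFF
  for I := 0 to Len - 1 do
    Write(Buf[I], ' ');            // prints 240 144 141 136
  WriteLn;
end.
```

Returning the bytes in a fixed Byte array rather than a string sidesteps the question of whether Char is 8 or 16 bits wide in a given compiler.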