 |
hellkvist.org Discussions about the free software on hellkvist.org
Bernhard Goodwin Guest
|
Posted: Tue Mar 29, 2005 5:15 am Post subject: tackling utf-16 |
|
|
Hi!
With the help of unicode.org and Wikipedia I have tried to tackle the conversion of UTF-16 into UTF-8:
http://www.unicode.org/faq/utf_bom.html#45
http://www.unicode.org/Public/PROGRAMS/CVTUTF/
I have added in diarylib.php
Code: |
function lookup( $thisChar ) {
//first call: only 16 Bit provided
if (strlen ($thisChar) == 2) {
// transform into a value:
$thisCharVal=ord($thisChar[0])+
ord($thisChar[1])*0x100;
// if bin110110xx xxxxxxxx (high surrogate): lookup needs a surrogate char
if (($thisCharVal&0xFC00)==0xD800) $result="more!";
// if bin00000000 0xxxxxxx: utf8-char is the utf16 lower byte
else
if ($thisCharVal<0x80)
$result=$thisChar[0];
// utf16 has to be translated into utf-8
else {
//$thisCharVal=0x00F8;
// number of bytes to write is dependent on the value
// also the Mask for the last (i.e. first) char is set
if ($thisCharVal<0x800) {$bytesNum=2; $mark=0xC0;}
else {$bytesNum=3; $mark=0xE0;}
// Result-String is created
$result="";
for ( $i = 0; $i < ($bytesNum-1); $i++ ){
// every char is bin 10xxxxxx
$result = chr(($thisCharVal | 0x80) & 0xBF).$result;
// shift down 6 Bit.
$thisCharVal >>=6;
}
// 2 Bytes: the last (i.e. first) char is only 5 Bits left
// it is marked with 110xxxxx
// 3 Bytes: the last (i.e. first) char is only 4 Bits left
// it is marked with 1110xxxx
$result = chr($thisCharVal | $mark).$result;
}
// return $result
return $result;
}
// second call: 32 Bit provided
else
if (strlen ($thisChar) == 4) {
// transform low surrogate (2nd char) into a value:
$lowSurrogateVal=ord($thisChar[2])+
ord($thisChar[3])*0x100;
// transform high surrogate (1st char) into a value:
$highSurrogateVal=ord($thisChar[0])+
ord($thisChar[1])*0x100;
// check for correct values: both surrogate marks must be present
if ((($lowSurrogateVal&0xFC00)!=0xDC00) ||
(($highSurrogateVal&0xFC00)!=0xD800)) {$result="false!";return $result;}
// High-Surrogate (U+D800 ... U+DBFF)
// |15 8|7 0|
// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
// |1 1 0 1 1 0 Z Z|Z Z x x x x x x|
// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
// Low-Surrogate (U+DC00 ... U+DFFF)
// |15 8|7 0|
// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
// |1 1 0 1 1 1 y y|y y y y y y y y|
// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
// UTF-32:
// 31 24|23 16|15 8|7 0|
// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
// |0 0 0 0 0 0 0 0|0 0 0 z z z z z|x x x x x x y y|y y y y y y y y|
// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
// with zzzzz = ZZZZ+1
// translate surrogate-pair into utf-32
// Delete the Markers
$lowSurrogateVal -= 0xDC00;
$highSurrogateVal -= 0xD800;
// add 1 to ZZZZ i.e. add 64 to highSurrogate
$highSurrogateVal += 64;
// shift the xxxxxx
$highSurrogateVal <<= 10;
// combine high and low
$thisCharVal = $lowSurrogateVal + $highSurrogateVal ;
// four bytes to write
$result="";
for ( $i = 0; $i < 3; $i++ ){
// every char is bin 10xxxxxx
$result = chr(($thisCharVal | 0x80) & 0xBF).$result;
// shift down 6 Bit.
$thisCharVal >>=6;
}
// the last (i.e. first) char is only 3 Bits left
// it is marked with 11110xxx
$result = chr($thisCharVal | 0xF0).$result;
}
//if neither of the two condition is filled the input was wrong
else $result="false!";
return $result;
}
|
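To make the surrogate arithmetic from the comments above concrete, here is a minimal sketch. It is not part of diarylib.php, and the function name surrogates_to_codepoint() is made up for illustration. It assumes the two 16-bit values have already been read in the right byte order.

```php
<?php
// Hypothetical helper for illustration only; it is not part of diarylib.php.
// Combines a UTF-16 surrogate pair (two 16-bit values) into one code point.
function surrogates_to_codepoint($high, $low) {
    // High surrogates are 0xD800..0xDBFF, low surrogates are 0xDC00..0xDFFF.
    if (($high & 0xFC00) != 0xD800 || ($low & 0xFC00) != 0xDC00) {
        return false; // not a valid pair
    }
    // Strip the markers, then apply the "zzzzz = ZZZZ + 1" correction,
    // which is the same as adding 0x10000 to the combined 20-bit value.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

// U+1D11E (musical G clef) is stored in UTF-16 as D834 DD1E.
printf("U+%X\n", surrogates_to_codepoint(0xD834, 0xDD1E)); // prints U+1D11E
```

Note that the masks are 0xFC00, not 0xD800 or 0xDC00: only the top six bits identify a surrogate, the remaining ten bits are payload.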
lookup() is used by the function text_decode():
Code: |
function text_decode( $text ) {
// Nokia sends text in UTF-16 encoding.
// this transforms UTF16 into UTF8
// for Conversions to other Charsets use e.g.
// http://unix.freshmeat.net/projects/convertcharacterset/
//UTF-16 comes in different Byte orders:
//It starts with the bytes 0xFF 0xFE or 0xFE 0xFF depending on byte order
//UTF-16LE (Little Endian) (lower byte is first byte)
if ( ord( $text[0] ) == 0xff &&
ord( $text[1] ) == 0xfe ){
$result = "";
for ( $i = 2; $i < strlen( $text ); $i += 2 ){
// there has been an error with the length
if ($i+1==strlen($text)) {$result = "I couldn't decode text:\n".
"Wrong length of the text";break;}
// translate the current character
$curchar = lookup($text[$i].$text[$i+1]);
// lookup gives "more!" back if it needs a surrogate char
if ($curchar == "more!") {
// there has been an error with the length (Missing surrogate char)
if ($i+4>strlen($text)) {$result = "I couldn't decode text:\n".
"Wrong length of the text";break;}
// surrogate char is the next two bytes
$curchar = lookup($text[$i].$text[$i+1].$text[$i+2].$text[$i+3]);
// counter is incremented extra
$i+=2;
}
// there has been an error with lookup:
if ($curchar == "false!") {$result = "I couldn't decode text:\n".
"An error with lookup()";break;}
// curchar is added to the resultstring
$result .= $curchar;
}
return $result;
}
//UTF-16BE (Big Endian) (lower byte is second byte)
else
if ( ord( $text[0] ) == 0xfe &&
ord( $text[1] ) == 0xff ) {
$result = "";
for ( $i = 2; $i < strlen( $text ); $i += 2 ){
// there has been an error with the length
if ($i+1==strlen($text)) {$result = "I couldn't decode text:\n".
"Wrong length of the text";break;}
// translate the current character
$curchar = lookup($text[$i+1].$text[$i]);
// lookup gives "more!" back if it needs a surrogate char
if ($curchar == "more!") {
// there has been an error with the length (Missing surrogate char)
if ($i+4>strlen($text)) {$result = "I couldn't decode text:\n".
"Wrong length of the text";break;}
// surrogate char is the next two bytes
$curchar = lookup($text[$i+1].$text[$i].$text[$i+3].$text[$i+2]);
// counter is incremented extra
$i+=2;
}
// there has been an error with lookup:
if ($curchar == "false!") {$result = "I couldn't decode text:\n".
"An error with lookup()";break;}
// curchar is added to the resultstring
$result .= $curchar;
}
return $result;
}
return $text;
}
|
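The byte-order handling can also be sketched with PHP's built-in unpack(), which reads 16-bit values in a given byte order ('v' is little-endian, 'n' is big-endian). This is only an illustration under the assumption that the text always starts with a byte-order mark. The function name utf16_units() is hypothetical and not part of diarylib.php.

```php
<?php
// Hypothetical helper for illustration only; it is not part of diarylib.php.
// Returns the 16-bit code units of a BOM-prefixed UTF-16 string,
// or false if there is no BOM or the length is odd.
function utf16_units($text) {
    $bom = substr($text, 0, 2);
    if ($bom === "\xFF\xFE") {
        $format = 'v*'; // FF FE: little-endian, low byte comes first
    } elseif ($bom === "\xFE\xFF") {
        $format = 'n*'; // FE FF: big-endian, high byte comes first
    } else {
        return false;   // no byte-order mark
    }
    $body = (string) substr($text, 2);
    if (strlen($body) % 2 != 0) {
        return false;   // a UTF-16 string has an even number of bytes
    }
    if ($body === '') {
        return array();
    }
    // unpack() returns a 1-based array; array_values() renumbers it from 0.
    return array_values(unpack($format, $body));
}

// The same surrogate pair (D834 DD1E) in both byte orders:
echo implode(' ', array_map('dechex', utf16_units("\xFF\xFE\x34\xD8\x1E\xDD"))), "\n"; // d834 dd1e
echo implode(' ', array_map('dechex', utf16_units("\xFE\xFF\xD8\x34\xDD\x1E"))), "\n"; // d834 dd1e
```

Once the units are plain integers, the rest of the conversion no longer needs to care about byte order.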
This implementation is rather long. It could surely be much shorter, but I wanted it to be understandable.
Bernie
--
http://www.BeGood.de |
|
|
 |
Bernhard Goodwin Guest
|
Posted: Tue Mar 29, 2005 2:59 pm Post subject: Improvements |
|
|
I've dropped the lookup function and made the code a little shorter:
Code: |
function text_decode( $text ) {
// Nokia sends text in UTF-16 encoding.
// this transforms UTF16 into UTF8
// for Conversions to other Charsets use e.g.
// http://unix.freshmeat.net/projects/convertcharacterset/
$textLen=strlen( $text );
// a UTF-16-String has to have an even number of bytes
// (and at least the two bytes of the byte-order mark)
if ( $textLen < 2 || ($textLen&0x01) != 0x00) return $text;
//UTF-16 comes in different Byte orders:
//It starts with the bytes 0xFF 0xFE or 0xFE 0xFF depending on byte order
//UTF-16LE (Little Endian) (lower byte is first byte)
//byte-order is given here (for the max of 4 Byte)
if ( ord( $text[0] ) == 0xff &&
ord( $text[1] ) == 0xfe )
$bo=array(0,1,2,3);
//UTF-16BE (Big Endian) (lower byte is second byte)
else if ( ord( $text[0] ) == 0xfe &&
ord( $text[1] ) == 0xff )
$bo=array(1,0,3,2);
// we assume it's not utf-16 if it doesn't start with 0xFF 0xFE or 0xFE 0xFF
else return $text;
// if the last char of the String is a high surrogate
// the low surrogate is missing: not valid utf-16
if ((ord($text[$textLen-2+$bo[1]])&0xFC) == 0xD8) return $text;
$result="";
for ( $i = 2; $i < strlen( $text ); $i += 2 ){
// ResultString for this Char is created
$res="";
// if bin110110xx [xxxxxxxx]: we need a surrogate char
if ( (ord($text[$i+$bo[1]])&0xFC) == 0xD8 ) {
// check whether next char is a valid low surrogate
// i.e. bin110111xx [xxxxxxxx] -> otherwise an Invalid Value is set
if ( (ord($text[$i+$bo[3]])&0xFC) != 0xDC ) $curChar=0x110000;
else {
// surrogate char is the next two bytes
// it is transformed into utf-32 at once
// the markers for the surrogate pair are masked out (& 0x03)
// basis: low surrogate
// then add high surrogate
// and correction of the value:
$curChar = ( ord($text[$i+$bo[2]]) ) +
( (ord($text[$i+$bo[3]])&0x03) << 8 ) +
( ord($text[$i+$bo[0]]) << 10 ) +
( (ord($text[$i+$bo[1]])&0x03) << 18 ) +
0x0010000;
// counter is incremented extra
$i+=2;
}
}
else
/* translate the current character */
$curChar = ord($text[$i+$bo[0]])+
(ord($text[$i+$bo[1]])<<8);
// number of bytes to write is dependent on the value
// also the mark for the last (i.e. first) char is set
if ($curChar<0x80) {$bytesNum=1; $mark=0x00;}
else if ($curChar<0x800) {$bytesNum=2; $mark=0xC0;}
else if ($curChar<0x10000) {$bytesNum=3; $mark=0xE0;}
else if ($curChar<0x110000) {$bytesNum=4; $mark=0xF0;}
else {$bytesNum=0; $res="_";}
// each utf-8 continuation byte is bin 10xxxxxx
switch ($bytesNum) { /* note: everything falls through. */
case 4: { $res = chr(($curChar | 0x80) & 0xBF).$res ; $curChar >>= 6; }
case 3: { $res = chr(($curChar | 0x80) & 0xBF).$res ; $curChar >>= 6; }
case 2: { $res = chr(($curChar | 0x80) & 0xBF).$res ; $curChar >>= 6; }
// last (i.e. first) utf-8-char is marked for beginning of sequence
// or not (if $bytesNum=1 then $mark=0x00)
case 1: $res = chr(($curChar | $mark) ).$res ; }
// curchar is added to the resultstring
$result .= $res;
}
return $result;
}
|
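The "number of bytes plus lead mark" idea used in the function above can be isolated and checked on its own. The following is a rough sketch with a loop instead of the fall-through switch. The name utf8_from_codepoint() is made up for illustration and is not part of diarylib.php, and the "_" placeholder for out-of-range values simply mirrors the choice made above.

```php
<?php
// Hypothetical helper for illustration only; it is not part of diarylib.php.
// Encodes one Unicode code point as a UTF-8 byte string.
function utf8_from_codepoint($cp) {
    if ($cp < 0x80) {
        return chr($cp);                  // 1 byte: 0xxxxxxx
    } elseif ($cp < 0x800) {
        $bytesNum = 2; $mark = 0xC0;      // lead byte 110xxxxx
    } elseif ($cp < 0x10000) {
        $bytesNum = 3; $mark = 0xE0;      // lead byte 1110xxxx
    } elseif ($cp < 0x110000) {
        $bytesNum = 4; $mark = 0xF0;      // lead byte 11110xxx
    } else {
        return "_";                       // outside the Unicode range
    }
    $res = "";
    // Build the continuation bytes (10xxxxxx) from the lowest 6 bits upwards.
    for ($k = 1; $k < $bytesNum; $k++) {
        $res = chr(0x80 | ($cp & 0x3F)) . $res;
        $cp >>= 6;
    }
    // Whatever is left fits into the lead byte.
    return chr($mark | $cp) . $res;
}

echo bin2hex(utf8_from_codepoint(0x00F8)), "\n";  // c3b8
echo bin2hex(utf8_from_codepoint(0x20AC)), "\n";  // e282ac
echo bin2hex(utf8_from_codepoint(0x1D11E)), "\n"; // f09d849e
```

A handful of known code points like these make a quick sanity check for the lead-byte marks, since a wrong mark (for example 0xE0 on a 4-byte sequence) shows up immediately in the hex output.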
Everything I've posted here is, of course, GNU.
Without comments, the function is only 40 lines. I think that's OK.
Bernie
--
http://www.BeGood.de |
|
|
 |
Peffis Site Admin
Joined: 09 Sep 2003 Posts: 324 Location: Sweden
|
Posted: Tue Mar 29, 2005 3:20 pm Post subject: |
|
|
Cool Bernard! Thanks for the information and feedback! |