hellkvist.org Forum Index hellkvist.org
Discussions about the free software on hellkvist.org
 

tackling utf-16

 
hellkvist.org Forum Index -> MMS Diary
Bernhard Goodwin
Guest





Posted: Tue Mar 29, 2005 5:15 am    Post subject: tackling utf-16

Hi!

With the help of unicode.org and Wikipedia I have tried to tackle the conversion of UTF-16 into UTF-8:
http://www.unicode.org/faq/utf_bom.html#45
http://www.unicode.org/Public/PROGRAMS/CVTUTF/

I have added the following to diarylib.php:

Code:

function lookup( $thisChar ) {
   // first call: only 16 bits provided (low byte first)
   if (strlen ($thisChar) == 2) {
     // transform into a value:
     $thisCharVal=ord($thisChar[0])+
               ord($thisChar[1])*0x100;
     // if bin110110xx xxxxxxxx: high surrogate, lookup needs the low surrogate too
     if (($thisCharVal&0xFC00)==0xD800) $result="more!";

     // if bin110111xx xxxxxxxx: a low surrogate without its high surrogate is invalid
     else
     if (($thisCharVal&0xFC00)==0xDC00) $result="false!";

     // if bin00000000 0xxxxxxx: utf8-char is the utf16 lower byte
     else
     if ($thisCharVal<0x80)

       $result=$thisChar[0];

     // utf16 has to be translated into utf-8
     else {
       // number of bytes to write is dependent on the value
       // also the Mask for the last (i.e. first) char is set
       if ($thisCharVal<0x800) {$bytesNum=2; $mark=0xC0;}
       else {$bytesNum=3; $mark=0xE0;}

       // Result-String is created
       $result="";
       for ( $i = 0; $i < ($bytesNum-1); $i++  ){
         // every char is bin 10xxxxxx
         $result = chr(($thisCharVal | 0x80) & 0xBF).$result;
         // shift down 6 Bit.
         $thisCharVal >>=6;
       }

       // 2 Bytes: the last (i.e. first) char is only 5 Bits left
       // it is marked with 110xxxxx
       // 3 Bytes: the last (i.e. first) char is only 4 Bits left
       // it is marked with 1110xxxx

       $result = chr($thisCharVal | $mark).$result;

     }
     // return $result
     return $result;
   }
   // second call: 32 Bit provided
   else
   if (strlen ($thisChar) == 4) {
     // transform low surrogate (2nd char) into a value:
     $lowSurrogateVal=ord($thisChar[2])+
            ord($thisChar[3])*0x100;
     // transform high surrogate (1st char) into a value:
     $highSurrogateVal=ord($thisChar[0])+
             ord($thisChar[1])*0x100;
     // reject the input unless both surrogate marks are correct
     if ((($lowSurrogateVal&0xFC00)!=0xDC00) ||
         (($highSurrogateVal&0xFC00)!=0xD800)) {$result="false!";return $result;}

     // High-Surrogate (U+D800 ... U+DBFF)
     // |15            8|7             0|
     // +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     // |1 1 0 1 1 0 Z Z|Z Z x x x x x x|
     // +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     // Low-Surrogate (U+DC00 ... U+DFFF)
     // |15            8|7             0|
     // +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     // |1 1 0 1 1 1 y y|y y y y y y y y|
     // +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     // UTF-32:
     // 31            24|23           16|15            8|7             0|
     // +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     // |0 0 0 0 0 0 0 0|0 0 0 z z z z z|x x x x x x y y|y y y y y y y y|
     // +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     // with zzzzz = ZZZZ+1

     // translate surrogate-pair into utf-32
     // Delete the Markers
     $lowSurrogateVal -= 0xDC00;
     $highSurrogateVal -= 0xD800;

     // add 1 to ZZZZ i.e. add 64 to highSurrogate
     $highSurrogateVal += 64;
     // shift the xxxxxx
     $highSurrogateVal <<= 10;
     // combine high and low
     $thisCharVal = $lowSurrogateVal + $highSurrogateVal ;


     // four bytes to write
     $result="";
     for ( $i = 0; $i < 3; $i++  ){
       // every char is bin 10xxxxxx
       $result = chr(($thisCharVal | 0x80) & 0xBF).$result;
       // shift down 6 Bit.
       $thisCharVal >>=6;
     }
     // the last (i.e. first) char has only 3 bits left
     // it is marked with 11110xxx
     $result = chr($thisCharVal | 0xF0).$result;
   }
   //if neither of the two condition is filled the input was wrong
   else $result="false!";
   return $result;
}
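
The surrogate-pair arithmetic from the diagrams in lookup() can be checked on its own. This is only a minimal sketch: the helper surrogate_pair_to_code_point() is a hypothetical name, not part of diarylib.php, and it assumes the two 16-bit values have already been assembled from the bytes.

```php
<?php
// Hypothetical helper for illustration; not part of diarylib.php.
// Combines a UTF-16 surrogate pair into one code point, or returns
// false if either value lies outside its surrogate range.
function surrogate_pair_to_code_point(int $high, int $low) {
    // high surrogate: bin110110xx xxxxxxxx, low surrogate: bin110111xx xxxxxxxx
    if (($high & 0xFC00) !== 0xD800 || ($low & 0xFC00) !== 0xDC00) {
        return false;
    }
    // strip the markers, join the two 10-bit halves, add the 0x10000 offset
    return (($high - 0xD800) << 10) + ($low - 0xDC00) + 0x10000;
}

// U+1F600 is stored in UTF-16 as the pair D83D DE00
printf("U+%X\n", surrogate_pair_to_code_point(0xD83D, 0xDE00));
```

Adding 0x10000 after the shift gives the same result as adding 64 to the high part before shifting it up by 10 bits, as lookup() does.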


lookup() is used by the function text_decode():

Code:

function text_decode( $text ) {
   // Nokia sends text in UTF-16 encoding.
   // this transforms UTF16 into UTF8
   // for Conversions to other Charsets use e.g.
   // http://unix.freshmeat.net/projects/convertcharacterset/


   // too short to carry a byte order mark: not UTF-16
   if ( strlen( $text ) < 2 ) return $text;

   //UTF-16 comes in different byte orders:
   //it starts with the byte order mark, stored as 0xFF 0xFE or 0xFE 0xFF
   $result = "";
   //UTF-16LE (Little Endian) (lower byte is first byte)
   if ( ord( $text[0] ) == 0xff &&
        ord( $text[1] ) == 0xfe ){
     for ( $i = 2; $i < strlen( $text ); $i += 2 ){
       // there has been an error with the length
       if ($i+1==strlen($text)) {$result = "I couldn't decode text:\n".
                        "Wrong length of the text";break;}

       // translate the current character
       $curchar = lookup($text[$i].$text[$i+1]);

       // lookup gives "more!" back if it needs a surrogate char
       if ($curchar === "more!") {
         // there has been an error with the length (Missing surrogate char)
         if ($i+4>strlen($text)) {$result = "I couldn't decode text:\n".
                     "Wrong length of the text";break;}
         // surrogate char is the next two bytes
         $curchar = lookup($text[$i].$text[$i+1].$text[$i+2].$text[$i+3]);
         // counter is incremented extra
         $i+=2;
         }

       // there has been an error with lookup:
       if ($curchar === "false!") {$result = "I couldn't decode text:\n".
                         "An error with lookup()";break;}
       // curchar is added to the resultstring
       $result .= $curchar;
     }
     return $result;
   }

   //UTF-16BE (Big Endian) (lower byte is second byte)
   else
   if ( ord( $text[0] ) == 0xfe &&
        ord( $text[1] ) == 0xff ) {
     for ( $i = 2; $i < strlen( $text ); $i += 2 ){
       // there has been an error with the length
       if ($i+1==strlen($text)) {$result = "I couldn't decode text:\n".
                        "Wrong length of the text";break;}

       // translate the current character (bytes swapped: lookup wants the low byte first)
       $curchar = lookup($text[$i+1].$text[$i]);

       // lookup gives "more!" back if it needs a surrogate char
       if ($curchar === "more!") {
         // there has been an error with the length (Missing surrogate char)
         if ($i+4>strlen($text)) {$result = "I couldn't decode text:\n".
                     "Wrong length of the text";break;}
         // surrogate char is the next two bytes
         $curchar = lookup($text[$i+1].$text[$i].$text[$i+3].$text[$i+2]);
         // counter is incremented extra
         $i+=2;
         }

       // there has been an error with lookup:
       if ($curchar === "false!") {$result = "I couldn't decode text:\n".
                         "An error with lookup()";break;}
       // curchar is added to the resultstring
       $result .= $curchar;
     }
     return $result;
   }
   // no byte order mark: the text is returned unchanged
   return $text;
}


This implementation is rather long. It could surely be much shorter, but I wanted it to be understandable.
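
One place to shorten it is the byte-order handling, which is written out twice above. The following is only a sketch of a different technique, not the code from diarylib.php: it uses PHP's built-in unpack() with the formats 'v*' (little-endian 16-bit) and 'n*' (big-endian 16-bit), and the helper name utf16_code_units() is made up for this example.

```php
<?php
// Hypothetical helper for illustration; not part of diarylib.php.
// Splits BOM-prefixed UTF-16 into a list of 16-bit code units,
// or returns false if there is no BOM or the length is odd.
function utf16_code_units(string $text) {
    if (strlen($text) < 2 || strlen($text) % 2 !== 0) {
        return false;
    }
    $bom = substr($text, 0, 2);
    if ($bom === "\xFF\xFE") {
        $format = 'v*';   // little-endian 16-bit values
    } elseif ($bom === "\xFE\xFF") {
        $format = 'n*';   // big-endian 16-bit values
    } else {
        return false;
    }
    // unpack() returns a 1-based array; array_values() renumbers it from 0
    return array_values(unpack($format, substr($text, 2)));
}

// "A" followed by the euro sign, big-endian with BOM
print_r(utf16_code_units("\xFE\xFF\x00\x41\x20\xAC"));
```

With the code units as plain integers, the surrogate and UTF-8 steps no longer need to care about byte order.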

Bernie
--
http://www.BeGood.de
Bernhard Goodwin
Guest





Posted: Tue Mar 29, 2005 2:59 pm    Post subject: Improvements

I've dropped the lookup function and made the code a little shorter:

Code:

function text_decode( $text ) {
   // Nokia sends text in UTF-16 encoding.
   // this transforms UTF16 into UTF8
   // for Conversions to other Charsets use e.g.
   // http://unix.freshmeat.net/projects/convertcharacterset/

   $textLen=strlen( $text );

   // a UTF-16-String has to have an even number of bytes
   // and at least the two bytes of the byte order mark
   if ( $textLen<2 || ($textLen&0x01) != 0x00) return $text;

   //UTF-16 comes in different byte orders:
   //it starts with the byte order mark, stored as 0xFF 0xFE or 0xFE 0xFF
   //$bo holds the offsets (low byte, high byte) for up to two 16-bit units
   //UTF-16LE (Little Endian) (lower byte is first byte)
   if ( ord( $text[0] ) == 0xff &&
        ord( $text[1] ) == 0xfe )
         $bo=array(0,1,2,3);
   //UTF-16BE (Big Endian) (lower byte is second byte)
   else if ( ord( $text[0] ) == 0xfe &&
        ord( $text[1] ) == 0xff )
         $bo=array(1,0,3,2);
   // we assume it's not utf-16 if it doesn't start with a byte order mark
   else return $text;

   // if the last char of the String is a high surrogate
   // the low surrogate is missing: not valid utf-16
   if ((ord($text[$textLen-2+$bo[1]])&0xFC) == 0xD8) return $text;

   $result="";
   for ( $i = 2; $i < $textLen; $i += 2 ){
     // ResultString for this Char is created
     $res="";

     // if bin110110xx [xxxxxxxx]: high surrogate, we need the low surrogate too
     if ( (ord($text[$i+$bo[1]])&0xFC) == 0xD8 ) {
       // check whether next char is a valid low surrogate
       // i.e. bin110111xx [xxxxxxxx] -> otherwise an invalid value is set
       if ( (ord($text[$i+$bo[3]])&0xFC) != 0xDC ) $curChar=0x110000;
       else {
         // the surrogate pair is transformed into utf-32 at once
         // the &0x03 masks delete the markers of the surrogate pair
         // basis: low surrogate
         // then add high surrogate
         // and correction of the value:

         $curChar  = ( ord($text[$i+$bo[2]]) ) +
           ( (ord($text[$i+$bo[3]])&0x03) << 8 ) +
           ( ord($text[$i+$bo[0]]) << 10 ) +
           ( (ord($text[$i+$bo[1]])&0x03) << 18 ) +
             0x0010000;
         // counter is incremented extra
         $i+=2;
       }
     }
     // bin110111xx [xxxxxxxx]: low surrogate without a high surrogate -> invalid
     else if ( (ord($text[$i+$bo[1]])&0xFC) == 0xDC ) $curChar=0x110000;
     else
       /* translate the current character */
       $curChar = ord($text[$i+$bo[0]])+
            (ord($text[$i+$bo[1]])<<8);

     // number of bytes to write is dependent on the value
     // also the mark for the last (i.e. first) char is set
     if ($curChar<0x80) {$bytesNum=1; $mark=0x00;}
     else if ($curChar<0x800) {$bytesNum=2; $mark=0xC0;}
     else if ($curChar<0x10000) {$bytesNum=3; $mark=0xE0;}
     else if ($curChar<0x110000) {$bytesNum=4; $mark=0xF0;}
     // invalid value: write a placeholder and no utf-8 bytes
     else {$bytesNum=0; $res="_";}


     // each utf-8 continuation byte is bin 10xxxxxx
     switch ($bytesNum) { /* note: everything falls through. */
     case 4: { $res = chr(($curChar | 0x80) & 0xBF).$res ; $curChar >>= 6; }
     case 3: { $res = chr(($curChar | 0x80) & 0xBF).$res ; $curChar >>= 6; }
     case 2: { $res = chr(($curChar | 0x80) & 0xBF).$res ; $curChar >>= 6; }
     // last (i.e. first) utf-8-char is marked for beginning of sequence
     // or not (if $bytesNum=1 then $mark=0x00)
     case 1: $res = chr(($curChar | $mark) ).$res ; }
     // curchar is added to the resultstring
     $result .= $res;
   }
   return $result;
}


Everything I've posted here is of course GNU-licensed.

Shortened and without comments, the function is only 40 lines. I think that's OK.
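
The last step of text_decode(), turning one code point into 1 to 4 UTF-8 bytes, can also be tried in isolation. A minimal sketch, assuming a hypothetical helper code_point_to_utf8() that is not part of diarylib.php; it uses a loop instead of the fall-through switch and returns "_" for invalid values, as the code above does.

```php
<?php
// Hypothetical helper for illustration; not part of diarylib.php.
// Encodes one Unicode code point as UTF-8; invalid values give "_".
function code_point_to_utf8(int $cp): string {
    // out of range, or a surrogate value that must not appear in UTF-8
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        return "_";
    }
    if ($cp < 0x80) {
        return chr($cp);                  // plain ASCII, one byte
    }
    // number of bytes and the mark of the leading byte depend on the value
    if ($cp < 0x800)        { $bytesNum = 2; $mark = 0xC0; }
    elseif ($cp < 0x10000)  { $bytesNum = 3; $mark = 0xE0; }
    else                    { $bytesNum = 4; $mark = 0xF0; }

    $out = "";
    for ($k = 1; $k < $bytesNum; $k++) {
        // each continuation byte is bin 10xxxxxx and carries 6 bits
        $out = chr(0x80 | ($cp & 0x3F)) . $out;
        $cp >>= 6;
    }
    // the remaining high bits go into the leading byte
    return chr($mark | $cp) . $out;
}

// the euro sign U+20AC takes three bytes: e2 82 ac
echo bin2hex(code_point_to_utf8(0x20AC)), "\n";
```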

Bernie
--
http://www.BeGood.de
Peffis
Site Admin


Joined: 09 Sep 2003
Posts: 324
Location: Sweden

Posted: Tue Mar 29, 2005 3:20 pm

Cool Bernard! Thanks for the information and feedback!
All times are GMT + 1 Hour