[nycphp-talk] PHP + UTF-8 + mb_string issue.

Anirudh Zala arzala at gmail.com
Wed Mar 21 01:20:26 EDT 2007


Hello Everybody,

While building a truly multilingual project, I am running into an interesting
problem with PHP 5 + UTF-8 + mbstring. Please study the table below carefully.
I have taken one word in three languages, English, Finnish (from Finland), and
Gujarati (from India), to test PHP's handling of Unicode with single-byte and
multibyte strings using the mbstring extension.

The word to the left of the "=" sign is the actual string whose length is to
be counted. What I have tried here is to count the length of the word in each
language. For English and Finnish I get correct results, but for Gujarati it
seems that mbstring(?) is not working properly.

=======================================================
zala = 1 word; 4 bytes; 4 characters (z, a, l, a); 4 key-strokes (z, a, l, a); 
"strlen" should be 4 and is 4 also.

zälä = 1 word; 4 bytes; 4 characters (z, ä, l, ä); 4 key-strokes (z, ä, l, ä); 
"strlen" should be 4 and is 4 also.

ઝાલા  = 1 word; 4 bytes; 2 characters (ઝા, લા); 4 key-strokes (ઝ, ા, લ, ા); 
"strlen" should be 2 but is 4.
=======================================================
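
To double-check the numbers in the table, a small sketch like the one below
could print both the byte count and the code point count for each word. It
assumes the mbstring extension is loaded and that the words are stored as
UTF-8; byte_and_char_length() is a hypothetical helper name, not a PHP
built-in. In UTF-8, "ä" takes 2 bytes and each Gujarati code point takes 3, so
strlen() and mb_strlen() should only agree for the plain ASCII word.

```php
<?php
// Sketch only: compares byte length with code point length.
// Assumes the mbstring extension is loaded; byte_and_char_length() is a
// hypothetical helper name.
function byte_and_char_length($text)
{
    return array(
        strlen($text),             // number of bytes
        mb_strlen($text, 'UTF-8'), // number of Unicode code points
    );
}

// The three test words written as explicit UTF-8 bytes.
$words = array(
    'zala',
    "z\xC3\xA4l\xC3\xA4",                               // zälä
    "\xE0\xAA\x9D\xE0\xAA\xBE\xE0\xAA\xB2\xE0\xAA\xBE", // ઝાલા
);

foreach ($words as $word) {
    list($bytes, $chars) = byte_and_char_length($word);
    printf("%s: %d bytes, %d code points\n", $word, $bytes, $chars);
}
// zala: 4 bytes, 4 code points
// zälä: 6 bytes, 4 code points
// ઝાલા: 12 bytes, 4 code points
```

So a result of 4 for the Gujarati word is the code point count, which matches
the four key-strokes rather than the two visible characters.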

The question is why PHP is not able to count the length of the given string
in a practical way. I am aware that current PHP versions have no notion of
characters in a string; they just deal with bytes. In that sense the output is
correct, but it is not a practical result, because the length of the word in
Gujarati is only "2" and not "4", even if it requires 4 bytes to store the
data. (In Indic languages we have primary characters like "ઝ" and secondary
characters like "ા", and secondary characters have no value without a primary
character.)
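
The "practical" length described above corresponds to what Unicode calls
grapheme clusters: a primary character together with its secondary characters
counts as one unit. A minimal sketch of counting them, assuming either the
intl extension (grapheme_strlen()) or a PCRE build with Unicode support is
available; count_graphemes() is a hypothetical helper name, and results for
more complex conjuncts may vary with the Unicode version of the library.

```php
<?php
// Sketch only: counts user-perceived characters (grapheme clusters) in a
// UTF-8 string. count_graphemes() is a hypothetical helper, not part of PHP.
function count_graphemes($text)
{
    // Prefer the intl extension when it is loaded.
    if (function_exists('grapheme_strlen')) {
        return grapheme_strlen($text);
    }
    // Fallback: with the /u flag, \X matches one grapheme cluster in PCRE.
    return preg_match_all('/\X/u', $text, $matches);
}

$gujarati = "\xE0\xAA\x9D\xE0\xAA\xBE\xE0\xAA\xB2\xE0\xAA\xBE"; // ઝાલા
echo count_graphemes($gujarati), "\n"; // 2
echo count_graphemes('zala'), "\n";    // 4
```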

I am sure that I am not missing any settings at the server, PHP, or client
level that are needed to make this work correctly. English and Finnish are
different languages but they are part of the same character set (i.e. Latin)
and their glyphs are also the same, while Gujarati has a different character
set and its glyphs are also different. But this should not create this problem
if "mbstring" is capable of handling strings properly.

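One way to rule out an encoding setting is to pass the encoding to mb_strlen()
explicitly instead of relying on the configured internal encoding. The sketch
below assumes the mbstring extension is loaded; describe_length() is a
hypothetical helper name used only for illustration.

```php
<?php
// Sketch only: shows how the declared encoding changes what mb_strlen()
// counts. describe_length() is a hypothetical helper, not part of PHP.
function describe_length($text)
{
    return array(
        'utf8'   => mb_strlen($text, 'UTF-8'),         // code points
        'latin1' => mb_strlen($text, 'ISO-8859-1'),    // one unit per byte
        'valid'  => mb_check_encoding($text, 'UTF-8'), // well-formed UTF-8?
    );
}

$word = "z\xC3\xA4l\xC3\xA4"; // "zälä" as UTF-8 bytes
print_r(describe_length($word));
// utf8 => 4, latin1 => 6, valid => true
```

If the two counts are equal for a non-ASCII word, the string is probably not
being treated as UTF-8 somewhere along the way.
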
Thanks,

Anirudh Zala


