This CMS will store all text in utf8 charset
Mon Dec 7 2009
All data, including user-submitted blog posts, comments, etc.,
should be stored in the utf8 charset.
This will make the site more multilingual-friendly,
and it also makes parsing and storing RSS feeds much easier, as well
as generating XML feeds.
Some important points:
Always execute the MySQL query 'SET NAMES utf8' on every new DB
connection. This will ensure that utf8 is used for all inserts and all
selects.
All MySQL tables should be in the utf8 charset.
All incoming data should be checked to be a valid utf8 string, which
may not be very easy to do, but we must do our best.
We must have a reliable way to convert a string into utf8, but ONLY IF we
have detected that it's not already in utf8 format.
We must use the mbstring extension (the mb_* functions) for all string
functions, including regex, strlen, strstr, explode, substr, etc. (there
is no mb_explode(); mb_split() is the closest equivalent). This is
actually very important!
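To show why the mbstring point above matters, here is a minimal sketch
comparing the byte-oriented functions with their mb_* counterparts. The
sample string and variable names are made up for illustration, and the
sketch assumes the source file itself is saved as utf8.

```php
<?php
// Tell mbstring which encoding to assume when none is passed explicitly.
mb_internal_encoding('UTF-8');

// Sample string (an assumption for this sketch): 11 characters, 20 bytes.
$title = "Привет, мир";

// strlen() counts bytes, mb_strlen() counts characters.
$byteCount = strlen($title);      // 20
$charCount = mb_strlen($title);   // 11

// substr() cuts through the middle of the 2nd character and leaves
// an invalid utf8 string; mb_substr() cuts on character boundaries.
$brokenHead = substr($title, 0, 3);
$cleanHead  = mb_substr($title, 0, 3);   // "При"

// Offsets differ too: byte offset vs. character offset.
$bytePos = strpos($title, "мир");      // 14
$charPos = mb_strpos($title, "мир");   // 8

printf("%d bytes, %d characters\n", $byteCount, $charCount);
printf("clean head: %s, broken head still valid utf8: %s\n",
    $cleanHead,
    mb_check_encoding($brokenHead, 'UTF-8') ? 'yes' : 'no');
```

Anything that truncates, measures, or searches user text (excerpts, length
limits, highlighting) would hit the byte/character mismatch shown here.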
There are 3 different ways to convert into utf8:
The simplest one is utf8_encode(), but it works ONLY for ISO-8859-1 strings
(the default encoding on the Internet today).
iconv() is not a very good option: it is known to have problems, and
different versions produce different results. It does have a useful
option, //TRANSLIT, but //TRANSLIT may not be the best choice when we work
with utf8 strings. It's a good choice when converting from one encoding to
another non-utf8 encoding, where the charset that we convert to may not
have the same character as the charset we convert from, in which case
iconv will pick the closest character. So in short, this may be a decent
option when converting from utf8 to ISO-8859-1 or windows-1252 (which I
don't plan on doing).
The third way is to use mb_convert_encoding(). It's probably the best way
and produces the best results, but for it to work properly we MUST know
what encoding the string is currently in. We should not ignore this
metadata. Sometimes it's included in email messages, sometimes in forms
submitted via browser (somewhere in the headers), and it's in the XML feed
file, where it actually should always be utf8 in all RSS feeds. If we
don't know the current encoding, then we must have the best way to detect
(guess) the charset encoding, which is never 100% reliable, but
mb_detect_encoding() is the best bet.
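A rough sketch of the third option, under some assumptions: to_utf8() is a
hypothetical helper name, $declared stands for whatever charset the
metadata gave us (HTTP header, XML declaration, email header), and the
candidate list used for guessing is only an example. The guess is made in
strict mode, so an encoding is only accepted if the whole string is valid
in it.

```php
<?php
/**
 * Hypothetical helper: convert $text to utf8.
 * $declared is the charset taken from metadata, or null when unknown.
 */
function to_utf8(string $text, ?string $declared = null): string
{
    // Example candidate list (an assumption). ISO-8859-1 accepts any
    // byte sequence, so it acts as the last-resort guess.
    $candidates = ['UTF-8', 'ISO-8859-1'];

    $from = $declared;
    if ($from === null) {
        // No metadata: fall back to guessing, which is never 100% reliable.
        $from = mb_detect_encoding($text, $candidates, true);
        if ($from === false) {
            throw new InvalidArgumentException('Could not guess the encoding');
        }
    }

    // mb_convert_encoding(string, to, from)
    return mb_convert_encoding($text, 'UTF-8', $from);
}

// Declared charset: trust the metadata.
$fromFeed = to_utf8("caf\xE9", 'ISO-8859-1');

// Unknown charset: "\xE9" at the end is not valid utf8, so the guess
// falls through to ISO-8859-1.
$guessed = to_utf8("caf\xE9");
```

Passing the declared charset whenever it is available keeps the guessing
path for the cases where the metadata really is missing.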
I will have to think about how best to implement the utf8 policy. It's
not very easy. Basically, a valid ASCII string is already a valid utf8
string, so we don't have to run utf8_encode() on such strings.
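A small sketch of the point about ASCII, and of why conversion should
happen only when a string is not already utf8. It uses
mb_convert_encoding() from ISO-8859-1 in place of utf8_encode(), which
does the same conversion but is deprecated as of PHP 8.2. The helper names
latin1_to_utf8() and ensure_utf8() are hypothetical, and the ISO-8859-1
fallback is an assumption.

```php
<?php
// Same conversion utf8_encode() performs: read bytes as ISO-8859-1,
// write them out as utf8.
function latin1_to_utf8(string $s): string
{
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

// Plain ASCII is already valid utf8 and comes out of the conversion
// unchanged, so converting it is pointless.
$ascii = "Hello, world";
$asciiIsUtf8    = mb_check_encoding($ascii, 'UTF-8');
$asciiUnchanged = latin1_to_utf8($ascii) === $ascii;

// A string that is already utf8 gets mangled if converted again:
// the two bytes of "é" become two separate characters ("Ã©").
$utf8   = "caf\xC3\xA9";
$double = latin1_to_utf8($utf8);

// Hypothetical policy helper: convert ONLY IF the string is not
// already valid utf8. The fallback charset is an assumption.
function ensure_utf8(string $s, string $fallback = 'ISO-8859-1'): string
{
    return mb_check_encoding($s, 'UTF-8')
        ? $s
        : mb_convert_encoding($s, 'UTF-8', $fallback);
}

$kept      = ensure_utf8($utf8);       // left untouched
$converted = ensure_utf8("caf\xE9");   // converted from ISO-8859-1
```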
Also, we should use the mb_check_encoding() function to check that a
string is valid utf8.
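One possible way to apply mb_check_encoding() to incoming data is
sketched here. invalid_utf8_fields() is a hypothetical helper, and the
$post array stands in for real submitted form data.

```php
<?php
/**
 * Hypothetical helper: return the names of the string fields whose
 * values are not valid utf8.
 */
function invalid_utf8_fields(array $input): array
{
    $bad = [];
    foreach ($input as $name => $value) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $bad[] = $name;
        }
    }
    return $bad;
}

// Stand-in for submitted form data (made up for this sketch).
$post = [
    'title'   => "Caf\xC3\xA9 notes",      // valid utf8
    'comment' => "broken \xC3\x28 bytes",  // invalid: 0xC3 needs a continuation byte
    'author'  => "caf\xE9",                // ISO-8859-1 bytes, not valid utf8
];

$bad = invalid_utf8_fields($post);   // ['comment', 'author']
```

What to do with the flagged fields (reject, or try to convert them) is a
policy decision this sketch leaves open.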