LampCMS open source XSLT based content management

The first open source CMS that uses browser based XSLT transformation

This CMS will store all text in utf8 charset
Mon Dec 7 2009
All data, including user-submitted blog posts, comments, etc., should be stored in the utf8 charset.

This will make the site more multilingual-friendly.

It also makes parsing and storing RSS feeds much easier, as well as generating XML feeds.

Some important points:

Always execute the MySQL query 'SET NAMES utf8' on every new DB connection. This ensures that utf8 is used for all inserts and all selects.

All MySQL tables should use the utf8 charset.
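The two MySQL points above could be sketched roughly like this. It is only an illustration: the connectUtf8() helper, the blog_post table and its columns are made-up names, and PDO is assumed as the database layer (the post itself does not say which one is used).

```php
<?php
// Hypothetical helper: opens a MySQL connection and forces utf8 on it.
function connectUtf8(string $dsn, string $user, string $password): PDO
{
    $pdo = new PDO($dsn, $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
    // Run on every new connection, before any other query.
    $pdo->exec("SET NAMES utf8");
    return $pdo;
}

// Hypothetical table definition; the table and column names are invented.
const CREATE_POSTS_TABLE = <<<SQL
CREATE TABLE IF NOT EXISTS blog_post (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    body TEXT NOT NULL
) DEFAULT CHARSET=utf8
SQL;
```

Usage would be something like connectUtf8('mysql:host=localhost;dbname=example', 'user', 'secret') followed by $pdo->exec(CREATE_POSTS_TABLE), where the DSN and credentials are placeholders.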

All incoming data should be checked to be a valid utf8 string, which may not be very easy to do, but we must do our best.

We must have a reliable way to convert a string into utf8, but ONLY IF we have detected that it's not already in utf8 format.

We must use the mbstring extension (the mb_* functions) for all string functions, including regex, strlen, strstr, explode (via mb_split), substr, etc. This is actually very important!
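To show why the mbstring point matters, here is a small sketch comparing the byte-based functions with their mb_* counterparts on a utf8 string. The truncateChars() helper is a hypothetical name used only for this example.

```php
<?php
// Hypothetical helper: cut a utf8 string to at most $maxChars characters.
function truncateChars(string $text, int $maxChars): string
{
    return mb_substr($text, 0, $maxChars, 'UTF-8');
}

// "héllo" written with byte escapes: the é takes two bytes in utf8.
$word = "h\xC3\xA9llo";

$byteCount = strlen($word);             // counts bytes, not characters
$charCount = mb_strlen($word, 'UTF-8'); // counts characters

$brokenCut = substr($word, 0, 2);       // cuts through the middle of é
$safeCut   = truncateChars($word, 2);   // keeps é intact
```

Here strlen() reports 6 while mb_strlen() reports 5, and the substr() result is no longer a valid utf8 string because it ends on half of a two-byte character.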

There are 3 different ways to convert into utf8:

The simplest one is utf8_encode(), but it works ONLY for ISO-8859-1 strings (the default encoding on the Internet today).

iconv() is not a very good option: it is known to have problems, and different versions produce different results. It does have one useful option, //TRANSLIT.
However, //TRANSLIT may not be the best choice when we work with utf8 strings. It is a good choice when converting from one encoding to another non-utf8 encoding, where the charset we convert to may not have the same character as the charset we convert from, in which case iconv will pick the closest character. In short, this may be a decent option when converting from utf8 to ISO-8859-1 or windows-1252 (which I don't plan on doing).
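For illustration only, this is roughly what a //TRANSLIT conversion from utf8 to ISO-8859-1 looks like. The toLatin1Translit() helper is a hypothetical name, and the result for characters missing from the target charset depends on the iconv implementation PHP was built against, so no particular output is promised here.

```php
<?php
// Hypothetical helper: utf8 to ISO-8859-1 with transliteration.
// Returns false when this platform's iconv rejects the conversion.
function toLatin1Translit(string $utf8)
{
    // @ hides the notice iconv raises on failure; the caller checks for false.
    return @iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $utf8);
}

// é exists in ISO-8859-1, so no approximation is needed for it.
$exact = toLatin1Translit("caf\xC3\xA9");

// The euro sign does not exist in ISO-8859-1; what comes back
// (an approximation, a "?", or false) varies between iconv versions.
$approx = toLatin1Translit("5 \xE2\x82\xAC");
```

Checking the return value for false on every call is the important part, since some builds refuse the //TRANSLIT suffix entirely.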

The third way is to use mb_convert_encoding(). It's probably the best way and produces the best results, but for it to work properly we MUST know what encoding the string is currently in. We should not ignore this metadata. Sometimes it's included in email messages, sometimes in forms submitted via the browser (somewhere in the headers), and in an XML feed file it's in the XML declaration; a feed that declares no encoding should be treated as utf8, which is the XML default. If we don't know the current encoding, then we need the best way to detect (guess) the charset, which is never 100% reliable, but mb_detect_encoding() is the best bet.
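A minimal sketch of this approach, under some assumptions: toUtf8() is a hypothetical helper name, the candidate list passed to mb_detect_encoding() is just an example guess at likely legacy charsets, and falling back to ISO-8859-1 when detection fails is an arbitrary choice.

```php
<?php
// Hypothetical helper: return $input as utf8.
// $declaredCharset is whatever the email, HTTP headers or XML declaration said.
function toUtf8(string $input, ?string $declaredCharset = null): string
{
    $declared = ($declaredCharset === null) ? '' : strtoupper(trim($declaredCharset));

    // A declared non-utf8 charset is trusted and used directly.
    // Note: an unknown charset name makes mb_convert_encoding() fail.
    if ($declared !== '' && $declared !== 'UTF-8' && $declared !== 'UTF8') {
        return mb_convert_encoding($input, 'UTF-8', $declared);
    }

    // Convert ONLY IF the string is not already valid utf8.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }

    // No usable metadata: guess among a few legacy charsets (strict mode).
    $guess = mb_detect_encoding($input, ['ISO-8859-1', 'Windows-1252'], true);
    $from  = ($guess === false) ? 'ISO-8859-1' : $guess;

    return mb_convert_encoding($input, 'UTF-8', $from);
}
```

For example, toUtf8("caf\xE9", 'ISO-8859-1') converts the single-byte é into its two-byte utf8 form, while a string that is already valid utf8 is returned untouched when no other charset is declared.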

I will have to think about how best to implement the utf8 policy. It's not very easy. Basically, a valid ASCII string is already a valid utf8 string, so we don't have to run utf8_encode() on such strings.

Also, we should use the mb_check_encoding() function to check whether a string is valid utf8.
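A rough sketch of how such a check might look, combined with the ASCII shortcut mentioned earlier. The function names isAscii(), isValidUtf8() and needsConversion() are hypothetical and only meant to illustrate the idea.

```php
<?php
// Hypothetical helper: true when the string contains only 7-bit ASCII bytes.
function isAscii(string $s): bool
{
    return preg_match('/[^\x00-\x7F]/', $s) === 0;
}

// Hypothetical helper: true when the bytes form a valid utf8 sequence.
function isValidUtf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

// Hypothetical helper: plain ASCII and valid utf8 are stored as-is;
// anything else has to be converted first.
function needsConversion(string $s): bool
{
    return !isAscii($s) && !isValidUtf8($s);
}
```

With these, needsConversion("hello") and needsConversion("caf\xC3\xA9") are both false, while needsConversion("caf\xE9") is true because a lone 0xE9 byte is not valid utf8.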