zhenlin wrote:Not if you're looking at >33% Greek (or Cyrillic, or any other characters in the U+0080 to U+07FF range). Why? Because those take up 2 bytes each in UTF-8. So... when 33% of all characters fall in that range, the bytes spent on them roughly equal the bytes spent on the ASCII characters, a 1:1 ratio, which is acceptable. If it grows past 1:1, it begins to make more sense to use an encoding that can encode those characters in one byte.
When you start looking at CJK characters, you see 3 bytes per character in UTF-8... whereas UTF-16 needs only 2 bytes per character.
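You can check those byte counts yourself; here's a quick Python sketch (the sample strings are just illustrative, and the UTF-16 sizes use little-endian without a BOM):

```python
# Compare encoded sizes of the same text in UTF-8 vs UTF-16.
samples = {
    "ASCII": "hello world",
    "Greek": "γειά σου κόσμε",   # letters in the U+0080–U+07FF range: 2 bytes each in UTF-8
    "CJK":   "你好，世界",        # 3 bytes each in UTF-8, 2 each in UTF-16
}

for name, text in samples.items():
    utf8_len = len(text.encode("utf-8"))
    utf16_len = len(text.encode("utf-16-le"))  # "-le" so no BOM is prepended
    print(f"{name}: {len(text)} chars -> UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

For the ASCII sample UTF-8 wins outright (11 bytes vs 22); for Greek the two come out nearly even (spaces are still 1 byte in UTF-8); for the CJK sample UTF-16 wins (10 bytes vs 15).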
As you can see, there are trade-offs. UTF-8 is suitable for a majority-ASCII datascape... but how long is that going to last?
Maybe someone will develop a more effective way of encoding Unicode... perhaps a system with codepage selectors? Ack... that sounds nasty and stupid, and would be difficult to implement on top of it.
That's right. But let's use what solves the problem now.
(utf8, mod_gzip, etc.)
Using codepages isn't even an option, because we do use many characters from many languages. Unicode even has characters for the international system of units (like a Hz character), math symbols, etc. that can be really useful.