木木彡blog

MySQL connection collation: utf8_general_ci and utf8

木木彡82 2015-03-05 10:26:34 434 views

Reposted from:

http://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci

http://segmentfault.com/q/1010000000132450

There are at least two important differences:

Accuracy of sorting

utf8_unicode_ci is based on the Unicode standard for sorting, and sorts accurately in a very wide range of languages.

utf8_general_ci comes close to correct Unicode sorting in many common languages, but has a number of limitations: in some languages, it won't sort correctly at all. In others, it will merely have some quirks.

Performance

utf8_general_ci is faster at comparisons and sorting, because it takes a bunch of performance-related shortcuts.

utf8_unicode_ci uses a much more complex comparison algorithm which aims for correct sorting in a very wide range of languages. This makes it slower to sort and compare large numbers of fields.

Unicode defines complex sets of rules for how characters should be sorted. These rules need to take into account language-specific conventions; not everybody sorts their characters in what we would call 'alphabetical order'.

As far as Latin (i.e. "European") languages go, there is not much difference between the Unicode sorting and the simplified utf8_general_ci sorting in MySQL, but there are still a few differences:

For example, the Unicode collation sorts "ß" like "ss", and "Œ" like "OE", as people using those characters would normally want, whereas utf8_general_ci sorts them as single characters (presumably like "s" and "e" respectively).

In non-Latin languages, such as Asian languages or languages with different alphabets, there may be a lot more differences between Unicode sorting and the simplified utf8_general_ci sorting. The suitability of utf8_general_ci will depend heavily on the language used. For some languages, it'll be quite inadequate.

Some Unicode characters are defined as ignorable, which means they shouldn't count toward the sort order and the comparison should move on to the next character instead. utf8_unicode_ci handles these properly.
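To make "ignorable" concrete, here is a rough Python sketch. It is a toy model, not how MySQL implements utf8_unicode_ci: the `IGNORABLE` set is a hypothetical stand-in that lists only the soft hyphen and the zero-width space, and `sort_key` is a name made up for this illustration.

```python
# Toy model of ignorable characters in a comparison key.
# Assumption: IGNORABLE holds just two sample characters (soft hyphen and
# zero-width space); the real Unicode tables are far larger.
IGNORABLE = {"\u00ad", "\u200b"}


def sort_key(text):
    """Drop ignorable characters, then fold case, before comparing."""
    return "".join(ch for ch in text if ch not in IGNORABLE).lower()


# The soft hyphen does not count toward the comparison, so both spellings
# produce the same key and compare as equal.
same = sort_key("co\u00adoperate") == sort_key("cooperate")
```

A plain code point comparison would treat the two spellings as different strings, which is the kind of quirk the simplified collation can show.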

What should you use?

There is almost never any reason to use utf8_general_ci anymore, as we have left behind the point where CPU speed is low enough that the performance difference would be important. Your database will almost certainly be limited by other bottlenecks nowadays. The difference in performance is only going to be measurable in extremely specialised situations, and if that's you, you'd already know about it. If you're experiencing slow sorting, in almost all cases it'll be an issue with your indexes/query plan. Changing your collation function should not be high on the list of things to troubleshoot.

When I originally wrote this answer (over 4 years ago) I said that if you wanted, you could use utf8_general_ci most of the time, and only use utf8_unicode_ci when sorting was going to be important enough to justify the performance cost. However, the performance cost is no longer really relevant (and it may not have been back then, either). It's more important to sort properly in whichever language your users are using.

One other thing I'll add is that even if you know your application only supports the English language, it may still need to deal with people's names, which can often contain characters used in other languages in which it is just as important to sort correctly. Using the Unicode rules for everything helps add peace of mind that the very smart Unicode people have worked very hard to make sorting work properly.

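The difference lies in how characters are compared.

Take the example from the MySQL documentation:

With utf8_general_ci, ß = s is true.

With utf8_unicode_ci, however, it is ß = ss that is true.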
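The two comparisons above (ß = s under general, ß = ss under unicode) can be mimicked with a small Python sketch. This is a toy model under stated assumptions, not MySQL's actual algorithm: the tables `GENERAL_WEIGHTS` and `UNICODE_EXPANSIONS` are hypothetical and cover only `ß`. The point is that a scheme with exactly one weight per character cannot express ß = ss, while a scheme that allows expansions can.

```python
# Toy model of the two comparison styles. Both tables are illustrative
# assumptions and are not MySQL's real weight tables.

# "general" style: every character maps to exactly one weight.
GENERAL_WEIGHTS = {"ß": "s"}

# "unicode" style: one character may expand into several weights.
UNICODE_EXPANSIONS = {"ß": "ss"}


def general_key(text):
    """Build a key with one weight per character (case folded via lower())."""
    return [GENERAL_WEIGHTS.get(ch, ch) for ch in text.lower()]


def unicode_key(text):
    """Build a key where a character may contribute several weights."""
    key = []
    for ch in text.lower():
        key.extend(UNICODE_EXPANSIONS.get(ch, ch))
    return key


def equal_general(a, b):
    return general_key(a) == general_key(b)


def equal_unicode(a, b):
    return unicode_key(a) == unicode_key(b)
```

With these definitions, `equal_general("ß", "s")` and `equal_unicode("ß", "ss")` both hold, while `equal_general("ß", "ss")` and `equal_unicode("ß", "s")` do not.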

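In practice the two differ mainly for German and French, so for Chinese users general is the usual choice, because it is faster.

Use unicode only if you have stricter requirements for comparing German and French text. It is more accurate than general, in the sense that comparison and sorting follow German and French conventions more closely.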

See this document: http://dev.mysql.com/doc/refman/5.0/e...

utf8_bin is also commonly used. Neither unicode nor general is case-sensitive when comparing characters, so if you need case-sensitive comparison, use bin.
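The contrast between a case-insensitive collation and a binary one can be sketched in Python as follows. This is only an analogy under stated assumptions: `equal_ci` and `equal_bin` are hypothetical helpers, Python's `casefold()` stands in for the case folding of a `_ci` collation, and comparing UTF-8 bytes stands in for a binary collation.

```python
def equal_ci(a, b):
    """Case-insensitive comparison, standing in for a *_ci collation."""
    return a.casefold() == b.casefold()


def equal_bin(a, b):
    """Byte-wise comparison of the UTF-8 encodings, standing in for bin."""
    return a.encode("utf-8") == b.encode("utf-8")


# "MySQL" and "mysql" match when case is ignored but differ byte for byte.
ci_match = equal_ci("MySQL", "mysql")
bin_match = equal_bin("MySQL", "mysql")
```

Here `ci_match` is true and `bin_match` is false, which is why a binary collation is the one to pick when case must be preserved in comparisons.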
