8 general ci 和 utf8 unicode ci 的区别是什么

Between utf8_general_ci and utf8_unicode_ci, are there any differences in terms of performance?

转载于:https://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci

These two collations are both for the UTF-8 character encoding. The differences are in how text is sorted and compared.

Note: Since MySQL 5.5.3 you should use utf8mb4 rather than utf8. They both refer to the UTF-8 encoding, but the older utf8 had a MySQL-specific limitation preventing use of characters numbered above 0xFFFD.

Accuracy
utf8mb4_unicode_ci is based on the Unicode standard for sorting and comparison, which sorts accurately in a very wide range of languages.
utf8mb4_general_ci fails to implement all of the Unicode sorting rules, which will result in undesirable sorting in some situations, such as when using particular languages or characters.
Performance
utf8mb4_general_ci is faster at comparisons and sorting, because it takes a bunch of performance-related shortcuts.
On modern servers, this performance boost will be all but negligible. It was devised in a time when servers had a tiny fraction of the CPU performance of today's computers.
utf8mb4_unicode_ci, which uses the Unicode rules for sorting and comparison, employs a fairly complex algorithm for correct sorting in a wide range of languages and when using a wide range of special characters. These rules need to take into account language-specific conventions; not everybody sorts their characters in what we would call 'alphabetical order'.

As far as Latin (ie "European") languages go, there is not much difference between the Unicode sorting and the simplified utf8mb4_general_ci sorting in MySQL, but there are still a few differences:

For examples, the Unicode collation sorts "ß" like "ss", and "Œ" like "OE" as people using those characters would normally want, whereas utf8mb4_general_ci sorts them as single characters (presumably like "s" and "e" respectively).
Some Unicode characters are defined as ignorable, which means they shouldn't count toward the sort order and the comparison should move on to the next character instead. utf8mb4_unicode_ci handles these properly.

In non-latin languages, such as Asian languages or languages with different alphabets, there may be a lot more differences between Unicode sorting and the simplified utf8mb4_general_ci sorting. The suitability of utf8mb4_general_ci will depend heavily on the language used. For some languages, it'll be quite inadequate.

What should you use?

There is almost certainly no reason to use utf8mb4_general_ci anymore, as we have left behind the point where CPU speed is low enough that the performance difference would be important. Your database will almost certainly be limited by other bottlenecks than this.

The difference in performance is only going to be measurable in extremely specialised situations, and if that's you, you probably already know about it. If you're experiencing slow sorting, in almost all cases it'll be an issue with your indexes/query plan. Changing your collation function should not be high on the list of things to troubleshoot.

In the past, some people recommended to use utf8mb4_general_ci except when accurate sorting was going to be important enough to justify the performance cost. Today, that performance cost has all but disappeared, and developers are treating internationalization more seriously.

One other thing I'll add is that even if you know your application only supports the English language, it may still need to deal with people's names, which can often contain characters used in other languages in which it is just as important to sort correctly. Using the Unicode rules for everything helps add peace of mind that the very smart Unicode people have worked very hard to make sorting work properly.

See the mysql manual, Unicode Character Sets section:

For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

So to summarize, utf_general_ci uses a smaller and less correct (according to the standard) set of comparisons than utf_unicode_ci which should implement the entire standard. The general_ci set will be faster because there is less computation to do.

This post describes it very nicely.

In short: utf8_unicode_ci uses the Unicode Collation Algorithm as defined in the Unicode standards, whereas utf8_general_ci is a more simple sort order which results in "less accurate" sorting results.

I wanted to know what is the performance difference between using utf8_general_ci and utf8_unicode_ci, but I did not find any benchmarks listed on the Internet, so I decided to create benchmarks myself.

I created a very simple table with 500000 rows:

CREATE TABLE test(
  ID INT(11) DEFAULT NULL,
  Description VARCHAR(20) DEFAULT NULL
)
ENGINE = INNODB
CHARACTER SET utf8
COLLATE utf8_general_ci;

Then I filled it with random data by running this stored procedure:

CREATE PROCEDURE randomizer()
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE random CHAR(20) ;

  theloop: loop
    SET random = CONV(FLOOR(RAND() * 99999999999999), 20, 36);

    INSERT INTO test VALUES (i+1, random);

    SET i=i+1;

    IF i = 500000 THEN
      LEAVE theloop;
    END IF;

  END LOOP theloop;
END

Then I created the following stored procedures to benchmark simple SELECT, SELECT with LIKE, and sorting (SELECT with ORDER BY):

CREATE benchmark_simple_select()
BEGIN
  DECLARE i INT DEFAULT 0;

  theloop: loop

    SELECT * FROM test WHERE Description = 'test' COLLATE utf8_general_ci;

    SET i = i + 1;

    IF i = 30 THEN
      LEAVE theloop;
      END IF;

  END LOOP theloop;

END

CREATE PROCEDURE benchmark_select_like()
BEGIN
  DECLARE i INT DEFAULT 0;

  theloop: loop

    SELECT * FROM test WHERE Description LIKE '%test' COLLATE utf8_general_ci;

    SET i = i + 1;

    IF i = 30 THEN
      LEAVE theloop;
      END IF;

  END LOOP theloop;

END

CREATE PROCEDURE benchmark_order_by()
BEGIN
  DECLARE i INT DEFAULT 0;

  theloop: loop

    SELECT * FROM test WHERE ID > FLOOR(1 + RAND() * (400000 - 1)) ORDER BY Description COLLATE utf8_general_ci LIMIT 1000;

    SET i = i + 1;

    IF i = 10 THEN
      LEAVE theloop;
      END IF;

  END LOOP theloop;

END

In the stored procedures above utf8_general_ci collation is used, but of course during the tests I used both utf8_general_ci and utf8_unicode_ci.

I called each stored procedure 5 times for each collation (5 times for utf8_general_ci and 5 times for utf8_unicode_ci) and then calculated the average values.

My results are:

benchmark_simple_select() with utf8_general_ci: 9957 ms
benchmark_simple_select() with utf8_unicode_ci: 10271 ms
In this benchmark using utf8_unicode_ci is slower than utf8_general_ci by 3.2%.

benchmark_select_like() with utf8_general_ci: 11441 ms
benchmark_select_like() with utf8_unicode_ci: 12811 ms
In this benchmark using utf8_unicode_ci is slower than utf8_general_ci by 12%.

benchmark_order_by() with utf8_general_ci: 11944 ms
benchmark_order_by() with utf8_unicode_ci: 12887 ms
In this benchmark using utf8_unicode_ci is slower than utf8_general_ci by 7.9%.

In brief words:

If you need better sorting order - use utf8_unicode_ci (this is the preferred method),

but if you utterly interested in performance - use utf8_general_ci, but know that it is a little outdated.

The differences in terms of performance are very slight.