GA

2016/09/13

MySQL 8.0.0現在の文字コードについて

Planning the defaults for MySQL 5.8 | MySQL Server Blog の時点で

In addition to utf8mb4, we are also considering switching the default collation to be utf8mb4_unicode_520_ci.

と地雷宣言が為されていた文字コード問題。
(utf8mb4_unicode_520_ci は🍣と🍺を区別するけど、ハハとパパを区別してくれないヤーツ)

|                    | utf8mb4_bin | utf8mb4_general_ci | utf8mb4_unicode_ci | utf8mb4_unicode_520_ci|
|--------------------|-------------|--------------------|--------------------|-----------------------|
| Hiragana-Katakana  | cs (unkind) | cs (unkind)        | ci (good)          | ci(good)              |
| Youon              | cs (good)   | cs (good)          | ci (critical)      | ci(critical)          |
| Dakuten-Handakuten | cs (good)   | cs (good)          | ci (critical)      | ci(critical)          |
| Wide-Narrow        | cs (unkind) | cs (unkind)        | ci (good)          | ci(good)              |
| Sushi-Beer         | cs          | ci                 | ci                 | cs                    |

MySQL Bugs: #79977: utf8mb4_unicode_520_ci don't make sense for Japanese FTS


かみぽさんが「せっかくActiveRecordで直したのに今度はMySQLでデフォルトになるん…」みたいなことを言っていたやつ。

取り敢えず手元の8.0.0現在では、

$ bin/mysqld --no-defaults --help --verbose | egrep '(character-set|collation)-server'
mysqld: Can't change dir to '/usr/local/mysql/data/' (Errcode: 2 - No such file or directory)
2016-09-05T04:05:23.602489Z 0 [Warning] Changed limits: max_open_files: 1024 (requested 5000)
2016-09-05T04:05:23.602584Z 0 [Warning] Changed limits: table_open_cache: 431 (requested 2000)
2016-09-05T04:05:23.602946Z 0 [ERROR] Can't find error-message file '/usr/local/mysql/share/errmsg.sys'. Check error-message file location and 'lc-messages-dir' configuration directive.
  -C, --character-set-server=name
  --collation-server=name
character-set-server                                         latin1
collation-server                                             latin1_swedish_ci

文字コード/照合順序は5.7と同じくlatin1/latin1_swedish_ci。
その一方で、


mysql80> SHOW COLLATION LIKE 'utf8mb4%';
+----------------------------+---------+-----+---------+----------+---------+
| Collation                  | Charset | Id  | Default | Compiled | Sortlen |
+----------------------------+---------+-----+---------+----------+---------+
| utf8mb4_0900_ai_ci         | utf8mb4 | 255 |         | Yes      |       8 |
| utf8mb4_bin                | utf8mb4 |  46 |         | Yes      |       1 |
| utf8mb4_croatian_ci        | utf8mb4 | 245 |         | Yes      |       8 |
| utf8mb4_cs_0900_ai_ci      | utf8mb4 | 266 |         | Yes      |       8 |
| utf8mb4_czech_ci           | utf8mb4 | 234 |         | Yes      |       8 |
| utf8mb4_danish_ci          | utf8mb4 | 235 |         | Yes      |       8 |
| utf8mb4_da_0900_ai_ci      | utf8mb4 | 267 |         | Yes      |       8 |
| utf8mb4_de_pb_0900_ai_ci   | utf8mb4 | 256 |         | Yes      |       8 |
| utf8mb4_eo_0900_ai_ci      | utf8mb4 | 273 |         | Yes      |       8 |
| utf8mb4_esperanto_ci       | utf8mb4 | 241 |         | Yes      |       8 |
| utf8mb4_estonian_ci        | utf8mb4 | 230 |         | Yes      |       8 |
| utf8mb4_es_0900_ai_ci      | utf8mb4 | 263 |         | Yes      |       8 |
| utf8mb4_es_trad_0900_ai_ci | utf8mb4 | 270 |         | Yes      |       8 |
| utf8mb4_et_0900_ai_ci      | utf8mb4 | 262 |         | Yes      |       8 |
| utf8mb4_general_ci         | utf8mb4 |  45 | Yes     | Yes      |       1 |
| utf8mb4_german2_ci         | utf8mb4 | 244 |         | Yes      |       8 |
| utf8mb4_hr_0900_ai_ci      | utf8mb4 | 275 |         | Yes      |       8 |
| utf8mb4_hungarian_ci       | utf8mb4 | 242 |         | Yes      |       8 |
| utf8mb4_hu_0900_ai_ci      | utf8mb4 | 274 |         | Yes      |       8 |
| utf8mb4_icelandic_ci       | utf8mb4 | 225 |         | Yes      |       8 |
| utf8mb4_is_0900_ai_ci      | utf8mb4 | 257 |         | Yes      |       8 |
| utf8mb4_latvian_ci         | utf8mb4 | 226 |         | Yes      |       8 |
| utf8mb4_la_0900_ai_ci      | utf8mb4 | 271 |         | Yes      |       8 |
| utf8mb4_lithuanian_ci      | utf8mb4 | 236 |         | Yes      |       8 |
| utf8mb4_lt_0900_ai_ci      | utf8mb4 | 268 |         | Yes      |       8 |
| utf8mb4_lv_0900_ai_ci      | utf8mb4 | 258 |         | Yes      |       8 |
| utf8mb4_persian_ci         | utf8mb4 | 240 |         | Yes      |       8 |
| utf8mb4_pl_0900_ai_ci      | utf8mb4 | 261 |         | Yes      |       8 |
| utf8mb4_polish_ci          | utf8mb4 | 229 |         | Yes      |       8 |
| utf8mb4_romanian_ci        | utf8mb4 | 227 |         | Yes      |       8 |
| utf8mb4_roman_ci           | utf8mb4 | 239 |         | Yes      |       8 |
| utf8mb4_ro_0900_ai_ci      | utf8mb4 | 259 |         | Yes      |       8 |
| utf8mb4_sinhala_ci         | utf8mb4 | 243 |         | Yes      |       8 |
| utf8mb4_sk_0900_ai_ci      | utf8mb4 | 269 |         | Yes      |       8 |
| utf8mb4_slovak_ci          | utf8mb4 | 237 |         | Yes      |       8 |
| utf8mb4_slovenian_ci       | utf8mb4 | 228 |         | Yes      |       8 |
| utf8mb4_sl_0900_ai_ci      | utf8mb4 | 260 |         | Yes      |       8 |
| utf8mb4_spanish2_ci        | utf8mb4 | 238 |         | Yes      |       8 |
| utf8mb4_spanish_ci         | utf8mb4 | 231 |         | Yes      |       8 |
| utf8mb4_sv_0900_ai_ci      | utf8mb4 | 264 |         | Yes      |       8 |
| utf8mb4_swedish_ci         | utf8mb4 | 232 |         | Yes      |       8 |
| utf8mb4_tr_0900_ai_ci      | utf8mb4 | 265 |         | Yes      |       8 |
| utf8mb4_turkish_ci         | utf8mb4 | 233 |         | Yes      |       8 |
| utf8mb4_unicode_520_ci     | utf8mb4 | 246 |         | Yes      |       8 |
| utf8mb4_unicode_ci         | utf8mb4 | 224 |         | Yes      |       8 |
| utf8mb4_vietnamese_ci      | utf8mb4 | 247 |         | Yes      |       8 |
| utf8mb4_vi_0900_ai_ci      | utf8mb4 | 277 |         | Yes      |       8 |
+----------------------------+---------+-----+---------+----------+---------+
47 rows in set (0.00 sec)

utf8mb4の照合順序は結構増えてる。0900_ai_ciはUnicode 9.0.0ベースのAccentInsensitiveだろうか。どうも各言語ごとに0900_ai_ciが生えるような形をしてる。

MySQL :: WL#9108: Add language specific case insensitive collations of utf8mb4


「ai_ciが追加されるからunicode_ciはAccentSensitiveでいいよね」って流れになってもらわないと、このままutf8mb4_unicode_ciがデフォルトになる ( Planning the defaults for MySQL 5.8 | MySQL Server Blog ) のはつらい。WL#9108 見ても、japanese_ciを作る予定はなさそうだし(なさそうだしそもそも、罠いのが **暗黙のデフォルトになる** のが嫌なのだ俺は。。)

mysql80> SELECT 'ハハ' = 'パパ' COLLATE utf8mb4_unicode_ci;
+------------------------------------------------+
| 'ハハ' = 'パパ' COLLATE utf8mb4_unicode_ci     |
+------------------------------------------------+
|                                              1 |
+------------------------------------------------+
1 row in set (0.00 sec)

mysql80> SELECT 'ハハ' = 'パパ' COLLATE utf8mb4_general_ci;
+------------------------------------------------+
| 'ハハ' = 'パパ' COLLATE utf8mb4_general_ci     |
+------------------------------------------------+
|                                              0 |
+------------------------------------------------+
1 row in set (0.00 sec)

mysql80> SELECT 'ハハ' = 'パパ' COLLATE utf8mb4_unicode_520_ci;
+----------------------------------------------------+
| 'ハハ' = 'パパ' COLLATE utf8mb4_unicode_520_ci     |
+----------------------------------------------------+
|                                                  1 |
+----------------------------------------------------+
1 row in set (0.00 sec)



今後どうなるのか(特に 暗黙のデフォルトが 何になるのか)注目しておいた方がよさげ。

0 件のコメント :

コメントを投稿