In addition to utf8mb4, we are also considering switching the default collation to be utf8mb4_unicode_520_ci.
と地雷宣言が為されていた文字コード問題。
(utf8mb4_unicode_520_ci は🍣と🍺を区別するけど、ハハとパパを区別してくれないヤーツ)
| | utf8mb4_bin | utf8mb4_general_ci | utf8mb4_unicode_ci | utf8mb4_unicode_520_ci| |--------------------|-------------|--------------------|--------------------|-----------------------| | Hiragana-Katakana | cs (unkind) | cs (unkind) | ci (good) | ci(good) | | Youon | cs (good) | cs (good) | ci (critical) | ci(critical) | | Dakuten-Handakuten | cs (good) | cs (good) | ci (critical) | ci(critical) | | Wide-Narrow | cs (unkind) | cs (unkind) | ci (good) | ci(good) | | Sushi-Beer | cs | ci | ci | cs |
MySQL Bugs: #79977: utf8mb4_unicode_520_ci don't make sense for Japanese FTS
かみぽさんが「せっかくActiveRecordで直したのに今度はMySQLでデフォルトになるん…」みたいなことを言っていたやつ。
取り敢えず手元の8.0.0現在では、
$ bin/mysqld --no-defaults --help --verbose | egrep '(character-set|collation)-server' mysqld: Can't change dir to '/usr/local/mysql/data/' (Errcode: 2 - No such file or directory) 2016-09-05T04:05:23.602489Z 0 [Warning] Changed limits: max_open_files: 1024 (requested 5000) 2016-09-05T04:05:23.602584Z 0 [Warning] Changed limits: table_open_cache: 431 (requested 2000) 2016-09-05T04:05:23.602946Z 0 [ERROR] Can't find error-message file '/usr/local/mysql/share/errmsg.sys'. Check error-message file location and 'lc-messages-dir' configuration directive. -C, --character-set-server=name --collation-server=name character-set-server latin1 collation-server latin1_swedish_ci
文字コード/照合順序は5.7と同じくlatin1/latin1_swedish_ci。
その一方で、
mysql80> SHOW COLLATION LIKE 'utf8mb4%'; +----------------------------+---------+-----+---------+----------+---------+ | Collation | Charset | Id | Default | Compiled | Sortlen | +----------------------------+---------+-----+---------+----------+---------+ | utf8mb4_0900_ai_ci | utf8mb4 | 255 | | Yes | 8 | | utf8mb4_bin | utf8mb4 | 46 | | Yes | 1 | | utf8mb4_croatian_ci | utf8mb4 | 245 | | Yes | 8 | | utf8mb4_cs_0900_ai_ci | utf8mb4 | 266 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | utf8mb4_danish_ci | utf8mb4 | 235 | | Yes | 8 | | utf8mb4_da_0900_ai_ci | utf8mb4 | 267 | | Yes | 8 | | utf8mb4_de_pb_0900_ai_ci | utf8mb4 | 256 | | Yes | 8 | | utf8mb4_eo_0900_ai_ci | utf8mb4 | 273 | | Yes | 8 | | utf8mb4_esperanto_ci | utf8mb4 | 241 | | Yes | 8 | | utf8mb4_estonian_ci | utf8mb4 | 230 | | Yes | 8 | | utf8mb4_es_0900_ai_ci | utf8mb4 | 263 | | Yes | 8 | | utf8mb4_es_trad_0900_ai_ci | utf8mb4 | 270 | | Yes | 8 | | utf8mb4_et_0900_ai_ci | utf8mb4 | 262 | | Yes | 8 | | utf8mb4_general_ci | utf8mb4 | 45 | Yes | Yes | 1 | | utf8mb4_german2_ci | utf8mb4 | 244 | | Yes | 8 | | utf8mb4_hr_0900_ai_ci | utf8mb4 | 275 | | Yes | 8 | | utf8mb4_hungarian_ci | utf8mb4 | 242 | | Yes | 8 | | utf8mb4_hu_0900_ai_ci | utf8mb4 | 274 | | Yes | 8 | | utf8mb4_icelandic_ci | utf8mb4 | 225 | | Yes | 8 | | utf8mb4_is_0900_ai_ci | utf8mb4 | 257 | | Yes | 8 | | utf8mb4_latvian_ci | utf8mb4 | 226 | | Yes | 8 | | utf8mb4_la_0900_ai_ci | utf8mb4 | 271 | | Yes | 8 | | utf8mb4_lithuanian_ci | utf8mb4 | 236 | | Yes | 8 | | utf8mb4_lt_0900_ai_ci | utf8mb4 | 268 | | Yes | 8 | | utf8mb4_lv_0900_ai_ci | utf8mb4 | 258 | | Yes | 8 | | utf8mb4_persian_ci | utf8mb4 | 240 | | Yes | 8 | | utf8mb4_pl_0900_ai_ci | utf8mb4 | 261 | | Yes | 8 | | utf8mb4_polish_ci | utf8mb4 | 229 | | Yes | 8 | | utf8mb4_romanian_ci | utf8mb4 | 227 | | Yes | 8 | | utf8mb4_roman_ci | utf8mb4 | 239 | | Yes | 8 | | utf8mb4_ro_0900_ai_ci | utf8mb4 | 259 | | Yes | 8 | | utf8mb4_sinhala_ci | utf8mb4 | 243 | | Yes | 8 | | utf8mb4_sk_0900_ai_ci | utf8mb4 | 269 | | Yes | 8 | | utf8mb4_slovak_ci | utf8mb4 | 237 | | Yes | 8 | | utf8mb4_slovenian_ci | utf8mb4 | 228 | | Yes | 8 | | utf8mb4_sl_0900_ai_ci | utf8mb4 | 260 | | Yes | 8 | | utf8mb4_spanish2_ci | utf8mb4 | 238 | | Yes | 8 | | utf8mb4_spanish_ci | utf8mb4 | 231 | | Yes | 8 | | utf8mb4_sv_0900_ai_ci | utf8mb4 | 264 | | Yes | 8 | | utf8mb4_swedish_ci | utf8mb4 | 232 | | Yes | 8 | | utf8mb4_tr_0900_ai_ci | utf8mb4 | 265 | | Yes | 8 | | utf8mb4_turkish_ci | utf8mb4 | 233 | | Yes | 8 | | utf8mb4_unicode_520_ci | utf8mb4 | 246 | | Yes | 8 | | utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 8 | | utf8mb4_vietnamese_ci | utf8mb4 | 247 | | Yes | 8 | | utf8mb4_vi_0900_ai_ci | utf8mb4 | 277 | | Yes | 8 | +----------------------------+---------+-----+---------+----------+---------+ 47 rows in set (0.00 sec)
utf8mb4の照合順序は結構増えてる。0900_ai_ciはUnicode 9.0.0ベースのAccentInsensitiveだろうか。どうも各言語ごとに0900_ai_ciが生えるような形をしてる。
MySQL :: WL#9108: Add language specific case insensitive collations of utf8mb4
「ai_ciが追加されるからunicode_ciはAccentSensitiveでいいよね」って流れになってもらわないと、このままutf8mb4_unicode_ciがデフォルトになる ( Planning the defaults for MySQL 5.8 | MySQL Server Blog ) のはつらい。WL#9108 見ても、japanese_ciを作る予定はなさそうだし(なさそうだしそもそも、罠いのが **暗黙のデフォルトになる** のが嫌なのだ俺は。。)
mysql80> SELECT 'ハハ' = 'パパ' COLLATE utf8mb4_unicode_ci; +------------------------------------------------+ | 'ハハ' = 'パパ' COLLATE utf8mb4_unicode_ci | +------------------------------------------------+ | 1 | +------------------------------------------------+ 1 row in set (0.00 sec) mysql80> SELECT 'ハハ' = 'パパ' COLLATE utf8mb4_general_ci; +------------------------------------------------+ | 'ハハ' = 'パパ' COLLATE utf8mb4_general_ci | +------------------------------------------------+ | 0 | +------------------------------------------------+ 1 row in set (0.00 sec) mysql80> SELECT 'ハハ' = 'パパ' COLLATE utf8mb4_unicode_520_ci; +----------------------------------------------------+ | 'ハハ' = 'パパ' COLLATE utf8mb4_unicode_520_ci | +----------------------------------------------------+ | 1 | +----------------------------------------------------+ 1 row in set (0.00 sec)
今後どうなるのか(特に 暗黙のデフォルトが 何になるのか)注目しておいた方がよさげ。
0 件のコメント :
コメントを投稿