2015年3月10日火曜日

MySQL 5.7.6のInnoDB日本語全文検索 ngram

日々の覚書: プレビュー "MySQL 5.7.6のInnoDB日本語全文検索 MeCab Plugin" に引き続き、今度はngramなトークナイザー。ドキュメントは ここ

MeCab Pluginと違って特にINSTALL PLUGINとかせずに即使える。


mysql> ALTER TABLE articles ADD FULLTEXT KEY (title, content) WITH PARSER ngram;
Query OK, 0 rows affected (6 hours 25 min 20.67 sec)
Records: 0  Duplicates: 0  Warnings: 0

( д ) ゚ ゚ 6時間半!? しょうがないとはいえ、MeCabのときは1時間半でお釣りが来てたのにと思うとちょっとしょんぼり。

時間もさることながら、このやり方(データをインポートしてからALTER TABLEでインデックス追加)だと容量がかなり持っていかれる。


[root@v157-7-235-254 mysql]# while true ; do
> date ; df -h
> echo ""
> sleep 60
> done
Tue Mar 10 10:13:41 JST 2015
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
                       97G   16G   76G  18% /
tmpfs                 499M     0  499M   0% /dev/shm
/dev/vda1             485M   32M  428M   7% /boot
..
Tue Mar 10 16:12:45 JST 2015
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
                       97G   74G   18G  81% /
tmpfs                 499M     0  499M   0% /dev/shm
/dev/vda1             485M   32M  428M   7% /boot
..
Tue Mar 10 16:44:51 JST 2015
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
                       97G   26G   66G  28% /
tmpfs                 499M     0  499M   0% /dev/shm
/dev/vda1             485M   32M  428M   7% /boot


# lsof -p `pidof mysqld`
COMMAND  PID  USER   FD   TYPE             DEVICE    SIZE/OFF    NODE NAME
mysqld  3049 mysql  cwd    DIR              253,0        4096  919114 /usr/local/mysql/data
mysqld  3049 mysql  rtd    DIR              253,0        4096       2 /
mysqld  3049 mysql  txt    REG              253,0   193624979  919090 /usr/local/mysql/bin/mysqld
mysqld  3049 mysql  mem    REG              253,0        5624  914553 /lib64/libaio.so.1.0.1
mysqld  3049 mysql  DEL    REG              253,0              914859 /lib64/libfreebl3.so
mysqld  3049 mysql  mem    REG              253,0    49199027 2231548 /usr/local/mysql/lib/mecab/dic/ipadic_utf-8/sys.dic
mysqld  3049 mysql  mem    REG              253,0     3463716 2231541 /usr/local/mysql/lib/mecab/dic/ipadic_utf-8/matrix.bin
mysqld  3049 mysql  mem    REG              253,0      262496 2231546 /usr/local/mysql/lib/mecab/dic/ipadic_utf-8/char.bin
mysqld  3049 mysql  mem    REG              253,0     3988451 2231491 /usr/local/mysql/lib/plugin/libpluginmecab.so
mysqld  3049 mysql  mem    REG              253,0       65928  914006 /lib64/libnss_files-2.12.so
mysqld  3049 mysql  DEL    REG              253,0              913936 /lib64/libc-2.12.so
mysqld  3049 mysql  DEL    REG              253,0              915115 /lib64/libgcc_s-4.4.7-20120601.so.1.#prelink#.JjQM5q
mysqld  3049 mysql  DEL    REG              253,0              913990 /lib64/libm-2.12.so
mysqld  3049 mysql  DEL    REG              253,0              263139 /usr/lib64/libstdc++.so.6.0.13.#prelink#.htv5XJ
mysqld  3049 mysql  DEL    REG              253,0              914081 /lib64/librt-2.12.so
mysqld  3049 mysql  DEL    REG              253,0              913976 /lib64/libdl-2.12.so
mysqld  3049 mysql  DEL    REG              253,0              913940 /lib64/libcrypt-2.12.so.#prelink#.3QGWZ8
mysqld  3049 mysql  DEL    REG              253,0              913960 /lib64/libpthread-2.12.so.#prelink#.xdy3QH
mysqld  3049 mysql  DEL    REG              253,0              913930 /lib64/ld-2.12.so
mysqld  3049 mysql  mem    REG              253,0        5690 2231545 /usr/local/mysql/lib/mecab/dic/ipadic_utf-8/unk.dic
mysqld  3049 mysql    0r   CHR                1,3         0t0    3788 /dev/null
mysqld  3049 mysql    1w   REG              253,0       15868  919127 /usr/local/mysql/data/error.log
mysqld  3049 mysql    2w   REG              253,0       15868  919127 /usr/local/mysql/data/error.log
mysqld  3049 mysql    3uW  REG              253,0    79691776  919158 /usr/local/mysql/data/ibdata1
..
mysqld  3049 mysql   45uW  REG              253,0 11286872064 1305615 /usr/local/mysql/data/wikipedia/articles.ibd
mysqld  3049 mysql   46uW  REG              253,0       98304 1305604 /usr/local/mysql/data/wikipedia/FTS_00000000000000f7_000000000000013a_INDEX_1.ibd
mysqld  3049 mysql   47uW  REG              253,0       98304 1305607 /usr/local/mysql/data/wikipedia/FTS_00000000000000f7_000000000000013a_INDEX_2.ibd
mysqld  3049 mysql   48uW  REG              253,0       98304 1305611 /usr/local/mysql/data/wikipedia/FTS_00000000000000f7_000000000000013a_INDEX_3.ibd
mysqld  3049 mysql   49uW  REG              253,0       98304 1305613 /usr/local/mysql/data/wikipedia/FTS_00000000000000f7_000000000000013a_INDEX_4.ibd
mysqld  3049 mysql   50uW  REG              253,0       98304 1305614 /usr/local/mysql/data/wikipedia/FTS_00000000000000f7_000000000000013a_INDEX_5.ibd
mysqld  3049 mysql   51uW  REG              253,0       98304 1305616 /usr/local/mysql/data/wikipedia/FTS_00000000000000f7_000000000000013a_INDEX_6.ibd
mysqld  3049 mysql   52uW  REG              253,0       98304 1305603 /usr/local/mysql/data/wikipedia/FTS_00000000000000f7_DELETED.ibd
mysqld  3049 mysql   53uW  REG              253,0       98304 1305605 /usr/local/mysql/data/wikipedia/FTS_00000000000000f7_DELETED_CACHE.ibd
mysqld  3049 mysql   54uW  REG              253,0       98304 1305606 /usr/local/mysql/data/wikipedia/FTS_00000000000000f7_BEING_DELETED.ibd
mysqld  3049 mysql   55uW  REG              253,0       98304 1305610 /usr/local/mysql/data/wikipedia/FTS_00000000000000f7_BEING_DELETED_CACHE.ibd
mysqld  3049 mysql   56uW  REG              253,0       98304 1305612 /usr/local/mysql/data/wikipedia/FTS_00000000000000f7_CONFIG.ibd
mysqld  3049 mysql   58u   REG              253,0 10842275840 1046745 /data/tmp/ibDDeVbj (deleted)
mysqld  3049 mysql   59u   REG              253,0   696254464 1046746 /data/tmp/ibaNLTUS (deleted)
mysqld  3049 mysql   60u   REG              253,0  1790967808 1046747 /data/tmp/ibNm7Vmi (deleted)
mysqld  3049 mysql   61u   REG              253,0  1623195648 1046748 /data/tmp/ibl17OJP (deleted)
mysqld  3049 mysql   62u   REG              253,0  8006926336 1046749 /data/tmp/ibBoSvqQ (deleted)
mysqld  3049 mysql   63u   REG              253,0  8006926336 1046750 /data/tmp/ibudnROb (deleted)
mysqld  3049 mysql   64u   REG              253,0 10904141824 1046751 /data/tmp/ibNbFRxL (deleted)
mysqld  3049 mysql   65u   REG              253,0   698351616 1046752 /data/tmp/ibYchSgl (deleted)
mysqld  3049 mysql   66u   REG              253,0  1802502144 1046753 /data/tmp/ibYYJGHW (deleted)
mysqld  3049 mysql   67u   REG              253,0  1631584256 1046754 /data/tmp/ibQOsAMA (deleted)
mysqld  3049 mysql   68u   REG              253,0  8050966528 1046755 /data/tmp/ibiSAO6i (deleted)
mysqld  3049 mysql   69u   REG              253,0  8050966528 1046756 /data/tmp/ibGrGYaE (deleted)
mysqld  3049 mysql   70u   REG              253,0   361758720 1046757 /data/tmp/ibQvfudp (deleted)
mysqld  3049 mysql   71u   REG              253,0   359661568 1046758 /data/tmp/ibN0BTAJ (deleted)

最終的に元に戻るとはいえ、tmpdirに数GB単位のテンポラリーファイルをいくつもつくられてしかもそれが途中で解放されないのは結構つらい。。


まあそこはそれとして


mysql> CREATE TABLE t1 (val varchar(32));
Query OK, 0 rows affected (0.26 sec)

mysql> INSERT INTO t1 VALUES ('にっこにっこにー');
Query OK, 1 row affected (0.08 sec)

mysql> SELECT * FROM t1;
+--------------------------+
| val                      |
+--------------------------+
| にっこにっこにー         |
+--------------------------+
1 row in set (0.00 sec)

こんなサンプルがあるじゃろ?


mysql> SELECT @@ngram_token_size;
+--------------------+
| @@ngram_token_size |
+--------------------+
|                  2 |
+--------------------+
1 row in set (0.00 sec)

mysql> ALTER TABLE t1 ADD FULLTEXT KEY idx_2gram(val) WITH PARSER ngram;
Query OK, 0 rows affected (0.44 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> SET GLOBAL innodb_ft_aux_table = 'd1/t1';
Query OK, 0 rows affected (0.00 sec)

mysql> OPTIMIZE TABLE t1;
+-------+----------+----------+-------------------------------------------------------------------+
| Table | Op       | Msg_type | Msg_text                                                          |
+-------+----------+----------+-------------------------------------------------------------------+
| d1.t1 | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
| d1.t1 | optimize | status   | OK                                                                |
+-------+----------+----------+-------------------------------------------------------------------+
2 rows in set (0.36 sec)

mysql> SELECT * FROM information_schema.innodb_ft_index_cache ORDER BY position;
+--------+--------------+-------------+-----------+--------+----------+
| WORD   | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+--------+--------------+-------------+-----------+--------+----------+
| にっ   |            2 |           2 |         1 |      2 |        0 |
| っこ   |            2 |           2 |         1 |      2 |        3 |
| こに   |            2 |           2 |         1 |      2 |        6 |
| っこ   |            2 |           2 |         1 |      2 |        9 |
| にっ   |            2 |           2 |         1 |      2 |        9 |
| こに   |            2 |           2 |         1 |      2 |        9 |
| にー   |            2 |           2 |         1 |      2 |       18 |
+--------+--------------+-------------+-----------+--------+----------+
7 rows in set (0.02 sec)

mysql> SELECT * FROM information_schema.innodb_ft_config;
+---------------------------+-------+
| KEY                       | VALUE |
+---------------------------+-------+
| optimize_checkpoint_limit | 180   |
| synced_doc_id             | 0     |
| stopword_table_name       |       |
| use_stopword              | 1     |
+---------------------------+-------+
4 rows in set (0.03 sec)

ngram_token_size変数でトークンのサイズを制御できる。この変数はオンライン変更できないので、my.cnfに設定してからmysqld再起動。


mysql> SET GLOBAL innodb_ft_aux_table = 'd1/t1';
Query OK, 0 rows affected (0.00 sec)

mysql> OPTIMIZE TABLE t1;
+-------+----------+----------+-------------------------------------------------------------------+
| Table | Op       | Msg_type | Msg_text                                                          |
+-------+----------+----------+-------------------------------------------------------------------+
| d1.t1 | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
| d1.t1 | optimize | status   | OK                                                                |
+-------+----------+----------+-------------------------------------------------------------------+
2 rows in set (0.31 sec)

mysql> SELECT * FROM information_schema.innodb_ft_index_cache ORDER BY position;
+-----------+--------------+-------------+-----------+--------+----------+
| WORD      | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+-----------+--------------+-------------+-----------+--------+----------+
| にっこ    |            2 |           2 |         1 |      2 |        0 |
| っこに    |            2 |           2 |         1 |      2 |        3 |
| こにっ    |            2 |           2 |         1 |      2 |        6 |
| にっこ    |            2 |           2 |         1 |      2 |        9 |
| っこに    |            2 |           2 |         1 |      2 |        9 |
| こにー    |            2 |           2 |         1 |      2 |       15 |
+-----------+--------------+-------------+-----------+--------+----------+
6 rows in set (0.00 sec)

mysql> SELECT * FROM information_schema.innodb_ft_config;
+---------------------------+-------+
| KEY                       | VALUE |
+---------------------------+-------+
| optimize_checkpoint_limit | 180   |
| synced_doc_id             | 0     |
| stopword_table_name       |       |
| use_stopword              | 1     |
+---------------------------+-------+
4 rows in set (0.01 sec)

あ、あれ? 再作成しなくても切りなおされるの?;

0 件のコメント :

コメントを投稿