2020/03/03

壊したマルチプライマリーモード、その後に

TL;DR

  • グループレプリケーションが他のノードから受け取ったクエリーは hostname-relay-bin-group_replication_applier* というリレーログに保管される
    • グループレプリケーションといえどリレーログなので再起動とかで勝手に消えはしない
      • RESET SLAVE で消せる。
      • 【2020/03/09 16:34】relay_log_recovery = ON でも消せる
  • グループレプリケーションの復旧はかなり面倒なので、一回クラスター破棄して作り直した方が楽かも

日々の覚書: MultiPrimaryModeのGroup Replication環境を崩壊させるテスト で壊したあとの残骸を眺めて色々考えるメモ

### node1 (最初に ALTER TABLE を開始
$ mysql -e "SHOW MASTER STATUS"
+---------------+-----------+--------------+------------------+---------------------------------------------+
| File          | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                           |
+---------------+-----------+--------------+------------------+---------------------------------------------+
| binlog.000004 | 834138015 |              |                  | 5fa73b33-5d09-11ea-b37d-1272d9b5f033:1-3800 |
+---------------+-----------+--------------+------------------+---------------------------------------------+

### node2 (node1のALTER TABLE中にALTER TABLEを開始
$ mysql -e "SHOW MASTER STATUS"
+---------------+-----------+--------------+------------------+-----------------------------------------------------+
| File          | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                   |
+---------------+-----------+--------------+------------------+-----------------------------------------------------+
| binlog.000003 | 834129853 |              |                  | 5fa73b33-5d09-11ea-b37d-1272d9b5f033:1-3799:1000049 |
+---------------+-----------+--------------+------------------+-----------------------------------------------------+

### node3 (node1のALTER TABLEをグループレプリケーションで受け取ってからnode2のを受信
$ mysql -e "SHOW MASTER STATUS"
+---------------+-----------+--------------+------------------+---------------------------------------------+
| File          | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                           |
+---------------+-----------+--------------+------------------+---------------------------------------------+
| binlog.000003 | 834129853 |              |                  | 5fa73b33-5d09-11ea-b37d-1272d9b5f033:1-3800 |
+---------------+-----------+--------------+------------------+---------------------------------------------+
node1由来の 5fa73b33-5d09-11ea-b37d-1272d9b5f033:3800 と node2由来の 5fa73b33-5d09-11ea-b37d-1272d9b5f033:1000049 がバッティングするので、node1とnode3では :3380 の方、 node2では :1000049 の方だけがgtid_executedに入っている。
なあんだ、フツーに壊れた非同期レプリケーションと一緒じゃん? (フラグ)
node1とnode3を活かしてnode2を捨てて再構築すればいいんじゃん? (フラグ)
### node2
$ sudo systemctl stop mysqld

### node1
$ mysqlsh gradmin@localhost
 MySQL  localhost:33060+ ssl  JS > dba.rebootClusterFromCompleteOutage()
Dba.rebootClusterFromCompleteOutage: This function is not available through a session to an instance in error state (RuntimeError)
あれ…。 rebootClusterFromCompleteOutage って単に group_replication_bootstrap_group = ON にして START GROUP_REPLICATION してるだけかと思ったんだけどそれ以外にも何か見てるのかしらん、とエラーログを見に行く。
### node1
$ less /var/log/mysqld.log
..
2020-03-03T05:04:30.110909Z 16 [ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Error 'Duplicate key name 'idx_pad'' on query. Default database: 'sbtest'. Query
: 'ALTER TABLE sbtest1 ADD KEY idx_pad(pad)', Error_code: MY-001061
2020-03-03T05:04:30.112022Z 16 [Warning] [MY-010584] [Repl] Slave: Duplicate key name 'idx_pad' Error_code: MY-001061
2020-03-03T05:04:30.116461Z 16 [ERROR] [MY-011451] [Repl] Plugin group_replication reported: 'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.'
2020-03-03T05:04:30.116528Z 16 [ERROR] [MY-010586] [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'FIRST' position 0
2020-03-03T05:04:30.116710Z 13 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
2020-03-03T05:04:30.116811Z 13 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
エラー自体は非同期レプリケーションが壊れた時でも良く見るやつ。node2由来の :1000049 を再実行しようとして転けている…が、このnode2由来のものが「どこに残っていて」「どうやって消すのか」
取り敢えずnode1もmysqldを再起動してみましょうかね。
### node1
$ sudo systemctl restart mysqld

$ sudo tail -f /var/log/mysqld.log
..
2020-03-03T05:40:42.619773Z 2 [ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2020-03-03T05:40:42.619875Z 2 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'
起動後のグループレプリケーション再開が諦めてくれるまで待つ(それか group_replication_start_on_boot = OFF でも良かったか)
### node1
$ mysqlsh gradmin@localhost
 MySQL  localhost:33060+ ssl  JS > dba.rebootClusterFromCompleteOutage()
Reconfiguring the default cluster from complete outage...

Dba.rebootClusterFromCompleteOutage: Target member is in state ERROR (RuntimeError)
エラーメッセージは変わったもののまたエラー。
今度はエラーログ何も吐かずにエラー。
### node1
mysql> SET GLOBAL group_replication_bootstrap_group = 'ON';
Query OK, 0 rows affected (0.00 sec)

mysql> START GROUP_REPLICATION;
ERROR 3092 (HY000): The server is not configured properly to be an active member of the group. Please see more details on error log.
手で START GROUP_REPLICATION してもダメ。エラーログ見ろって書いてあるから見てみると、まだ結局node2由来の ALTER TABLE がどこかに残っていて引っ掛かる。
2020-03-03T05:52:07.431805Z 27 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2020-03-03T05:52:07.443379Z 30 [ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Error 'Duplicate key name 'idx_pad'' on query. Default database: 'sbtest'. Query: 'ALTER TABLE sbtest1 ADD KEY idx_pad(pad)', Error_code: MY-001061
2020-03-03T05:52:07.443412Z 30 [Warning] [MY-010584] [Repl] Slave: Duplicate key name 'idx_pad' Error_code: MY-001061
2020-03-03T05:52:07.443424Z 30 [ERROR] [MY-011451] [Repl] Plugin group_replication reported: 'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.'
2020-03-03T05:52:07.443436Z 30 [ERROR] [MY-010586] [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'FIRST' position 0
2020-03-03T05:52:07.445214Z 27 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
2020-03-03T05:52:07.445261Z 27 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'
2020-03-03T05:52:07.445279Z 27 [ERROR] [MY-011644] [Repl] Plugin group_replication reported: 'Unable to confirm whether the server has left the group or not. Check performance_schema.replication_group_members to check group membership information.'
2020-03-03T05:52:07.445297Z 27 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2020-03-03T05:52:09.590386Z 0 [ERROR] [MY-011502] [Repl] Plugin group_replication reported: 'There was a previous plugin error while the member joined the group. The member will now exit the group.'
CHANNEL 'group_replication_applier' 的なのが何か所に入っていて、「フツーのマルチソースレプリケーションだったらそんな名前のリレーログが出来上がるよな」と思っていたらビンゴだった。
### node1
$ ll /var/lib/mysql/*applier*
-rw-r-----. 1 mysql mysql 1074111468 Mar  3 04:51 /var/lib/mysql/node1-relay-bin-group_replication_applier.000002
-rw-r-----. 1 mysql mysql  834138348 Mar  3 05:04 /var/lib/mysql/node1-relay-bin-group_replication_applier.000003
-rw-r-----. 1 mysql mysql        122 Mar  3 04:51 /var/lib/mysql/node1-relay-bin-group_replication_applier.index

$ mysqlbinlog -vv /var/lib/mysql/node1-relay-bin-group_replication_applier.000003 | grep -C5 :1000049
# original_commit_timestamp=1583211870082250 (2020-03-03 05:04:30.082250 UTC)
# immediate_commit_timestamp=0 (1970-01-01 00:00:00.000000 UTC)
/*!80001 SET @@session.original_commit_timestamp=1583211870082250*//*!*/;
/*!80014 SET @@session.original_server_version=80019*//*!*/;
/*!80014 SET @@session.immediate_server_version=80019*//*!*/;
SET @@SESSION.GTID_NEXT= '5fa73b33-5d09-11ea-b37d-1272d9b5f033:1000049'/*!*/;
# at 834138215
#200303  5:01:33 server id 163517600  end_log_pos 0     Query   thread_id=50    exec_time=177   error_code=0    Xid = 7665
SET TIMESTAMP=1583211693/*!*/;
SET @@session.auto_increment_increment=7, @@session.auto_increment_offset=2/*!*/;
/*!80013 SET @@session.sql_require_primary_key=0*//*!*/;
「フツーにリレーログ消すなら RESET SLAVE だよな?」と思ったら消えた。
### node1
mysql> RESET SLAVE;

$ ll /var/lib/mysql/*applier*
-rw-r-----. 1 mysql mysql 236 Mar  3 06:19 /var/lib/mysql/node1-relay-bin-group_replication_applier.000001
-rw-r-----. 1 mysql mysql 628 Mar  3 06:19 /var/lib/mysql/node1-relay-bin-group_replication_applier.000002
-rw-r-----. 1 mysql mysql 122 Mar  3 06:19 /var/lib/mysql/node1-relay-bin-group_replication_applier.index
ただこれでも dba.rebootClusterFromCompleteOutage では戻せなかったので、手でグループレプリケーションを再起動。
### node1
mysql> STOP GROUP_REPLICATION;
Query OK, 0 rows affected (4.55 sec)

mysql> SET GLOBAL group_replication_bootstrap_group = 'ON';
Query OK, 0 rows affected (0.00 sec)

mysql> START GROUP_REPLICATION;
Query OK, 0 rows affected (3.13 sec)
これでようやく、 mysqlsh --cluster で認識できる状態になる。
node2は落っことしてるから良いとして、node3がMISSINGになってるな。
### node1
$ mysqlsh gradmin@localhost --cluster
 MySQL  localhost:33060+ ssl  JS > cluster.status()
{
    "clusterName": "myfabric",
    "defaultReplicaSet": {
        "name": "default",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE",
        "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active",
        "topology": {
            "node2:3306": {
                "address": "node2:3306",
                "mode": "n/a",
                "readReplicas": {},
                "role": "HA",
                "shellConnectError": "MySQL Error 2003 (HY000): Can't connect to MySQL server on 'node2' (111)",
                "status": "(MISSING)"
            },
            "node1:3306": {
                "address": "node1:3306",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": null,
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.19"
            },
            "node3:3306": {
                "address": "node3:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)"
            }
        },
        "topologyMode": "Multi-Primary"
    },
    "groupInformationSourceMember": "node1:3306"
}
テキトーに cluster.rejoinInstance でいいのかなと思ったら良いっぽい。
### node1
 MySQL  localhost:33060+ ssl  JS > cluster.rejoinInstance('gradmin@node3')
Rejoining the instance to the InnoDB cluster. Depending on the original
problem that made the instance unavailable, the rejoin operation might not be
successful and further manual steps will be needed to fix the underlying
problem.

Please monitor the output of the rejoin operation and take necessary action if
the instance cannot rejoin.

Rejoining instance to the cluster ...

Please provide the password for 'gradmin@node3': ******
Save password for 'gradmin@node3:3306'? [Y]es/[N]o/Ne[v]er (default No):
The instance 'node3' was successfully rejoined on the cluster.

 MySQL  localhost:33060+ ssl  JS > cluster.status()
{
    "clusterName": "myfabric",
    "defaultReplicaSet": {
        "name": "default",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE",
        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
        "topology": {
            "node2:3306": {
                "address": "node2:3306",
                "mode": "n/a",
                "readReplicas": {},
                "role": "HA",
                "shellConnectError": "MySQL Error 2003 (HY000): Can't connect to MySQL server on 'node2' (111)",
                "status": "(MISSING)"
            },
            "node1:3306": {
                "address": "node1:3306",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": null,
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.19"
            },
            "node3:3306": {
                "address": "node3:3306",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": "00:02:56.564037",
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.19"
            }
        },
        "topologyMode": "Multi-Primary"
    },
    "groupInformationSourceMember": "node1:3306"
}
とはいえこのあとnode2をクラスターから追放しようにもrejoinさせようにもいろいろ躓いたので、おとなしく cluster.dissolve して作り直した方が楽なんじゃないかな…

0 件のコメント :

コメントを投稿