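In a 3-node etcd cluster, two nodes went down when their hosts lost power and could not start again afterwards. Their logs showed the following error: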
2021-05-16 18:01:14.414073 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2021-05-16 18:01:14.414095 I | embed: peerTLS: cert = ./certs/etcd-peer.pem, key = ./certs/etcd-peer-key.pem, ca = ./certs/ca.pem, trusted-ca = ./certs/ca.pem, client-cert-auth = true
2021-05-16 18:01:14.414712 I | embed: listening for peers on https://10.4.7.22:2380
2021-05-16 18:01:14.414776 W | embed: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
2021-05-16 18:01:14.414781 W | embed: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
2021-05-16 18:01:14.414805 I | embed: listening for client requests on 127.0.0.1:2379
2021-05-16 18:01:14.414822 I | embed: listening for client requests on 10.4.7.22:2379
panic: invalid freelist page: 229, page type is leaf
goroutine 118 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*freelist).read(0xc42021de30, 0x7f76c5665000)
/tmp/etcd-release-3.1.20/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/freelist.go:237 +0x35b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*DB).loadFreelist.func1()
/tmp/etcd-release-3.1.20/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/db.go:290 +0x1bf
sync.(*Once).Do(0xc420254150, 0xc420042dc8)
/usr/local/google/home/jpbetz/.gvm/gos/go1.8.7/src/sync/once.go:44 +0xbe
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*DB).loadFreelist(0xc420254000)
/tmp/etcd-release-3.1.20/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/db.go:293 +0x57
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.Open(0xc42021dd10, 0x25, 0x180, 0x1363fa0, 0x0, 0x0, 0x0)
/tmp/etcd-release-3.1.20/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/db.go:260 +0x3f6
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.newBackend(0xc42021dd10, 0x25, 0x5f5e100, 0x2710, 0xd01060)
/tmp/etcd-release-3.1.20/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:112 +0x61
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.NewDefaultBackend(0xc42021dd10, 0x25, 0xa72749, 0x432538)
/tmp/etcd-release-3.1.20/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:108 +0x4d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer.func1(0xc420247020, 0xc42021dd10, 0x25, 0xc42006fda0)
/tmp/etcd-release-3.1.20/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:275 +0x39
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer
/tmp/etcd-release-3.1.20/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:277 +0x4bc
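Judging from the log, the data is corrupted: the panic comes from bbolt rejecting the freelist page while opening the backend database. Only one healthy node is left in the cluster, which is below the quorum of 2 (n/2 + 1, rounded down, for n = 3), so the cluster cannot work and etcdctl commands fail: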
# etcdctl --endpoints=https://10.4.7.21:2379 --key=/opt/etcd/certs/etcd-peer-key.pem --cert=/opt/etcd/certs/etcd-peer.pem --cacert=/opt/etcd/certs/ca.pem member list
2021-05-16 18:16:50.011903 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
Error: grpc: timed out when dialing
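The quorum arithmetic behind this failure can be sketched in a few lines of shell. The function names `quorum` and `has_quorum` are made up for illustration and are not part of etcd or etcdctl.

```shell
# Quorum for an n-member cluster is floor(n/2) + 1.
quorum() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# Exit status 0 when the healthy members can still form a quorum.
has_quorum() {
  local healthy=$1 members=$2
  [ "$healthy" -ge "$(quorum "$members")" ]
}

# The situation in this post: 1 healthy member out of 3.
if has_quorum 1 3; then
  echo "quorum intact"
else
  echo "quorum lost: 1 of 3 healthy, need $(quorum 3)"
fi
```

With 3 members the quorum is 2, so losing two nodes at once leaves the remaining node unable to commit anything, including membership changes.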
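Because fewer than a quorum of members are healthy, new members cannot be added to the cluster. The only way out is to start a new etcd cluster from the data on the healthy node, then add the other two nodes back one at a time.
To keep things simple, call the 3 nodes A, B and C. At this point A (etcd-server-7-21) is the healthy one, B is etcd-server-7-22 and C is etcd-server-7-12.
The recovery steps are as follows:
On A:
Stop etcd
Modify the startup parameters: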
--initial-cluster etcd-server-7-21=https://10.4.7.21:2380 \
--force-new-cluster \
Start etcd
A is now running a single-node etcd cluster, and the data is still the original data
Add B to the cluster:
export PATH=$PATH:/opt/etcd
export ETCDCTL_API=3
etcdctl --endpoints=https://10.4.7.21:2379 --key=/opt/etcd/certs/etcd-peer-key.pem --cert=/opt/etcd/certs/etcd-peer.pem --cacert=/opt/etcd/certs/ca.pem member add etcd-server-7-22 --peer-urls=https://10.4.7.22:2380
etcdctl --endpoints=https://10.4.7.21:2379 --key=/opt/etcd/certs/etcd-peer-key.pem --cert=/opt/etcd/certs/etcd-peer.pem --cacert=/opt/etcd/certs/ca.pem member list
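The etcdctl commands in this post repeat the same endpoint and TLS flags every time. A small wrapper can cut that down; this is only a convenience sketch, and the names `ectl`, `ETCD_ENDPOINT` and `ETCD_CERT_DIR` are hypothetical, while the paths and flags are the ones used above.

```shell
# Hypothetical helper: wraps etcdctl with the endpoint and TLS flags used in this post.
export ETCDCTL_API=3
ETCD_ENDPOINT=https://10.4.7.21:2379
ETCD_CERT_DIR=/opt/etcd/certs

ectl() {
  etcdctl \
    --endpoints="$ETCD_ENDPOINT" \
    --key="$ETCD_CERT_DIR/etcd-peer-key.pem" \
    --cert="$ETCD_CERT_DIR/etcd-peer.pem" \
    --cacert="$ETCD_CERT_DIR/ca.pem" \
    "$@"
}

# Usage (needs a reachable etcd, so left commented out here):
# ectl member add etcd-server-7-22 --peer-urls=https://10.4.7.22:2380
# ectl member list
```

Everything after `ectl` is passed through unchanged, so any etcdctl subcommand works the same way.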
On B:
Modify the startup parameters:
--initial-cluster etcd-server-7-21=https://10.4.7.21:2380,etcd-server-7-22=https://10.4.7.22:2380 \
--initial-cluster-state existing \
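Rename the original data directory /data/etcd/etcd-server/member, so that B starts with an empty data directory instead of the corrupted one
Start etcd
The cluster now has 2 members, and B syncs its data from A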
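When a node joins an existing cluster, its --initial-cluster value has to list the members the cluster knows about at that moment, including the node itself. That is why B lists only A and B. As a small sketch, the value can be assembled from name=peer-url pairs; the function name `initial_cluster` is hypothetical.

```shell
# Join "name=peer-url" pairs with commas to form the --initial-cluster value.
initial_cluster() {
  local IFS=,
  echo "$*"
}

# Value for B at the point where the cluster contains only A and B.
initial_cluster \
  "etcd-server-7-21=https://10.4.7.21:2380" \
  "etcd-server-7-22=https://10.4.7.22:2380"
```

This prints the same string that is passed to --initial-cluster on B above.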
Add C to the cluster:
etcdctl --endpoints=https://10.4.7.21:2379 --key=/opt/etcd/certs/etcd-peer-key.pem --cert=/opt/etcd/certs/etcd-peer.pem --cacert=/opt/etcd/certs/ca.pem member add etcd-server-7-12 --peer-urls=https://10.4.7.12:2380
etcdctl --endpoints=https://10.4.7.21:2379 --key=/opt/etcd/certs/etcd-peer-key.pem --cert=/opt/etcd/certs/etcd-peer.pem --cacert=/opt/etcd/certs/ca.pem member list
On C:
Modify the startup parameters:
--initial-cluster-state existing \
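Rename the original data directory /data/etcd/etcd-server/member
Start etcd
The cluster now has 3 members; C syncs its data from the leader, and the cluster is highly available again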
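Both B and C rename their old member directory before starting. A minimal sketch of that step, which keeps the old data under a timestamped name in case it is needed for inspection later; the function name `move_member_dir_aside` and the `.bak.<timestamp>` suffix are assumptions, not something the steps above prescribe.

```shell
# Move an etcd member directory aside under a timestamped name; prints the new path.
move_member_dir_aside() {
  local dir=$1
  local backup="$dir.bak.$(date +%Y%m%d%H%M%S)"
  # Refuse to continue if the directory is not there.
  [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
  mv "$dir" "$backup" && echo "$backup"
}

# Usage on B or C, with etcd stopped (path taken from this post):
# move_member_dir_aside /data/etcd/etcd-server/member
```

Renaming rather than deleting keeps the step reversible until the node has rejoined and caught up.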
Original: https://www.cnblogs.com/yannwang/p/14855941.html