etcd machine failure recovery

Scenario: one etcd machine has gone down and a new etcd machine has to be added in its place.

The cluster before the failure:

$ ETCDCTL_API=3  etcdctl member list --endpoints=ip1:12379
50a54bbf6cce9b8a, started, infra1, http://ip1:12380, http://ip1:12379
746056696c46311f, started, infra2, http://1ip2:12380, http://1ip2:12379
e800fc084ec7c5ef, started, infra3, http://ip3:12380, http://ip3:12379
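
The recovery steps need the ID of the failed member, which is the first column of the listing above. As a minimal sketch (the helper name member_id_by_name is hypothetical, and it assumes the comma-separated output format shown here), the ID can be looked up by member name:

```shell
# Print the ID of the member whose name equals $1.
# Reads `etcdctl member list` output (fields separated by ", ") from stdin.
# Field 1 is the member ID, field 3 is the member name.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Intended usage against a healthy endpoint (not executed here):
#   ETCDCTL_API=3 etcdctl member list --endpoints=ip3:12379 | member_id_by_name infra1
```

Matching on the name column rather than on the IP keeps the lookup valid after the peer URL has been changed.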

Suppose ip1 has gone down and a new machine, ip4, is added to replace it.

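1. Point --endpoints at a healthy node and update the failed member (by its ID) to the new IP. Only the peer URL on port 12380 needs to be changed.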

ETCDCTL_API=3  etcdctl --endpoints=1ip2:12379   member update 50a54bbf6cce9b8a  --peer-urls="http://ip4:12380"

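2. Because the cluster uses static configuration, edit the etcd_install2.sh script on the other two etcd machines and replace the old ip1 address with the ip4 address.

Then restart etcd on those two machines.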

3. Start etcd on the new node. Note that in etcd_install2.sh the flag --initial-cluster-state new must be changed to --initial-cluster-state existing.
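
The contents of etcd_install2.sh are not shown here, so the following is only a sketch of how the static --initial-cluster value for the new node could be derived from the updated member list. The helper name initial_cluster_from_members is hypothetical, and it assumes every member has a name and exactly one peer URL.

```shell
# Build the value for etcd's --initial-cluster flag.
# Reads `etcdctl member list` output (fields separated by ", ") from stdin
# and joins each member as name=peerURL, separated by commas.
initial_cluster_from_members() {
  awk -F', ' '{ printf "%s%s=%s", (NR > 1 ? "," : ""), $3, $4 } END { print "" }'
}

# Intended usage on the new node (not executed here; other etcd flags omitted):
#   cluster=$(ETCDCTL_API=3 etcdctl member list --endpoints=ip3:12379 | initial_cluster_from_members)
#   etcd --name infra1 --initial-cluster "$cluster" --initial-cluster-state existing
```

Generating the string from the live member list avoids hand-editing mistakes when the old IP is swapped for the new one.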

Verify:

$ ETCDCTL_API=3  etcdctl member list --endpoints=1ip2:12379
50a54bbf6cce9b8a, started, infra1, http://ip4:12380, http://ip4:12379
746056696c46311f, started, infra2, http://1ip2:12380, http://1ip2:12379
e800fc084ec7c5ef, started, infra3, http://ip3:12380, http://ip3:12379


$ ETCDCTL_API=3 etcdctl endpoint health --endpoints=ip4:12379
ip4:12379 is healthy: successfully committed proposal: took = 1.836639ms

The newly started node synchronizes its data from the other two nodes.
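
Since the new node needs some time to catch up, the health check may have to be repeated. A small retry helper is one way to do that; this is a sketch, the name wait_until is hypothetical, and it assumes etcdctl endpoint health exits with a non-zero status while the endpoint is unhealthy.

```shell
# Retry a command until it succeeds or the attempts run out.
# $1 = maximum attempts, $2 = seconds to sleep between attempts,
# remaining arguments = the command to run.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Intended usage (not executed here):
#   wait_until 30 2 env ETCDCTL_API=3 etcdctl endpoint health --endpoints=ip4:12379
```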

4. Start scan_monitor and switch the domain name over.

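Later I tried another approach:
remove the failed member first and then add the new one. This way the other two etcd nodes do not need to be restarted.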

ETCDCTL_API=3 etcdctl --endpoints=1ip2:12379 member remove 50a54bbf6cce9b8a

ETCDCTL_API=3 etcdctl --endpoints=1ip2:12379 member add infra1 --peer-urls="http://ip4:12380"
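
The remove-then-add sequence can be wrapped in a small helper that only prints the commands, so they can be reviewed before being run. This is a sketch: the function name replace_member_cmds is hypothetical, and the endpoint passed in is assumed to belong to a node that is still healthy, since the failed node cannot answer requests.

```shell
# Print (do not run) the two commands that replace a failed member.
# $1 = healthy endpoint, $2 = failed member ID,
# $3 = member name, $4 = peer URL of the new machine.
replace_member_cmds() {
  printf 'ETCDCTL_API=3 etcdctl --endpoints=%s member remove %s\n' "$1" "$2"
  printf 'ETCDCTL_API=3 etcdctl --endpoints=%s member add %s --peer-urls=%s\n' "$1" "$3" "$4"
}

# Example:
#   replace_member_cmds ip3:12379 50a54bbf6cce9b8a infra1 http://ip4:12380
```

After the member has been added, etcd on the new machine still has to be started with --initial-cluster-state existing, as in the first approach.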