2019-07-30

etcd machine failure recovery

Scenario: one etcd machine has gone down and a new etcd machine must be added in its place.

The original cluster:

$ ETCDCTL_API=3  etcdctl member list --endpoints=ip1:12379
50a54bbf6cce9b8a, started, infra1, http://ip1:12380, http://ip1:12379
746056696c46311f, started, infra2, http://1ip2:12380, http://1ip2:12379
e800fc084ec7c5ef, started, infra3, http://ip3:12380, http://ip3:12379

Suppose ip1 goes down and a new machine, ip4, is added to replace it.
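
The steps that follow need the ID of the failed member. As a minimal sketch, it can be pulled out of the `member list` output by name with awk. The function name `member_id_by_name` is made up for illustration, and the sample text is simply the listing shown above, so no live cluster is needed to try it.

```shell
# Print the ID of the member whose name equals $1.
# Reads `etcdctl member list` output (default comma-separated format) on stdin.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Sample input copied from the listing above.
sample='50a54bbf6cce9b8a, started, infra1, http://ip1:12380, http://ip1:12379
746056696c46311f, started, infra2, http://1ip2:12380, http://1ip2:12379
e800fc084ec7c5ef, started, infra3, http://ip3:12380, http://ip3:12379'

printf '%s\n' "$sample" | member_id_by_name infra1   # prints 50a54bbf6cce9b8a
```

Against a real cluster the same function could be fed from a healthy node, for example `ETCDCTL_API=3 etcdctl member list --endpoints=1ip2:12379 | member_id_by_name infra1`.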

1. Point --endpoints at a healthy node and update the failed member's ID to use the new IP. Only the peer URL (port 12380) needs to change.

ETCDCTL_API=3 etcdctl --endpoints=1ip2:12379 member update 50a54bbf6cce9b8a --peer-urls="http://ip4:12380"

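2. Because the cluster uses static configuration, edit the etcd_install2.sh script on the other two etcd machines and replace ip1's IP with ip4's IP.

Then restart etcd on those two machines.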
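
The IP swap in step 2 could be scripted roughly as follows. This is only a sketch: the contents of etcd_install2.sh are not shown in this note, so the demo runs on a throwaway file with an assumed --initial-cluster line, and the addresses 10.0.0.1 to 10.0.0.4 and the function name `replace_ip` are placeholders.

```shell
# Replace every occurrence of the old IP ($1) with the new IP ($2) in file $3,
# keeping a backup as <file>.bak. Dots in the old IP are escaped so they match
# literally. Caveat: 10.0.0.1 would also match the prefix of 10.0.0.10.
replace_ip() {
  old_re=$(printf '%s' "$1" | sed 's/\./\\./g')
  sed -i.bak "s/${old_re}/$2/g" "$3"
}

# Demo on a temporary file standing in for etcd_install2.sh (assumed content).
demo=$(mktemp)
printf '%s\n' '--initial-cluster infra1=http://10.0.0.1:12380,infra2=http://10.0.0.2:12380,infra3=http://10.0.0.3:12380' > "$demo"
replace_ip 10.0.0.1 10.0.0.4 "$demo"
cat "$demo"
```

Reviewing the diff between the script and its .bak copy before restarting etcd is a cheap safeguard.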

3. Start etcd on the new node. Note that in etcd_install2.sh, --initial-cluster-state must be changed from new to existing.
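
A sketch of the state change in step 3. The flag names are standard etcd flags, but the exact command line inside etcd_install2.sh is an assumption, as is the function name `use_existing_state`. The demo edits a temporary file rather than the real script.

```shell
# Switch --initial-cluster-state from "new" to "existing" in file $1.
# Handles both "--initial-cluster-state new" and "--initial-cluster-state=new".
# A backup is kept as <file>.bak.
use_existing_state() {
  sed -i.bak 's/\(--initial-cluster-state[ =]\)new/\1existing/' "$1"
}

# Assumed example of what the start command on the new node (ip4) might look like.
demo=$(mktemp)
cat > "$demo" <<'EOF'
etcd --name infra1 \
  --listen-peer-urls http://ip4:12380 \
  --initial-advertise-peer-urls http://ip4:12380 \
  --listen-client-urls http://ip4:12379 \
  --advertise-client-urls http://ip4:12379 \
  --initial-cluster infra1=http://ip4:12380,infra2=http://1ip2:12380,infra3=http://ip3:12380 \
  --initial-cluster-state new
EOF

use_existing_state "$demo"
grep -- '--initial-cluster-state' "$demo"
```

With existing, the new node joins the running cluster instead of trying to bootstrap a fresh one.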

Check the result:

$ ETCDCTL_API=3  etcdctl member list --endpoints=1ip2:12379
50a54bbf6cce9b8a, started, infra1, http://ip4:12380, http://ip4:12379
746056696c46311f, started, infra2, http://1ip2:12380, http://1ip2:12379
e800fc084ec7c5ef, started, infra3, http://ip3:12380, http://ip3:12379


$ ETCDCTL_API=3  etcdctl endpoint health  --endpoints=ip4:12379
ip4:12379 is healthy: successfully committed proposal: took = 1.836639ms

The newly started node syncs its data from the other two nodes.
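
Since the new node may need a moment to catch up, the health check can be polled instead of run once. Below is a small generic retry helper as a sketch. The name `retry` and the attempt and delay values are arbitrary choices, and it relies on `etcdctl endpoint health` returning a non-zero exit status while the endpoint is unhealthy.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most ATTEMPTS times, sleeping
# DELAY_SECONDS after each failed attempt. Returns 0 on success, 1 otherwise.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage against the new node (needs a reachable cluster, so left commented out):
# retry 30 2 env ETCDCTL_API=3 etcdctl endpoint health --endpoints=ip4:12379
```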

4. Start scan_monitor and switch the domain name.

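Later I tried another approach:
remove the member first and then add it back. This way the other two etcd nodes do not need to be restarted.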

ETCDCTL_API=3 etcdctl --endpoints=1ip2:12379 member remove 50a54bbf6cce9b8a

ETCDCTL_API=3 etcdctl --endpoints=1ip2:12379 member add infra1 --peer-urls="http://ip4:12380"
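
The remove-then-add sequence can be wrapped in a small helper, sketched below. The names `run` and `replace_member` and the DRY_RUN switch are made up for illustration. The endpoint passed in should be a node that is still healthy (1ip2 is used here on that assumption). With DRY_RUN=1 the commands are only printed, so the sketch can be tried without a cluster.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# replace_member HEALTHY_ENDPOINT OLD_MEMBER_ID NAME NEW_PEER_URL
# Removes the failed member, then registers the replacement.
# The add is skipped if the remove fails.
replace_member() {
  run env ETCDCTL_API=3 etcdctl --endpoints="$1" member remove "$2" &&
  run env ETCDCTL_API=3 etcdctl --endpoints="$1" member add "$3" --peer-urls="$4"
}

DRY_RUN=1
replace_member 1ip2:12379 50a54bbf6cce9b8a infra1 http://ip4:12380
```

After `member add`, the member is registered under a new ID, and etcd on the new machine still has to be started with --initial-cluster-state existing and an empty data directory before it shows up as started.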