Replacing a Failed Control Plane Node in a HA Kubernetes Cluster

We have a failed control plane node in our highly available multi-master Kubernetes cluster that we need to replace.

Before We Begin

We are using our Kubernetes homelab in this article.

We have a failed control plane node, srv31, that needs to be removed from the cluster and replaced with a new node.

Pre-check validation:

$ kubectl get no
NAME    STATUS    ROLES           AGE    VERSION
srv31   NotReady  control-plane   375d   v1.26.4
srv32   Ready     control-plane   327d   v1.26.4
srv33   Ready     control-plane   456d   v1.26.4
srv34   Ready     <none>          456d   v1.26.4
srv35   Ready     <none>          327d   v1.26.4
srv36   Ready     <none>          456d   v1.26.4
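
Optionally, before removing anything, it can be worth checking why srv31 went NotReady in case the problem is recoverable. The commands below are a generic sketch of such a check, not something specific to this outage:

$ kubectl describe node srv31 | grep -A 10 "Conditions:"
$ kubectl get events --field-selector involvedObject.name=srv31 --sort-by=.lastTimestamp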

Also, we are going to use the etcd client, etcdctl. If you don't have it installed, download it using the commands below:

$ ETCD_VER=v3.5.9
$ GITHUB_URL=https://github.com/etcd-io/etcd/releases/download
$ DOWNLOAD_URL=${GITHUB_URL}
$ mkdir -p /tmp/etcd-download-test
$ curl -fsSL ${DOWNLOAD_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz -o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
$ tar xzvf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz -C /tmp/etcd-download-test --strip-components=1
$ rm -f /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
$ sudo cp /tmp/etcd-download-test/etcdctl /usr/local/bin/
$ etcdctl version
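
The etcdctl commands that follow all pass the same endpoint and TLS flags explicitly. As an optional convenience (assuming you run everything from the same working control plane), etcdctl also honours the equivalent environment variables, so you could export them once instead. The full flags are still shown in every command below, so this step can be skipped:

$ export ETCDCTL_API=3
$ export ETCDCTL_ENDPOINTS=127.0.0.1:2379
$ export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
$ export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
$ export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key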

Remove an Unhealthy ETCD Member

Check ETCD member status on a working control plane:

$ ETCDCTL_API=3 etcdctl \
  --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  member list
c36952e9f5bf4f49, started, srv33, https://10.11.1.33:2380, https://10.11.1.33:2379, false
df4ce5503d32478a, started, srv31, https://10.11.1.31:2380, https://10.11.1.31:2379, false
e279a8288f4be237, started, srv32, https://10.11.1.32:2380, https://10.11.1.32:2379, false
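
As an optional sanity check, endpoint health queried across the whole member list should report srv32 and srv33 as healthy and fail for srv31 (the exact error depends on how the node failed):

$ ETCDCTL_API=3 etcdctl \
  --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster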

Remove the srv31 ETCD member using its member ID:

$ ETCDCTL_API=3 etcdctl \
  --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  member remove df4ce5503d32478a
Member df4ce5503d32478a removed from cluster 53e3f96426ba03f3

Check ETCD member status again and make sure that member srv31 is no longer listed:

$ ETCDCTL_API=3 etcdctl \
  --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  member list
c36952e9f5bf4f49, started, srv33, https://10.11.1.33:2380, https://10.11.1.33:2379, false
e279a8288f4be237, started, srv32, https://10.11.1.32:2380, https://10.11.1.32:2379, false

Replace Failed Control Plane

Drain and delete the failed control plane node srv31. Since the node is NotReady and will typically still have DaemonSet-managed pods assigned to it, drain needs a couple of extra flags:

$ kubectl drain srv31 --ignore-daemonsets --delete-emptydir-data
$ kubectl delete node srv31
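
If you plan to rebuild on the same machine rather than provision a fresh server, and the host is still reachable, wipe the old kubeadm state on it first. A minimal clean-up sketch (kubeadm reset does not remove CNI configuration, hence the extra rm):

$ sudo kubeadm reset -f
$ sudo rm -rf /etc/cni/net.d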

Now we are ready to add a new control plane node.

Use your deployment pipeline (Ansible/Packer/Terraform/etc) to replace the broken control plane server with a new one.

To create a new certificate key and upload the control plane certificates to the cluster, run the following command on a working control plane (either srv32 or srv33). The uploaded certificates are deleted after two hours, so the new node should be joined within that window:

$ sudo kubeadm init phase upload-certs --upload-certs
ce34e277ab5b795e8b559d1aa8b2d243fd284acb193fb490b26ee9a695d0ccfe

Print the full kubeadm join command needed to join the cluster as a control plane (run this on a working control plane), using the certificate key from above:

$ sudo kubeadm token create --print-join-command --certificate-key ce34e277ab5b795e8b559d1aa8b2d243fd284acb193fb490b26ee9a695d0ccfe
kubeadm join kubelb.hl.test:6443 --token sqsh63.jw2p7kq6cy0cm7u5 --discovery-token-ca-cert-hash sha256:e98d5740c0ff6d5fd567cba755e27ea57fcc06fd694436a90ad632813351aae1 --control-plane --certificate-key ce34e277ab5b795e8b559d1aa8b2d243fd284acb193fb490b26ee9a695d0ccfe

Join the new control plane srv31 to the cluster:

$ sudo kubeadm join kubelb.hl.test:6443 \
  --token sqsh63.jw2p7kq6cy0cm7u5 \
  --discovery-token-ca-cert-hash sha256:e98d5740c0ff6d5fd567cba755e27ea57fcc06fd694436a90ad632813351aae1 \
  --control-plane \
  --certificate-key ce34e277ab5b795e8b559d1aa8b2d243fd284acb193fb490b26ee9a695d0ccfe
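
If you want to run kubectl from the new node itself, set up the kubeconfig as kubeadm suggests at the end of a successful join:

$ mkdir -p $HOME/.kube
$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config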

Verify:

$ kubectl get no
NAME    STATUS   ROLES           AGE    VERSION
srv31   Ready    control-plane   84s    v1.26.4
srv32   Ready    control-plane   327d   v1.26.4
srv33   Ready    control-plane   456d   v1.26.4
srv34   Ready    <none>          456d   v1.26.4
srv35   Ready    <none>          327d   v1.26.4
srv36   Ready    <none>          456d   v1.26.4
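
It is also worth confirming that the control plane static pods (kube-apiserver, kube-controller-manager, kube-scheduler and etcd) came up on the new node, for example:

$ kubectl get pods -n kube-system -o wide | grep srv31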

Check ETCD membership:

$ ETCDCTL_API=3 etcdctl \
  --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  member list
c36952e9f5bf4f49, started, srv33, https://10.11.1.33:2380, https://10.11.1.33:2379, false
c44657d8f6e7dea5, started, srv31, https://10.11.1.31:2380, https://10.11.1.31:2379, false
e279a8288f4be237, started, srv32, https://10.11.1.32:2380, https://10.11.1.32:2379, false
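
As a final optional check, endpoint status across the cluster shows per-member details such as the current leader and raft term (output omitted here as it will vary per cluster):

$ ETCDCTL_API=3 etcdctl \
  --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  endpoint status --cluster -w table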
