Configure Kubelet Out of Resource Handling, or How to Stop EKS Kubernetes Nodes from Going Down

The kubelet supports eviction thresholds that, when crossed, trigger it to reclaim resources by evicting pods. We are going to configure them on our EKS worker nodes.

EKS Worker Nodes and Bootstrap Script

We use the EKS-optimised Linux AMI that comes packaged with the /etc/eks/bootstrap.sh script for registering worker nodes to the EKS cluster.

We use Terraform to look up the AMI ID for the EKS worker nodes:

data "aws_ssm_parameter" "worker_ami" {
  name = "/aws/service/eks/optimized-ami/${var.eks_version}/amazon-linux-2/recommended/image_id"
}
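
The same parameter can be read outside Terraform, which is handy for checking what the data source will return. A minimal sketch with the AWS CLI, assuming the CLI is installed and has credentials and a default region; the function names and the example version number are hypothetical:

```shell
#!/bin/bash
# Build the SSM parameter name for the recommended EKS-optimised
# Amazon Linux 2 AMI. The function name is illustrative only.
eks_ami_parameter_name() {
    local eks_version="$1"
    printf '%s\n' "/aws/service/eks/optimized-ami/${eks_version}/amazon-linux-2/recommended/image_id"
}

# Look up the AMI ID. Requires AWS credentials and a default region.
eks_ami_id() {
    aws ssm get-parameter \
        --name "$(eks_ami_parameter_name "$1")" \
        --query 'Parameter.Value' \
        --output text
}

# Example (the version number is a placeholder):
#   eks_ami_id 1.14
```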

Kubernetes Resource Management

In our experience, unless resources are set aside for system daemons, Kubernetes pods and system daemons compete for the same resources, which eventually leads to resource starvation on EKS worker nodes.

kube-reserved

The kube-reserved setting reserves resources for Kubernetes system daemons such as the kubelet.

We want to set the following:

--kube-reserved memory=0.3Gi
--kube-reserved ephemeral-storage=1Gi

system-reserved

The system-reserved setting reserves resources for OS system daemons such as udev.

We want to set the following:

--system-reserved memory=0.3Gi
--system-reserved ephemeral-storage=1Gi

Eviction Thresholds

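To keep the node from running out of memory, the kubelet provides out-of-resource management. Currently evictions are supported for memory and ephemeral-storage only. When a hard eviction threshold is crossed, the kubelet evicts pods immediately, with no grace period.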

We want to set the following:

--eviction-hard memory.available<200Mi
--eviction-hard nodefs.available<10%
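
Together, the reservations and the hard eviction threshold determine how much memory is left for pods: allocatable = capacity - kube-reserved - system-reserved - hard eviction threshold. A rough sketch of that arithmetic, assuming a hypothetical node with 8Gi (8192Mi) of memory; the function name is made up for illustration:

```shell
#!/bin/bash
# Estimate node allocatable memory in Mi.
# Arguments: capacity (Mi), kube-reserved (Gi), system-reserved (Gi),
#            hard eviction threshold (Mi).
allocatable_memory_mi() {
    awk -v cap="$1" -v kube="$2" -v sys="$3" -v evict="$4" \
        'BEGIN { printf "%.1f\n", cap - kube * 1024 - sys * 1024 - evict }'
}

# Hypothetical 8Gi node with the values used in this post.
allocatable_memory_mi 8192 0.3 0.3 200
```

For these inputs the estimate is about 7.2Gi. The capacity a node actually reports is usually a little below the instance's nominal memory, so treat the result as an estimate and compare it with the Allocatable section of kubectl describe node.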

EKS Worker Node User Data

We can use EC2 instance User Data to set bootstrap parameters when creating EKS worker nodes with Terraform:

#!/bin/bash -xe
/etc/eks/bootstrap.sh \
    --kubelet-extra-args "--kube-reserved memory=0.3Gi,ephemeral-storage=1Gi --system-reserved memory=0.3Gi,ephemeral-storage=1Gi --eviction-hard memory.available<200Mi,nodefs.available<10%" \
    ${ClusterName}

With these flags in place, the kubelet should enforce out-of-resource handling on each worker node.
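
The single long --kubelet-extra-args string in the user data script is easy to mistype. One option is to assemble it from variables before calling the bootstrap script. A sketch, in which the variable and function names are hypothetical and the values are the ones chosen earlier in this post:

```shell
#!/bin/bash
# Values chosen earlier in this post.
KUBE_RESERVED="memory=0.3Gi,ephemeral-storage=1Gi"
SYSTEM_RESERVED="memory=0.3Gi,ephemeral-storage=1Gi"
EVICTION_HARD="memory.available<200Mi,nodefs.available<10%"

# Join the three settings into one kubelet argument string.
kubelet_extra_args() {
    printf '%s\n' "--kube-reserved ${KUBE_RESERVED} --system-reserved ${SYSTEM_RESERVED} --eviction-hard ${EVICTION_HARD}"
}

# In the user data script this would be passed on as:
#   /etc/eks/bootstrap.sh --kubelet-extra-args "$(kubelet_extra_args)" ${ClusterName}
```

If the script is rendered through a Terraform template, shell-style ${...} references have to be escaped as $${...} so that Terraform leaves them for the shell to expand.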

References

https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
