Probes are health checks that the kubelet executes.
We’ve been running dockerised Java applications on Kubernetes for a while now, all with readiness probes configured. Thanks to Blue/Green deployments the application received frequent no-downtime upgrades, and every upgrade redeployed all pods. Each redeployment also reset the JVM’s memory, which hid the problem we’re about to discuss.
We recently started running stress tests against the application to push it to its limits, and the following happened.
A Java application inside a Docker container ran out of memory:
java.lang.OutOfMemoryError: Java heap space
The kubelet relies on readiness probes to determine whether the container is able to accept traffic. A pod is considered ready when all of its containers are ready. In our case when the Java container inside the pod fails its readiness checks, the pod is marked as not ready, and does not receive traffic through Kubernetes services.
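For illustration, a readiness probe of the kind described above might look like the sketch below. The endpoint, port and timings are assumptions chosen for the example, not the exact values from our deployment.

```yaml
# Hypothetical readiness probe: path, port and timings are illustrative assumptions.
readinessProbe:
  httpGet:
    path: /service/ping    # assumed health endpoint
    port: 8080             # assumed container port
    scheme: HTTP
  periodSeconds: 10        # probe every 10 s
  timeoutSeconds: 5        # no response within 5 s counts as a failure
  failureThreshold: 3      # 3 consecutive failures mark the container as not ready
```

When this probe fails, the pod is only taken out of the service endpoints. Nothing restarts the container.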
The problem is that the Java container is still running, but the application is unable to serve traffic. A failing readiness probe will not restart a container.
The system should be capable of recovering from such a state automatically.
Use liveness probes. A failing liveness probe will restart a container.
We are going to configure an HTTP probe that makes an HTTP call to the container, where the status code of the response determines whether the container is healthy. Every application is different, so the probe must be configured per application.
livenessProbe:
  httpGet:
    path: /service/ping
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 60
  timeoutSeconds: 5
  periodSeconds: 20
  successThreshold: 1
  failureThreshold: 6
It is crucial to understand that if a liveness probe keeps failing for roughly periodSeconds x failureThreshold (20 s x 6 = 120 s with the values above), for whatever reason, the container is marked as unhealthy and the kubelet restarts it. It is the container that gets restarted, not the whole pod. This may cause a crash loop if the threshold values are not configured correctly.
Do note that liveness probes do not wait for readiness probes to succeed. We want to wait before executing the first liveness probe, which is why we define initialDelaySeconds.
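To make the timing concrete, here is the same liveness probe annotated with an approximate timeline for a container that never becomes healthy. The timestamps are a rough sketch derived from the values above. The kubelet does not schedule probes to the exact second, so treat them as estimates.

```yaml
# Approximate timeline for a container that never answers the probe successfully.
livenessProbe:
  httpGet:
    path: /service/ping
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 60   # first probe roughly 60 s after the container starts
  timeoutSeconds: 5         # no response within 5 s counts as a failed probe
  periodSeconds: 20         # further probes at about 80 s, 100 s, 120 s, 140 s and 160 s
  successThreshold: 1       # a single success resets the consecutive-failure count
  failureThreshold: 6       # the 6th consecutive failure (around 160 s) triggers the restart
```

If the application routinely needs longer than this window to start up, the container will be restarted before it ever becomes healthy, which is exactly the crash loop mentioned above.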