▶What is Kubernetes Troubleshooting?
Kubernetes troubleshooting is the process of identifying, diagnosing, and resolving issues in Kubernetes clusters, nodes, pods, or containers.
It also includes ongoing management of faults and taking preventive measures to keep issues in Kubernetes components from recurring.
Kubernetes troubleshooting can be very complex. This article will focus on some of the common issues.
▶Get an OOMKilled error
Let's say that you have incorporated monitoring tools (such as Prometheus and Grafana) in your Kubernetes cluster, and you create a rule that identifies when pods become consistently unavailable. It sends a notification, through an automated phone call or chat message, informing you when your pods are not available.
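As a rough sketch, such a rule could be expressed as a Prometheus alerting rule like the one below. It assumes the kube_pod_container_status_restarts_total metric exposed by kube-state-metrics; the alert name, threshold, and durations are illustrative assumptions, not values prescribed by this article:
# Illustrative Prometheus alerting rule (assumes kube-state-metrics is installed)
groups:
- name: pod-availability
  rules:
  - alert: PodRestartingFrequently
    # Fire when a container restarts more than 3 times within 15 minutes
    expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} in {{ $labels.namespace }} is restarting frequently"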
If you run kubectl get pods and see that some pods are being restarted, the next thing to do is to check why. You can do this by using:
$ kubectl describe pod myPodName -n myNamespace
    State:          Running
      Started:      Sun, 19 Feb 2023 10:20:09 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Sun, 19 Feb 2022 17:00:39 +0000
      Finished:     Sun, 19 Feb 2022 17:20:08 +0000
    Restart Count:  7
OOMKilled means that the pod reached its memory limit, so its container was killed and restarted. You can see the restart count when you run the describe command. The obvious solution is to increase the memory limit. This can be done by running kubectl edit deployment myDeployment -n myNamespace and editing the memory limit.
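For reference, the memory limit lives in the resources section of the container spec, inside the Deployment's pod template. The following is a minimal sketch; the request and limit values are illustrative assumptions, not recommendations:
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"    # raise this if the pod keeps getting OOMKilled at the current limit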
This might have occurred because of a memory leak due to a bug in your application. It's important to look at your logs to see whether there is a valid reason for memory overload (for example, if the number of requests increased).
▶See sudden jumps in load and scale
It's important to create a metric (and display it on a dashboard) to track the number of requests your application receives per second. This gives you a sense of what's normal for your service.
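As an illustration, if your application exposes a request counter to Prometheus, the per-second request rate can be graphed with a query along these lines; the metric name http_requests_total is an assumption about how your application is instrumented:
# Requests per second, averaged over the last 5 minutes, summed across all pods
sum(rate(http_requests_total[5m]))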
Sometimes there can be a sudden jump in load. You might be notified about this if you track your service's success rate. You could also detect this by comparing your current load against historical data. You want to be notified by a monitoring tool rule when a problem like this occurs.
In this scenario, you have two choices: You can either increase the CPU, using the same steps as you used for memory allocation, or you can increase the number of instances for your pods. The recommended method is to increase the number of instances. For example, increase the number of replicas to five:
$ kubectl scale deployment myDeployment --replicas=5
If you are already tracking the number of requests per second, then you have a rough idea of how many instances you need to handle extra requests.
If you use an autoscaler, you can automate this process.
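For example, a Horizontal Pod Autoscaler can be created imperatively with kubectl autoscale; the replica bounds and CPU target below are illustrative assumptions:
# Keep myDeployment between 2 and 10 replicas, targeting 80% average CPU utilization
$ kubectl autoscale deployment myDeployment --min=2 --max=10 --cpu-percent=80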
▶Roll back a faulty deployment
If a recent deployment causes issues, one of the fastest and easiest ways to remedy this is by using rollbacks. To see the deployment history:
$ kubectl rollout history deployment myDeployment
The output looks similar to this:
deployment.extensions/myDeployment
REVISION  CHANGE-CAUSE
1         <none>
2         <none>
3         <none>
4         <none>
The most recent deployment is the one with the highest number (4, in this example). Perform a rollback to the previous deployment (3):
$ kubectl rollout undo deployment myDeployment --to-revision=3
You will see new pods created.
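To confirm that the rollback has finished, you can watch the rollout status:
$ kubectl rollout status deployment myDeployment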
It's wise to set your deployment history to save a specific number of revisions. Do so by setting revisionHistoryLimit in the deployment spec. For example:
spec:
  replicas: 2
  revisionHistoryLimit: 20
This saves 20 recent deployment configurations.
▶Access a specific log
Assuming you have proper logging in place, you can look at your logs to identify the cause of trouble and when it occurred:
$ kubectl logs myPodName
However, if the pod has crashed and restarted, these logs come from the current instance rather than the one that failed. In that case, execute the following command to get the logs from the previous instance:
$ kubectl logs myPodName --previous
If you have multiple containers running inside the same pod, you must specify the container name to see its logs, as shown below. If you're using a logging service, it usually takes a while for the most recent logs to show up. In this case, it's often better to look at the logs by using the above commands than to rely on dashboards.
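For example, to fetch the logs of one container in a multi-container pod (the container name here is a hypothetical placeholder):
$ kubectl logs myPodName -c myContainerName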
▶SSH into your pod
If none of the above tips worked, it might make sense to open a shell inside the pod (the Kubernetes equivalent of Secure Shell, or SSH, access) to perform some basic checks. For instance, you can determine whether you can see the files you expect in the filesystem and whether the log files are present. You can also check whether you're able to make a connection request to some other service directly from this pod. To SSH into a pod:
$ kubectl exec -it myPodName -- sh
This lets you access the pod through a shell window.
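Once inside, a few quick checks can go a long way. The paths, service name, and port below are illustrative assumptions about your application, and wget may not be present in minimal images:
# Check that the expected files and log files are present
ls /app
cat /var/log/app.log

# Check that you can reach another service directly from this pod
wget -qO- http://myOtherService:8080/health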
▶Troubleshoot CrashLoopBackOff and ImagePullBackOff errors
You might have a monitoring tool like Grafana to monitor the number of instances in your service at any given point. Normally, you want a certain minimum number of instances running, depending on the size of the load; if the count drops below that minimum, an alert is triggered. When the problem is CrashLoopBackOff (your pod is starting, crashing, starting again, and then crashing again), your service doesn't return a 200 success code. If you're receiving errors, this can be an indication of a performance problem.
If a kubectl get pods command returns the following output, then you know you have a pod in a CrashLoopBackOff state:
$ kubectl get pods
NAME                     READY   STATUS             RESTARTS   AGE
myDeployment1-89234...   1/1     Running            1          17m
myDeployment1-46964...   0/1     CrashLoopBackOff   2          1m
There can be many reasons for this error. You may have to run kubectl describe pod myPodName to get to the root of it. Here's a summary of possible reasons and some tips:
Your Dockerfile doesn't have a command (CMD), so your pod immediately exits after starting. Kubernetes automatically restarts the pod when it's managed by a deployment or ReplicaSet.
You have used the same port for two containers inside the same pod. All containers inside the same pod have the same internet protocol (IP) address. They are not permitted to use the same ports. You need a separate port for every container within your pod.
Kubernetes can't pull the image you have specified, so the pod never starts successfully. This is an example of ImagePullBackOff. Run kubectl logs myPodName to get more information about what caused the error. If you don't see anything useful in your logs, consider deploying your application with a sleep command for a few minutes, as sketched below. This might help you see some logs before the application crashes. It also might help you figure out whether your application has a code bug or a mistake in its configuration. If it has a configuration problem, you may not see any code errors (because it fails before it reaches them).
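As a minimal sketch of the sleep trick, you can temporarily override the container's command in the Deployment's pod template; the shell, duration, container name, and image are illustrative assumptions:
containers:
- name: myContainer                    # hypothetical container name
  image: myImage:latest                # hypothetical image
  # Keep the container alive for 5 minutes so you can read logs or exec into it
  # before it would normally crash
  command: ["sh", "-c", "sleep 300"]
Remember to remove the override once you've finished debugging.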
These are some of the most common issues observed in Kubernetes, along with troubleshooting methods to find and fix them. I hope you find them helpful.