Azure AKS Troubleshooting Hands-On - Pod Failing Due to Insufficient Resources

Azure Learning Path for Cloud and DevOps Engineers

📝Introduction

In this hands-on lab, we will troubleshoot a real scenario in Azure Kubernetes Service (AKS) involving a common issue: a Pod failing to start due to insufficient resources.

Learning objectives:

In this module, you'll learn how to:

  • Identify the issue

  • Resolve the issue

📝Log in to the Azure Management Console

Log in using your credentials, and make sure you're using the right Region. In my case, I am using the region uksouth in my Cloud Playground Sandbox.

📌Note: You can also use the Azure CLI from VS Code or from your local Terminal.

More information on how to set it up is at the link.

📝Prerequisites:

  • Update to PowerShell 5.1, if needed.

  • Install .NET Framework 4.7.2 or later.

  • Visual Studio Code

  • Web Browser (Chrome, Edge)

  • Azure CLI installed

  • Azure subscription

  • Docker installed

📝Setting an Azure Storage Account to Load Bash or PowerShell

  • Click the Cloud Shell icon (>_) at the top of the page.

  • Click PowerShell.

  • Click Show Advanced Settings. Use the combo box under Cloud Shell region to select the Region. Under Resource Group and Storage account, enter a name for each (the storage account name must be globally unique). In the box under File Share, enter a name. Click Create storage (if you don't have any yet).

📝Create an AKS Cluster

  1. Create an AKS cluster using the az aks create command. First, store the cluster name in a variable named CLUSTERNAME; the command also expects the resource group name in a variable named RG.

      RG=<ResourceGroupName>
      CLUSTERNAME=<AKSClusterName>
      az aks create -n $CLUSTERNAME -g $RG --node-vm-size Standard_D2s_v3 --node-count 2 --generate-ssh-keys
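
The az aks create command needs an existing resource group. As a minimal sketch, assuming you want to create the group only when it is missing, a helper like the one below could be used. The function name ensure_resource_group is hypothetical and not part of the original lab.

```shell
# Hypothetical helper: create the resource group only if it does not exist yet.
# Arguments: resource group name, Azure region (for example uksouth).
ensure_resource_group() {
  # `az group exists` prints "true" or "false".
  if [ "$(az group exists --name "$1")" = "true" ]; then
    echo "Resource group $1 already exists"
  else
    az group create --name "$1" --location "$2"
  fi
}
```

For example, ensure_resource_group "$RG" uksouth could be run before az aks create.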
    

📝 Connect to AKS Cluster

Use the Azure Cloud Shell to check your AKS Cluster resources, by following the steps below:

  1. Go to the Azure Dashboard, click the Resource Group created for this lab, and find your AKS Cluster resource.

  2. On the Overview tab, click on Connect to your AKS Cluster.

  3. A new window will open. Open the Azure CLI and run the following commands:

az login
az account set --subscription <your-subscription-id>
az aks get-credentials -g <nameResourceGroup> -n <nameAKSCluster> --overwrite-existing

After that, you can run some kubectl commands to check the default AKS Cluster resources.

📝Deploy the Application to AKS

  1. Simulate the Issue:

    Deploy a Sample Application: Create a deployment YAML file (nginx-deployment.yaml) with resource requests that exceed the available resources on the node:

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: nginx-deployment
     spec:
       replicas: 1
       selector:
         matchLabels:
           app: nginx
       template:
         metadata:
           labels:
             app: nginx
         spec:
           containers:
           - name: nginx
             image: nginx:latest
             resources:
               requests:
                 memory: "2Gi"
                 cpu: "2"
               limits:
                 memory: "2Gi"
                 cpu: "2"
    
  2. Apply the Deployment:

     kubectl apply -f nginx-deployment.yaml
    

  3. Identify the Issue:

    • Check Pod Status:

      kubectl get pods

    • Describe the Pod:

      kubectl describe pod <pod-name>

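Look for events indicating why the pod is not starting. A pod that cannot be scheduled stays in the Pending state, and the Events section typically shows a FailedScheduling warning with messages like “Insufficient cpu” or “Insufficient memory”.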
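
To pull only the scheduling reason out of the describe output, a small filter can help. This is a minimal sketch, assuming the scheduler reports reasons in the usual "Insufficient <resource>" form. The function names are hypothetical, and the pattern only covers lowercase resource names such as cpu, memory, and ephemeral-storage.

```shell
# Hypothetical helper: read `kubectl describe pod` output on stdin and
# print each distinct "Insufficient <resource>" reason once.
insufficient_resources() {
  grep -oE 'Insufficient [a-z-]+' | sort -u
}

# Hypothetical helper: list only the Pods that are still Pending.
# Extra arguments (for example -n <namespace>) are passed through to kubectl.
pending_pods() {
  kubectl get pods --field-selector=status.phase=Pending "$@"
}
```

A possible usage is kubectl describe pod <pod-name> | insufficient_resources, which prints one line per missing resource.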

  4. Troubleshoot the Issue:

  • Check Node Resources:

      kubectl top nodes

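Check the current CPU and memory usage on the nodes. Keep in mind that the scheduler places Pods based on their resource requests and each node's allocatable capacity, not on live usage, so also compare the Pod's requests with the Allocatable values reported by kubectl describe nodes.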
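
The comparison between a Pod's CPU request and a node's allocatable CPU can be sketched with a small helper. This is only an illustration: the function names are hypothetical, and the 1900m value in the usage note is an example figure, not a value read from this lab's cluster.

```shell
# Hypothetical helper: convert a Kubernetes CPU quantity ("2", "0.5", "1900m")
# to millicores so two quantities can be compared as plain numbers.
cpu_to_millicores() {
  awk -v q="$1" 'BEGIN {
    if (q ~ /m$/) { sub(/m$/, "", q); print q + 0 }
    else          { print q * 1000 }
  }'
}

# Hypothetical helper: succeed when the request ($1) fits in the allocatable CPU ($2).
request_fits() {
  [ "$(cpu_to_millicores "$1")" -le "$(cpu_to_millicores "$2")" ]
}

# Print each node's allocatable CPU and memory (needs a connected cluster).
show_allocatable() {
  kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
}
```

For example, request_fits 2 1900m fails while request_fits 1 1900m succeeds, which mirrors why a request of 2 CPUs may not fit on a node that reserves part of its 2 vCPUs for system components.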

  • Check Resource Quotas (if any):

      kubectl get resourcequotas

  • Check Cluster Autoscaler: Ensure the cluster autoscaler is enabled and configured correctly:

      az aks show -g <nameResourceGroup> -n <nameAKSCluster> --query "agentPoolProfiles[].enableAutoScaling"
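
If the autoscaler turns out to be disabled, it can be switched on with az aks update. The wrapper below is a minimal sketch, assuming a cluster with a single node pool like the one created in this lab; the function name enable_autoscaler and the node counts in the usage note are hypothetical.

```shell
# Hypothetical wrapper: enable the cluster autoscaler on an existing cluster.
# Arguments: resource group, cluster name, minimum nodes, maximum nodes.
enable_autoscaler() {
  az aks update \
    --resource-group "$1" \
    --name "$2" \
    --enable-cluster-autoscaler \
    --min-count "$3" \
    --max-count "$4"
}
```

A possible call is enable_autoscaler <nameResourceGroup> <nameAKSCluster> 2 4. Clusters with several node pools are usually configured per pool instead, which this sketch does not cover.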
    

  5. Resolve the Issue:

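    • Scale Up the Cluster: If the cluster autoscaler is not enabled or cannot add enough capacity, you can scale the cluster manually. Adding nodes only helps when the Pod's requests fit within a single node's allocatable capacity; otherwise, choose a larger VM size or lower the requests. To scale manually:

        az aks scale -g <nameResourceGroup> -n <nameAKSCluster> --node-count <new-node-count>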
      
    • Adjust Resource Requests: Modify the deployment YAML file to request fewer resources:

        resources:
          requests:
            memory: "1Gi"
            cpu: "1"
          limits:
            memory: "1Gi"
            cpu: "1"
      
    • Reapply the Deployment:

        kubectl apply -f nginx-deployment.yaml
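
As an alternative to editing the YAML file, the requests and limits can be changed directly on the live Deployment with kubectl set resources. This is a minimal sketch; the wrapper name set_deployment_resources is hypothetical, and it applies the same values to every container in the Deployment.

```shell
# Hypothetical wrapper: set identical requests and limits on every container
# of a Deployment. Arguments: deployment name, CPU quantity, memory quantity.
set_deployment_resources() {
  kubectl set resources deployment "$1" \
    --requests="cpu=$2,memory=$3" \
    --limits="cpu=$2,memory=$3"
}
```

For example, set_deployment_resources nginx-deployment 1 1Gi. Note that this changes only the object in the cluster, so the YAML file should still be updated to keep both in sync.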
      
  6. Verify the Resolution:

  • Check Pod Status Again:

      kubectl get pods

  • Describe the Pod:

      kubectl describe pod <pod-name>
    

Ensure there are no error messages and the pod is running.

  • Check Node Resources:

      kubectl top nodes
    

Verify that the nodes have sufficient resources and the pod is running smoothly.
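
Instead of polling kubectl get pods by hand, the rollout can be awaited with kubectl rollout status. The wrapper below is a minimal sketch; the function name wait_for_rollout and the 120s default timeout are assumptions made for illustration.

```shell
# Hypothetical wrapper: block until the Deployment finishes rolling out,
# or fail once the timeout (default 120s) expires.
# Arguments: deployment name, optional timeout such as 60s or 5m.
wait_for_rollout() {
  kubectl rollout status "deployment/$1" --timeout="${2:-120s}"
}
```

For example, wait_for_rollout nginx-deployment returns a non-zero exit code if the Pod is still not ready when the timeout is reached.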

📌Note - At the end of each hands-on Lab, always clean up all resources previously created to avoid being charged.
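
One way to clean up is to delete the whole resource group used for the lab. This is a minimal sketch, assuming everything created in this lab lives in one resource group; the function name cleanup_lab is hypothetical.

```shell
# Hypothetical helper: delete the lab's resource group and everything in it.
# --yes skips the confirmation prompt; --no-wait returns without waiting
# for the deletion to finish. Argument: resource group name.
cleanup_lab() {
  az group delete --name "$1" --yes --no-wait
}
```

For example, cleanup_lab <nameResourceGroup>. Double-check the name first, because the deletion cannot be undone.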

Congratulations! You have completed this hands-on lab covering the basics of troubleshooting an AKS Pod that fails to start due to insufficient resources.

Thank you for reading. I hope you understood and learned something helpful from my blog.

Please follow me on Cloud&DevOpsLearn and LinkedIn, franciscojblsouza