Azure AKS Troubleshooting Hands-On - Node Not Ready Due to Disk Pressure

Azure Learning Path for Cloud and DevOps Engineers

📝Introduction

In this hands-on lab, we will troubleshoot a common real-world issue in Azure Kubernetes Service (AKS): a node that becomes Not Ready due to disk pressure.

Learning objectives:

In this module, you'll learn how to:

  • Identify the issue

  • Resolve the issue

📝Log in to the Azure Management Console

Log in with your credentials and make sure you're using the right region. In my case, I am using the uksouth region in my Cloud Playground Sandbox.

📌Note: You can also run the Azure CLI from VS Code or from your local terminal.

More information on how to set it up is at the link.

📝Prerequisites:

  • Update to PowerShell 5.1, if needed.

  • Install .NET Framework 4.7.2 or later.

  • Visual Studio Code

  • Web Browser (Chrome, Edge)

  • Azure CLI installed

  • Azure subscription

  • Docker installed

📝Setting Up an Azure Storage Account to Load Bash or PowerShell

  • Click the Cloud Shell icon (>_) at the top of the page.

  • Click PowerShell.

  • Click Show Advanced Settings. Use the combo box under Cloud Shell region to select the region. Under Resource Group and Storage account, enter a name for each (the storage account name must be globally unique). In the box under File Share, enter a name. Click Create storage (if you don't have one yet).

📝Create an AKS Cluster

  1. Create an AKS cluster using the az aks create command. Before running it, store the cluster name in a variable named CLUSTERNAME and the resource group name in a variable named RG.

      CLUSTERNAME=<AKSClusterName>
      RG=<ResourceGroupName>
      az aks create -n $CLUSTERNAME -g $RG --node-vm-size Standard_D2s_v3 --node-count 2 --generate-ssh-keys
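
The az aks create command expects $RG to name a resource group that already exists. Here is a minimal sketch for creating one, assuming the hypothetical name aks-diskpressure-lab-rg and the uksouth region used in this lab. Both can be overridden through environment variables, and the function name is my own.

```shell
# Assumed defaults; the resource group name is hypothetical, change it to your own.
RG="${RG:-aks-diskpressure-lab-rg}"
LOCATION="${LOCATION:-uksouth}"

# Creates the resource group that az aks create will deploy into.
create_lab_resource_group() {
  az group create --name "$RG" --location "$LOCATION" --output table
}

# Usage (requires a logged-in Azure CLI):
#   create_lab_resource_group
```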
    

📝 Connect to AKS Cluster

Use the Azure Portal to check your AKS Cluster resources, by following the steps below:

  1. Go to the Azure dashboard, open the resource group created for this lab, and find your AKS cluster resource.

  2. On the Overview tab, click Connect to connect to your AKS cluster.

  3. A new window opens. Open the Azure CLI and run the following commands:

az login
az account set --subscription <your-subscription-id>
az aks get-credentials -g <nameResourceGroup> -n <nameAKSCluster> --overwrite-existing

After that, you can run a few kubectl commands to check the default AKS cluster resources.
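
As a starting point, the commands below list the cluster endpoints, the nodes, and the system pods. This is only a sketch: the function name check_cluster_basics is my own, and the output depends on your cluster.

```shell
# Prints a quick overview of a freshly created cluster.
check_cluster_basics() {
  kubectl cluster-info                 # control plane and core service endpoints
  kubectl get nodes -o wide            # node status, version, and IPs
  kubectl get pods -n kube-system      # system pods running in the cluster
}

# Usage:
#   check_cluster_basics
```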

📝Simulate the Issue:

  • Deploy a Sample Application: Create a deployment YAML file (nginx-deployment.yaml) with a large number of replicas:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: nginx-deployment
      spec:
        replicas: 50
        selector:
          matchLabels:
            app: nginx
        template:
          metadata:
            labels:
              app: nginx
          spec:
            containers:
            - name: nginx
              image: nginx:latest
    
  • Apply the Deployment:

      kubectl apply -f nginx-deployment.yaml
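
Once the deployment is applied, it helps to see how the replicas are spread across the nodes, since every pod adds container layers and logs to the disk of the node it runs on. Here is a small sketch, assuming the app=nginx label from the manifest; the helper name pods_per_node is hypothetical.

```shell
# Counts the pods carrying the given label on each node.
# Pods that are not scheduled yet show up under "<none>".
pods_per_node() {
  selector="${1:-app=nginx}"
  kubectl get pods -l "$selector" --no-headers \
    -o custom-columns=NODE:.spec.nodeName |
    awk '{count[$1]++} END {for (n in count) print n, count[n]}' |
    LC_ALL=C sort
}

# Usage:
#   kubectl rollout status deployment/nginx-deployment --timeout=180s
#   pods_per_node app=nginx
```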
    

📝Identify the Issue:

  • Check Node Status:

      kubectl get nodes
    

  • Describe the Node:

      kubectl describe node <node-name>
    

    Look for the “DiskPressure” condition. If your cluster has enough resources available, as mine did, you may not see it.

    If you do see it, follow the next steps to troubleshoot the issue.
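
Reading the full describe output for every node gets tedious on larger clusters. As a sketch, the DiskPressure condition can be pulled out for all nodes at once with a JSONPath query. The two function names are hypothetical.

```shell
# Lists every node with the status of its DiskPressure condition
# (True, False, or Unknown), separated by a tab.
disk_pressure_status() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'
}

# Prints only the nodes that currently report DiskPressure=True.
nodes_under_disk_pressure() {
  disk_pressure_status | awk -F '\t' '$2 == "True" {print $1}'
}

# Usage:
#   disk_pressure_status
#   nodes_under_disk_pressure
```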

📝Troubleshoot the Issue:

  • Check Disk Usage: Start a debug pod on the node and check disk usage. The node's root filesystem is mounted at /host inside the debug container:

      kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0
      chroot /host
      df -Th
    

  • Check Pod Resource Usage: kubectl top reports only CPU and memory, not disk, but it helps identify the busiest pods:

      kubectl top pods --sort-by=memory
    

  • Check Logs: Check logs for any errors related to disk usage:

      kubectl logs <pod-name>
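
Two small helpers can speed up these checks. The first reads df -P output and flags filesystems at or above a usage threshold. The 85% default is my assumption, picked to warn before the upstream kubelet default hard-eviction threshold (nodefs.available<10%); AKS may configure different values. The second lists pods that were evicted. Both function names are hypothetical.

```shell
# Reads `df -P` output on stdin and prints the mount points whose usage
# is at or above the given percentage (default 85, an assumed value).
flag_full_filesystems() {
  threshold="${1:-85}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5; sub(/%/, "", use)          # "91%" -> "91"
    if (use + 0 >= t + 0) print $6, $5   # mount point and usage
  }'
}

# Lists pods in all namespaces whose status reason is "Evicted".
list_evicted_pods() {
  kubectl get pods -A --no-headers \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,REASON:.status.reason |
    awk '$3 == "Evicted"'
}

# Usage, in a shell where the function is defined (for example the debug shell):
#   df -P | flag_full_filesystems 85
# Usage, from your workstation:
#   list_evicted_pods
```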
    

📝Resolve the Issue:

  • Clean Up Disk Space: Delete unnecessary files or logs on the node.

  • Scale Down the Deployment: Reduce the number of replicas in the deployment YAML file:

      spec:
        replicas: 10
    
  • Reapply the Deployment:

      kubectl apply -f nginx-deployment.yaml
    
  • Check Node Status Again:

      kubectl get nodes
    
  • Describe the Node:

      kubectl describe node <node-name>
    

    Ensure the “DiskPressure” condition is resolved.
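
As an alternative to editing the YAML, the deployment can be scaled imperatively, and kubectl wait can block until the node reports that disk pressure has cleared. This is a sketch that assumes the deployment is named nginx-deployment as above. The helper name and the 300-second timeout are my own choices.

```shell
# Scales the deployment down, then waits for DiskPressure to clear on a node.
relieve_disk_pressure() {
  node="$1"
  replicas="${2:-10}"
  kubectl scale deployment nginx-deployment --replicas="$replicas" &&
    kubectl wait --for=condition=DiskPressure=false "node/$node" --timeout=300s
}

# Usage:
#   relieve_disk_pressure <node-name> 10
#
# To reclaim space on the node itself, unused images can be pruned from
# the debug shell (after chroot /host), assuming crictl is present:
#   crictl rmi --prune
```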

📌Note: At the end of each hands-on lab, always clean up the resources you created to avoid being charged.
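
If everything for this lab lives in a single resource group, deleting that group is usually the quickest cleanup. This is a sketch only: the function name is hypothetical, and you should double-check the group name before running it, because the deletion removes everything inside the group.

```shell
# Deletes the lab resource group and everything in it, including the AKS cluster.
# --yes skips the confirmation prompt; --no-wait returns without waiting for completion.
delete_lab_resources() {
  rg="${1:?usage: delete_lab_resources <resource-group>}"
  az group delete --name "$rg" --yes --no-wait
}

# Usage:
#   delete_lab_resources "$RG"
```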

Congratulations! You have completed this hands-on lab covering the basics of troubleshooting an AKS node that is Not Ready due to disk pressure.

Thank you for reading. I hope you understood and learned something helpful from my blog.

Please follow me on Cloud&DevOpsLearn and LinkedIn, franciscojblsouza