Azure Kubernetes Service Tutorial: Production Best Practices

This Azure Kubernetes Service tutorial guides experienced backend engineers through AKS deployment and production readiness.

Murat Doğan

11 min read

Most teams adopt Kubernetes to achieve robust scalability and portability. But simply deploying Azure Kubernetes Service (AKS) without a production-grade strategy commonly leads to unforeseen operational overhead, escalating infrastructure costs, and significant security vulnerabilities at scale. A deliberate, optimized approach is critical for maintaining performance and resource efficiency.


TL;DR

* Unoptimized AKS deployments often incur higher costs and expose critical security gaps in production environments.

* Strategic node pool configuration, including Spot instances, can significantly reduce operational expenditure while maintaining workload availability.

* Leverage Azure CNI and Network Policies for granular control over pod-to-pod and pod-to-external communication, enhancing security posture.

* Integrate Azure Policy and Microsoft Defender for Cloud to enforce compliance and provide continuous threat protection.

* Proactive monitoring with Azure Monitor for Containers is essential for identifying performance bottlenecks and optimizing resource allocation.


The Problem


In an ideal world, containerizing applications and deploying them to Kubernetes provides immediate scalability and resilience. However, many organizations find themselves struggling with AKS deployments that, while functional, fail to meet production demands efficiently. Teams commonly report 30-50% higher-than-expected cloud spend due to inefficient resource provisioning, or encounter critical downtime from unexpected resource starvation because scaling mechanisms were not configured correctly. These issues often stem from overlooking critical aspects like network topology, security hardening, or right-sizing compute resources from the outset. Without a comprehensive strategy, an Azure Kubernetes Service cluster quickly evolves from a powerful orchestrator into a complex, costly operational burden.


How It Works


Understanding AKS Architecture for Production Workloads


Azure Kubernetes Service simplifies the deployment and management of Kubernetes clusters in Azure by offering Kubernetes as a managed service. Azure operates the control-plane components — the API server, etcd, the scheduler, and the controller manager — leaving you to focus on your applications and node management.


The core of an AKS deployment involves node pools, which are groups of virtual machines that host your containerized applications. Distinguishing between system node pools and user node pools is crucial for production. System node pools are dedicated to hosting critical system pods (like CoreDNS, kube-proxy) and should typically use General Purpose VMs (e.g., D-series) with adequate CPU and memory. User node pools, on the other hand, run your actual application workloads and can be optimized with different VM sizes, types (e.g., compute-optimized, memory-optimized), or even Spot VMs to match specific workload requirements and cost profiles.


Interactions between AKS and Azure services are handled through Managed Identities, a more secure alternative to service principals. These identities allow your cluster and workloads to authenticate with Azure APIs (e.g., Azure Container Registry, Azure Key Vault) without storing credentials directly. For networking, AKS offers two primary models: Kubenet and Azure CNI. While Kubenet is simpler for basic setups, Azure CNI is the de facto choice for production environments requiring advanced networking capabilities like Virtual Network integration, network policies, and direct IP assignment to pods (Microsoft Docs: Azure CNI Overlay).


Advanced Scaling and Optimization with AKS


Achieving optimal performance and cost efficiency in AKS requires a multi-faceted approach to scaling. Kubernetes provides native mechanisms like Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA), which interact directly with the cluster autoscaler and node pools.


  • Horizontal Pod Autoscaler (HPA): This scales the number of pod replicas based on observed CPU utilization or other custom metrics. It is effective for stateless applications where adding more instances linearly improves throughput. When HPA scales out and existing nodes lack capacity, the resulting unschedulable pods trigger the Cluster Autoscaler (Microsoft Docs: Cluster Autoscaler).

  • Vertical Pod Autoscaler (VPA): VPA recommends or automatically adjusts resource requests (CPU, memory) for pods. This is critical for optimizing resource allocation, preventing resource starvation, and reducing waste. VPA helps right-size individual pods, ensuring they get enough resources without over-provisioning.

  • Cluster Autoscaler (CA): The CA dynamically adjusts the number of nodes in your node pools. When HPA scales up pods and there are no available nodes with sufficient capacity, CA provisions new nodes. Conversely, when nodes become underutilized, CA scales them down.


When VPA and HPA are used simultaneously on the same pods, it's important to understand their interaction. By default, VPA can conflict with HPA if HPA is scaling based on CPU or memory, as VPA might change the very metrics HPA depends on. A common pattern is to use VPA in "recommender" mode for CPU/memory requests, allowing HPA to scale based on custom metrics like queue length or HTTP requests per second. For example, VPA would ensure each pod has an optimal CPU/memory allocation, while HPA ensures there are enough optimally sized pods to handle traffic surges. This combination maximizes resource efficiency without conflicting scaling objectives.
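The recommender-mode pattern described above can be sketched with two manifests. This is a minimal illustration, assuming a hypothetical deployment named `web-api` and that the VPA custom resource definitions are installed in the cluster (they are not part of a default AKS deployment):

```yaml
# VPA in recommendation-only mode: it surfaces right-sizing suggestions
# without mutating running pods, so it cannot fight HPA's replica scaling.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api          # hypothetical deployment name
  updatePolicy:
    updateMode: "Off"      # recommender mode: observe and suggest only
---
# HPA scaling the same deployment. CPU is shown for simplicity; in the
# pattern above you would substitute a custom metric (e.g., queue length)
# served by a metrics adapter to avoid overlap with VPA's CPU/memory focus.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # illustrative threshold
```

Reading the VPA's recommendations (`kubectl describe vpa web-api-vpa`) and folding them back into the deployment's resource requests keeps each pod right-sized while HPA handles traffic surges.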


Another powerful optimization is leveraging Azure Spot Virtual Machines for user node pools. Spot VMs offer significant cost savings (often 60-90% off standard prices) by utilizing Azure's surplus capacity. However, they can be preempted with short notice if Azure needs the capacity back. For fault-tolerant, interruptible workloads (e.g., batch processing, dev/test environments, or microservices designed with graceful shutdown), Spot VMs are highly effective. For critical, stateful, or long-running workloads, standard VMs in dedicated user node pools remain the safer choice. The Cluster Autoscaler can manage Spot node pools effectively, scaling them up and down based on demand and preemption events (Microsoft Docs: Multiple Node Pools).
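Pods must explicitly opt in to Spot capacity: AKS taints Spot node pools with `kubernetes.azure.com/scalesetpriority=spot:NoSchedule` and labels the nodes with the same key. A minimal pod-template fragment for an interruptible workload, as a sketch:

```yaml
# Pod template fragment for an interruptible batch worker on Spot nodes.
spec:
  nodeSelector:
    kubernetes.azure.com/scalesetpriority: spot   # label AKS applies to Spot nodes
  tolerations:
    # Required: without this toleration the scheduler will never place
    # the pod on a tainted Spot node.
    - key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
      effect: NoSchedule
  terminationGracePeriodSeconds: 30   # budget for graceful shutdown on preemption
```

Pairing the toleration with a short, well-tested shutdown path is what makes the "fault-tolerant, interruptible" qualifier real in practice.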


AKS Networking and Security Hardening


Robust networking and security are non-negotiable for production AKS clusters.


  • Azure CNI (Container Network Interface): With Azure CNI, every pod receives an IP address directly from the Azure Virtual Network subnet. This enables direct routing, compatibility with Network Security Groups (NSGs), and Azure Firewall for fine-grained traffic control. It integrates seamlessly with existing VNet infrastructure and allows your pods to communicate directly with other Azure services as if they were VMs on the same network.

  • Kubernetes Network Policies: Built on top of Azure CNI, Network Policies define how groups of pods are allowed to communicate with each other and with external endpoints. They function as a distributed firewall at the pod level, enabling a "zero-trust" security model where only explicitly allowed traffic can flow. For instance, you can restrict a frontend service to only communicate with its backend service and deny all other ingress/egress.

  • Azure Policy for AKS: Azure Policy can enforce organizational standards and assess compliance at scale. For AKS, policies can dictate allowed Kubernetes versions, enforce specific resource requests/limits, or ensure certain security configurations (e.g., disabling anonymous access, using only approved images) (Microsoft Docs: Azure Policy for AKS).

  • Microsoft Defender for Cloud (formerly Azure Security Center): This provides advanced threat protection for your AKS clusters. It continuously monitors your cluster's configuration and runtime activities for vulnerabilities and suspicious behaviors, offering recommendations and actionable alerts. Integrating Defender for Cloud is a critical step in maintaining a strong security posture (Microsoft Docs: Defender for Cloud).
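The zero-trust pattern from the Network Policies bullet above can be sketched as a default-deny policy plus an explicit allow rule. Namespace, labels, and port here are hypothetical placeholders:

```yaml
# Deny all ingress to every pod in the namespace by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: shop            # hypothetical namespace
spec:
  podSelector: {}            # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
---
# Then explicitly allow only the frontend to reach the backend.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: shop
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080         # hypothetical backend service port
```

Because the deny-all policy matches every pod, each new service must declare its allowed peers explicitly — exactly the posture described above.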


Step-by-Step Implementation: Deploying a Production-Ready AKS Cluster


This section guides you through deploying an AKS cluster with production considerations for scaling, networking, and security. Ensure you have the Azure CLI and `kubectl` installed and configured. Resource names in this walkthrough use a 2026 suffix.


1. Prerequisites and Resource Group Setup


First, define your environment variables and create an Azure Resource Group. This encapsulates all your AKS-related resources.


  • Define environment variables and create a resource group

$ RESOURCE_GROUP="aks-production-rg-2026"
$ LOCATION="eastus"
$ CLUSTER_NAME="backendstack-aks-2026"
$ AKSVNETNAME="aks-vnet-2026"
$ AKSSUBNETNAME="aks-subnet-2026"
$ ACR_NAME="backendstackacr2026" # Ensure this ACR exists or create it.

$ az group create --name $RESOURCE_GROUP --location $LOCATION

  • Expected Output

{
  "id": "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/aks-production-rg-2026",
  "location": "eastus",
  "name": "aks-production-rg-2026",
  "properties": {
    "provisioningState": "Succeeded"
  },
  "tags": null,
  "type": "Microsoft.Resources/resourceGroups"
}


2. Create a Virtual Network and Subnet for Azure CNI


For production, Azure CNI is recommended. This requires pre-configuring a VNet and subnet.


  • Create a VNet and subnet for AKS with Azure CNI

$ az network vnet create \
    --resource-group $RESOURCE_GROUP \
    --name $AKSVNETNAME \
    --address-prefixes 10.0.0.0/8 \
    --location $LOCATION

$ az network vnet subnet create \
    --resource-group $RESOURCE_GROUP \
    --vnet-name $AKSVNETNAME \
    --name $AKSSUBNETNAME \
    --address-prefixes 10.240.0.0/16

  • Expected Output (partial for brevity)

{
  "id": "/subscriptions/.../resourceGroups/aks-production-rg-2026/providers/Microsoft.Network/virtualNetworks/aks-vnet-2026",
  "location": "eastus",
  "name": "aks-vnet-2026",
  ...
}
{
  "id": "/subscriptions/.../resourceGroups/aks-production-rg-2026/providers/Microsoft.Network/virtualNetworks/aks-vnet-2026/subnets/aks-subnet-2026",
  "name": "aks-subnet-2026",
  "properties": {
    "addressPrefix": "10.240.0.0/16",
    ...
  }
}


3. Deploy the AKS Cluster with Production-Ready Configurations


This step includes enabling Azure CNI, Managed Identities, RBAC, and auto-scaling. We'll start with a system node pool and add a user node pool later.


  • Retrieve the subnet ID for AKS deployment

$ AKSSUBNETID=$(az network vnet subnet show \
    --resource-group $RESOURCE_GROUP \
    --vnet-name $AKSVNETNAME \
    --name $AKSSUBNETNAME \
    --query id -o tsv)


  • Create AKS cluster with Azure CNI, Managed Identity, Azure RBAC, and Cluster Autoscaler

$ az aks create \
    --resource-group $RESOURCE_GROUP \
    --name $CLUSTER_NAME \
    --nodepool-name systempool \
    --node-count 1 \
    --node-vm-size Standard_DS2_v2 \
    --network-plugin azure \
    --vnet-subnet-id $AKSSUBNETID \
    --enable-managed-identity \
    --kubernetes-version 1.28.5 \
    --load-balancer-sku standard \
    --enable-addons azure-policy,monitoring \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3 \
    --location $LOCATION \
    --attach-acr $ACR_NAME # Attach an existing ACR for image pulling. Ensure ACR exists.
    # Kubernetes RBAC is enabled by default on new AKS clusters.

  • Common mistake: Not specifying `--network-plugin azure` or not providing `--vnet-subnet-id`. This defaults to Kubenet, which limits networking flexibility and advanced policy enforcement in production.


  • Expected Output (truncated for brevity, creation takes time)

{
  "fqdn": "backendstack-aks-2026-aks-eastus.hcp.eastus.azmk8s.io",
  "id": "/subscriptions/.../resourceGroups/aks-production-rg-2026/providers/Microsoft.ContainerService/managedClusters/backendstack-aks-2026",
  "location": "eastus",
  "name": "backendstack-aks-2026",
  "nodeResourceGroup": "MC_aks-production-rg-2026_backendstack-aks-2026_eastus",
  "resourceGroup": "aks-production-rg-2026",
  "type": "Microsoft.ContainerService/managedClusters",
  "kubernetesVersion": "1.28.5",
  "networkProfile": {
    "networkPlugin": "azure",
    ...
  },
  "enableRbac": true,
  "identity": {
    "type": "SystemAssigned",
    ...
  },
  "aadProfile": null,
  ...
}


4. Configure `kubectl` Access


  • Get cluster credentials

$ az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME --overwrite-existing

  • Verify connectivity

$ kubectl get nodes

  • Expected Output

NAME                                 STATUS   ROLES   AGE    VERSION
aks-systempool-12345678-vmss000000   Ready    agent   5m5s   v1.28.5


5. Add a User Node Pool with Spot VMs


For non-critical or batch workloads, adding a Spot VM node pool is an effective way to optimize costs.


  • Add a user node pool with Spot VM configuration

$ az aks nodepool add \
    --resource-group $RESOURCE_GROUP \
    --cluster-name $CLUSTER_NAME \
    --name spotpool \
    --node-vm-size Standard_D2s_v3 \
    --priority Spot \
    --eviction-policy Delete \
    --spot-max-price -1 \
    --node-count 1 \
    --enable-cluster-autoscaler \
    --min-count 0 \
    --max-count 5 \
    --labels workload=spot \
    --tags env=devtest costcenter=batch_2026

  • Common mistake: Using Spot VMs for stateful, critical applications that cannot tolerate interruptions. Ensure your applications can gracefully handle preemption. `--spot-max-price -1` means you'll pay up to the standard VM price.


  • Expected Output (partial)

{
  "name": "spotpool",
  "orchestratorVersion": "1.28.5",
  "osDiskSizeGb": 128,
  "osType": "Linux",
  "priority": "Spot",
  "provisioningState": "Succeeded",
  "resourceGroup": "aks-production-rg-2026",
  ...
}

  • Verify node pools

$ az aks nodepool list --resource-group $RESOURCE_GROUP --cluster-name $CLUSTER_NAME -o table

  • Expected Output

Name        OsDiskSizeGB    Count    VmSize             MaxPods    ProvisioningState    Mode
----------  --------------  -------  -----------------  ---------  -------------------  ------
systempool  128             1        Standard_DS2_v2    30         Succeeded            System
spotpool    128             1        Standard_D2s_v3    30         Succeeded            User


6. Deploy a Sample Application with HPA and VPA (Recommender Mode)


This example shows a deployment that benefits from VPA recommendations and HPA for horizontal scaling.


  • Deploy a sample Nginx application to the `spotpool`

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeSelector:
        workload: spot # Target the spot node pool
      tolerations:
        # AKS taints Spot node pools; without this toleration the pods
        # would never be scheduled onto the spotpool nodes.
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      containers:
        - name: nginx
          image: nginx:1.21.6 # Using a specific stable version
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "200m"
              memory: "256Mi"
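To complete the HPA/VPA pairing this step describes, a minimal HPA for `nginx-deployment` could look like the following (the thresholds are illustrative, not tuned values); a VPA with `updateMode: "Off"` can be layered on the same deployment to surface right-sizing recommendations without interfering:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment   # targets the deployment defined above
  minReplicas: 1
  maxReplicas: 5              # bounded by the spotpool's autoscaler max
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # illustrative threshold
```

Apply both manifests with `kubectl apply -f`, then watch scaling decisions with `kubectl get hpa nginx-hpa --watch`.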

WRITTEN BY

Murat Doğan

Microsoft Azure MVP, 6 years in the Microsoft ecosystem. Computer Engineering graduate, Yıldız Technical University. Contributes to Azure, AKS and DevOps content.
