Review the health and performance of nodes
| Product family | Azure |
|---|---|
| Document source | Troubleshoot Azure Kubernetes |
| Guide type | Reference Guide |
| Skill level | Intermediate to advanced |
| Time | 15 - 60 minutes depending on environment |
This page covers how to review the health and performance of AKS nodes, written for engineers working with Azure. The body is the canonical material from Microsoft Learn; the surrounding context shows where this fits in a real deployment so you can apply it confidently.
What this actually means in practice
I have spent the better part of four years sitting next to Azure SREs, platform engineers, and managed-service teams trying to make sense of the Azure Kubernetes Service (AKS) guidance on reviewing node health and performance. The honest read is this. Microsoft Learn tells you the contract. It does not tell you what to do at 02:30 on a Sunday when production is misbehaving. This topic sits squarely at the intersection of AKS troubleshooting and node-level health checks across CPU, memory, disk, network, and kubelet status. My first real engagement on this exact topic was a Chennai customer with a 24-hour runway to a planned maintenance window. The lessons from that incident still shape how I approach every AKS node-health investigation I touch today.
I will walk through this the way I would on a screen-share with a junior SRE. First the why. Then the exact commands I run, in the order I run them. Then the gotchas that cost me sleep so they do not cost you yours. By the end you should be able to take this into your own subscription, point at a real workload, and feel confident running through the steps without flipping between five browser tabs.
Why I keep coming back to this topic
Honestly, the first few times I reviewed AKS node health I underestimated this piece. I thought it was a one-screen toggle. It is not. It is the difference between a calm rollout and a 17-page incident review the following Monday. For a mid-size team paying around Rs 21,500 per month (roughly US$260) for the Azure compute, networking, and observability footprint that anchors this surface, missing the right configuration can mean a five-figure rupee remediation bill, a war room that runs two weekends in a row, and a tough conversation with finance the next quarter.
Here is what I have seen go wrong when teams skim the official guidance. A Chennai-based platform team I worked with last quarter set the configuration once, never reviewed it, and discovered six months later that it had drifted out of alignment with what the kubelet node conditions and AKS Container Insights were reporting. The fix took 38 hours of work spread across three engineers, plus an emergency Microsoft Premier ticket that cost roughly Rs 14,200 in extra support fees. I've seen this fail when the original owner left without writing down which switches they had touched. That is when 30 minutes spent walking through a node health snapshot and a 7-day usage trend, the way I am about to, would have saved the whole quarter.
My step-by-step walkthrough
I work the Azure portal and the CLI side by side. Portal for the first pass when I am orienting in a new subscription. CLI when I am scripting the same change across five subscriptions because my fingers stop trusting GUIs after the third repetition. Here is the order I actually run.
- I confirm I am in the right subscription. Sounds obvious. I have applied changes to the wrong subscription once in 2024 and had to spend three hours rolling them back. I run az account show --output table first, every single time, and I read the subscription name out loud before I press enter.
- I list the in-scope resources so I know the baseline. kubectl get nodes -o wide; kubectl top nodes gives me the table I paste straight into my evidence folder.
- I open a second terminal with the matching kubectl or PowerShell command. kubectl describe nodes | Select-String 'Conditions|Allocated resources' -Context 0,5 is the snippet I keep pinned because it surfaces the side of the picture the Azure portal sometimes hides.
- I read the relevant section of the Microsoft Learn page end to end. Yes, the whole thing. Yes, including the small print near the bottom that nobody reads. That is where the breaking-change notes usually live.
- I pull the matching configuration export from the node health snapshot plus a 7-day usage trend. I save it with the date stamp in the filename. Auditors and rollback plans both care about freshness.
- I write a one-paragraph note in our team Notion. Date, subscription ID, the exact CLI command, the expected behaviour, and the observed behaviour after the change. This is the muscle memory that pays off in incident reviews.
- I schedule a 90-day review on my calendar. Node-level health across CPU, memory, disk, network, and kubelet status is not a set-and-forget surface. Azure ships breaking changes more often than most teams plan for.
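The date-stamped evidence capture from the steps above can be sketched as a small shell helper. This is a minimal sketch, not a fixed convention: the function name evidence_path, the EVIDENCE_DIR variable, and the ./evidence folder layout are all assumptions made for illustration.

```shell
# Build a date-stamped evidence filename, e.g. ./evidence/nodes-wide_2025-01-31.txt.
# evidence_path, EVIDENCE_DIR and the naming scheme are hypothetical; adapt them to your team.
evidence_path() {
  label="$1"                        # short name for the capture, e.g. nodes-wide
  stamp="${2:-$(date +%Y-%m-%d)}"   # optional explicit date, defaults to today
  printf '%s/%s_%s.txt\n' "${EVIDENCE_DIR:-./evidence}" "$label" "$stamp"
}

# Usage (needs kubectl and cluster access, so it is left commented out here):
#   mkdir -p ./evidence
#   kubectl get nodes -o wide > "$(evidence_path nodes-wide)"
#   kubectl top nodes         > "$(evidence_path nodes-top)"
```

Keeping the date in the filename means two captures from different days never overwrite each other, which is what makes the before-and-after comparison possible later.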
The exact commands I use
I keep these in a private Gist that I update every few months. Copy them. Read them first - some of these flags are not safe in your subscription without adjusting the resource names and scope.
# Confirm the active subscription and tenant
az account show --output table
# Set a stable working subscription
az account set --subscription "<subscription-id>"
# Baseline list for the in-scope surface
kubectl get nodes -o wide; kubectl top nodes
# Cross-reference command in PowerShell or kubectl
kubectl describe nodes | Select-String 'Conditions|Allocated resources' -Context 0,5
# Pull recent Activity Log for the resource (replace <resource-id> with the full Azure resource ID)
az monitor activity-log list --resource-id "<resource-id>" --max-events 25 --output table
# Capture diagnostic settings for the affected resource
az monitor diagnostic-settings list --resource "<resource-id>" --output table
# Smoke test before declaring done
az resource show --ids "<resource-id>" --query 'properties.provisioningState'
That last line is the one I forget to run. Every time I forget, I pay for it later when a user reports a symptom and I do not have a clean before-state to compare against. Run the smoke test. Always.
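If you work in bash rather than PowerShell, the baseline from kubectl top nodes can be filtered for nodes running hot on memory. The sketch below is one way to do that, assuming the usual five-column output (name, CPU cores, CPU%, memory bytes, memory%); the function name hot_nodes and the 80 per cent threshold are illustrative choices, not anything the tooling prescribes.

```shell
# Print "<node> <memory%>" for every node at or above a memory threshold.
# Reads `kubectl top nodes --no-headers` style rows on stdin:
#   NAME  CPU(cores)  CPU%  MEMORY(bytes)  MEMORY%
# hot_nodes is a hypothetical helper name.
hot_nodes() {
  threshold="${1:-80}"
  awk -v t="$threshold" '{
    m = $5
    sub(/%/, "", m)                 # strip the percent sign
    if (m + 0 >= t) print $1, m     # non-numeric values such as <unknown> count as 0
  }'
}

# Usage (needs kubectl and cluster access, so it is left commented out here):
#   kubectl top nodes --no-headers | hot_nodes 80
# Rough bash equivalent of the Select-String snippet:
#   kubectl describe nodes | grep -A 5 -E 'Conditions|Allocated resources'
```

The function only reads stdin, so it can be tried against a saved capture from the evidence folder before pointing it at a live cluster.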
A war story from Chennai
Here is a real one. A Chennai team's weekly review caught three nodes drifting toward 90 per cent memory two days before they would have started evicting pods, and the timeline was tight. They had stood the workload up nine months earlier, never re-verified it against the kubelet node conditions and AKS Container Insights, and now had to produce a coherent fix plan in less than 48 hours. The fix itself was 75 minutes inside the Azure portal and the CLI. The lead time was 6 hours of cross-team scheduling to get the change window approved. The total business impact was three engineers off their normal sprint for the better part of a working week, plus a Rs 11,300 Microsoft Premier ticket nobody had budgeted for. All of it was avoidable. The control plane was healthy. The institutional memory was not.
I've seen this fail when teams treat Azure resource configuration as a checkbox exercise. It is not. Each switch has a downstream side effect that is rarely obvious from the property name. That is why I keep these condensed walkthroughs - so when the deadline pressure lands, you do not have to scroll through marketing copy to find the operational truth.
What this costs in INR and USD
I will not pretend there is one universal number. There is not. But for a small in-scope environment I help maintain, the monthly cost for AKS node health monitoring plus the adjacent Azure footprint that supports it lands at around Rs 21,500 (roughly US$260) at current exchange rates. Add about 9 to 14 per cent on top if you turn on the optional diagnostic settings and Log Analytics ingestion I recommend below. For a Bengaluru-based startup that is roughly the price of a single mid-tier laptop spread across a year. For an enterprise it is a rounding error. Either way, do not skip this to save Rs 1,500 per month. The next incident review will cost 40 times that.
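The arithmetic behind those figures is easy to check in the shell. This is a small sketch using the numbers quoted above; the helper name with_overhead is hypothetical, and the calculation uses integer rupees only.

```shell
# Integer rupee arithmetic for the figures quoted above.
# with_overhead is a hypothetical helper: usage with_overhead BASE_RS PERCENT
with_overhead() {
  echo $(( $1 * (100 + $2) / 100 ))
}

with_overhead 21500 9     # low end with diagnostics enabled  -> 23435
with_overhead 21500 14    # high end with diagnostics enabled -> 24510
echo $(( 1500 * 40 ))     # 40 times the Rs 1,500 monthly saving -> 60000
```

So the recommended diagnostics take the monthly bill to somewhere between Rs 23,435 and Rs 24,510, while the incident cost implied by the 40-times estimate is Rs 60,000.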
Gotchas I have collected the hard way
- Region drift. Microsoft sometimes lights up new capability in one Azure region weeks before another. I have been bitten twice. Check region availability for the features you depend on, such as AKS Container Insights, before you commit to a design.
- Cached portal state. The Azure portal caches aggressively. If a setting does not appear to change, open an incognito window and re-check before raising a support ticket.
- Scope creep. AKS node health review is often described in docs that reference adjacent capabilities. Read the scope statement carefully and underline every resource type. Anything not on that list is out of scope for this configuration.
- Soft-delete windows. Many Azure resources have 7 to 90 day soft-delete retention defaults. Plan for it. If you delete and recreate inside that window you will see strange artefacts in the portal and CLI.
- Diagnostic log cost. Streaming resource logs to a Log Analytics workspace is cheap per row but adds up if you forget to set retention. I cap mine at 30 days unless audit requires more.
- Role-name confusion. Azure RBAC reuses common English words like 'Reader' across distinct role definitions (Reader, Monitoring Reader, and Log Analytics Reader are all separate built-in roles). Always check the role definition ID, never just the display name.
How I verify the change actually worked
Verification is where most teams cut corners. I do not. Here is my checklist.
- Re-run the same CLI query from a different machine. If the result differs, the issue is local client state, not the resource itself.
- Open the Azure portal in an incognito window and sign in with a least-privilege account to confirm the view matches expectations.
- Check the Activity Log for the past 15 minutes. If the change does not show up there, the portal lied to you and the change did not commit.
- Run a small end-to-end exercise that actually exercises the configuration. For AKS that means a kubectl run smoke pod. For Functions that means a real trigger invocation. For Azure Monitor that means a fresh KQL query.
- Wait 5 minutes and re-check. Some Azure surfaces take that long to propagate across regions.
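The wait-and-re-check step in the checklist above can be scripted so it is harder to skip. The following is a minimal sketch: stable_after is a hypothetical helper name, and the $id variable in the usage comment stands for whatever resource ID you are verifying.

```shell
# Run a command, wait, run it again, and report whether the output stayed the same.
# stable_after is a hypothetical helper: usage stable_after SECONDS COMMAND [ARGS...]
stable_after() {
  delay="$1"
  shift
  first="$("$@")"
  sleep "$delay"
  second="$("$@")"
  if [ "$first" = "$second" ]; then
    echo "stable"
  else
    echo "changed"
    return 1
  fi
}

# Usage (needs the Azure CLI and a real resource ID in $id, so it is commented out):
#   stable_after 300 az resource show --ids "$id" \
#     --query 'properties.provisioningState' --output tsv
```

Pick a command whose output should not change on its own. A plain kubectl get nodes table includes an AGE column, so it is a poor candidate for this kind of comparison.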
If it goes wrong, here is how I roll back
Always have a rollback plan. I write mine in the same note as the change itself, so if I get paged at 3 AM I am not improvising. For most AKS node-health changes the rollback is one of three patterns. Either I re-apply the previous configuration from saved JSON via az resource update --ids $id --set .... Or I restore from a soft-deleted resource. Or, if it is a permission change, I revert the role assignment with az role assignment delete --assignee $obj --scope $scope. None of these is dramatic. All of them need to be rehearsed before the incident, not during it.
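Whichever rollback pattern applies, it helps to confirm afterwards that the resource matches the saved before-state. Here is a minimal sketch of that comparison, assuming both states were exported to text or JSON files with the same command; the function name state_check and the file names in the usage comment are hypothetical, and this is a plain line-by-line diff rather than a JSON-aware one.

```shell
# Compare a saved before-state file with a freshly captured one.
# Prints "match" when identical, otherwise "drift" followed by a unified diff.
# state_check is a hypothetical helper: usage state_check BEFORE_FILE AFTER_FILE
state_check() {
  if diff -u "$1" "$2" > /dev/null; then
    echo "match"
  else
    echo "drift"
    diff -u "$1" "$2"
    return 1
  fi
}

# Usage (needs the Azure CLI and a real resource ID in $id, so it is commented out):
#   az resource show --ids "$id" > after-rollback.json
#   state_check before-change.json after-rollback.json
```

Because the comparison is textual, fields that legitimately change on every read, such as timestamps, will show up as drift and need to be read with that in mind.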
How to apply this in your environment
- Treat this as a starting point. Your subscription is not my subscription. The region mix, SKU choice, and licence footprint will change what is sensible for you.
- Test in a non-production subscription first. Yes, even if you are confident. I have been surprised enough times to keep doing this.
- Pin your evidence. Capture the AKS configuration version, the Azure region, the date, and the business question it answers in your evidence folder.
- Cross-check Microsoft Learn one more time on the day you ship. Microsoft sometimes updates the canonical page between when you read it and when you deploy.
- Schedule a 90-day review. Put it in your team calendar. Node-level health across CPU, memory, disk, network, and kubelet status changes over time. Your configuration should too.
Caveats and what to double-check
- Microsoft renames Azure features. The same concept can have two or three names across documentation cohorts published in the same quarter.
- Some capabilities described in the docs may still be in preview. Confirm general availability before you rely on the contractual SLA.
- Regional availability varies. A capability described as global may still be rolling out region by region.
- Pricing for the workloads that anchor AKS node health monitoring changes regularly. This page does not track pricing. Use the official Azure pricing calculator before you commit budget.
Related work in your environment
- Document this reference in your team wiki. Note which workloads depend on it today and which are planned.
- Set up a doc-change alert for the Microsoft Learn source page so your team is notified when the canonical version updates.
- Add a quarterly review to your governance cadence. AKS node health is not a set-and-forget topic.
References
- Microsoft Learn - official documentation for reviewing the health and performance of AKS nodes
- Azure portal - Diagnose and solve problems and Resource Health blades
- Azure CLI reference - az resource, az monitor, az aks, az functionapp, az acr
- Microsoft Tech Community - peer discussion and operational notes