AKS at KubeCon Europe 2026

This past week, Amsterdam hosted KubeCon + CloudNativeCon Europe 2026, with Microsoft running its traditional PreDay and maintaining a presence throughout the conference. I was unable to attend, but Brendan Burns published a roundup blog post summarising the announcements, and there is a lot in it worth unpacking.

Upstream Kubernetes: DRA Graduates to GA

The headline upstream contribution is that Dynamic Resource Allocation (DRA) has graduated to general availability. DRA makes GPU-backed workloads first-class citizens in Kubernetes scheduling, so clusters can manage hardware resources with the same declarative model used for everything else. The DRA example driver and DRA Admin Access shipped alongside it. DRANet now includes upstream compatibility for Azure RDMA NICs, which matters for GPU-to-NIC topology alignment in training workloads.
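
To make that concrete, here is a minimal sketch of what requesting a GPU through DRA looks like: a ResourceClaimTemplate selects a device by class, the pod references the template, and the container consumes the allocated claim. The DeviceClass name `gpu.example.com` and the image are illustrative assumptions, not taken from the announcement; a real class is published by the GPU driver installed on the cluster.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          # Hypothetical class name; supplied by the installed DRA driver.
          deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  resourceClaims:
  - name: gpu                            # pod-level claim, stamped out from the template
    resourceClaimTemplateName: single-gpu
  containers:
  - name: trainer
    image: example.azurecr.io/trainer:latest   # illustrative image
    resources:
      claims:
      - name: gpu                        # container consumes the allocated device
```

The point of the model is visible even in this sketch: the device request is a declarative object the scheduler can reason about, rather than an opaque extended-resource count.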

Workload Aware Scheduling for Kubernetes 1.36 adds DRA support in the Workload API and drives integration into KubeRay, making it more straightforward for developers to request and manage high-performance infrastructure for training and inference.

Two new CNCF projects also got attention. AI Runway is an open-source project that introduces a common Kubernetes API for inference workloads, with a web interface, HuggingFace model discovery, GPU memory fit indicators, and support for runtimes including NVIDIA Dynamo, KubeRay, llm-d, and KAITO. Meanwhile, Dalec was onboarded as a CNCF project for declarative system package specifications and minimal container images with SBOM generation and provenance attestations.

Identity-Aware Networking

This is probably the most significant theme for teams managing production AKS clusters. The shift from IP-based controls to application-layer, identity-aware networking addresses a real operational pain point.

Azure Kubernetes Application Network is a new preview that provides mutual TLS, application-aware authorization, and traffic telemetry across ingress and in-cluster communication, with built-in multi-region connectivity. The pitch is identity-aware security and real traffic insight without the overhead of running a full service mesh. For teams that have been on the fence about adopting a service mesh because of the operational burden, this could be a compelling middle ground.

Application Routing with Meshless Istio is another preview worth paying attention to, particularly for teams dealing with the announced deprecation of ingress-nginx. It provides Kubernetes Gateway API support without sidecars, continued support for existing ingress-nginx configurations, and contributions to ingress2gateway for teams that need to move incrementally. The ingress-nginx retirement has been creating a lot of noise in the community, so having a well-supported Azure-native migration path is welcome.
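
As a sketch of where migrating teams end up, a typical ingress-nginx path rule maps onto a Gateway plus an HTTPRoute roughly as below. The `gatewayClassName` and service names are assumptions for illustration (the actual class exposed by the add-on may differ), and ingress2gateway can generate the equivalent for real configurations.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: web-gateway
spec:
  gatewayClassName: istio        # assumed class name; check what the add-on registers
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: web-route
spec:
  parentRefs:
  - name: web-gateway            # binds the route to the Gateway above
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: api-service          # illustrative backend Service
      port: 8080
```

The split between Gateway (infrastructure, owned by platform teams) and HTTPRoute (routing rules, owned by application teams) is the main structural change from the single Ingress object.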

At the data plane level, WireGuard encryption with the Cilium data plane secures node-to-node traffic efficiently and without application changes. Cilium mTLS in Advanced Container Networking Services extends that to pod-to-pod communication using X.509 certificates and SPIRE for identity management, giving authenticated and encrypted workload traffic without sidecars. This is still in preview but the direction of travel is clear: encryption everywhere, without the sidecar tax.

Pod CIDR expansion removes a long-standing operational constraint by allowing clusters to grow their pod IP ranges in place rather than requiring a rebuild. Anyone who has run into this limitation knows how painful it is, so this is a meaningful improvement.

Observability Improvements

Two GA features landed here. Container network metrics filtering for AKS lets operators dynamically control which container-level metrics are collected using Kubernetes custom resources, keeping dashboards focused on actionable signals rather than drowning in noise. Container network logs in AKS provide per-flow L3/L4 and supported L7 visibility across HTTP, gRPC, and Kafka traffic, including IPs, ports, workloads, flow direction, and policy decisions, with a new Azure Monitor experience that brings built-in dashboards and one-click onboarding.

AKS managed GPU metrics in Azure Monitor is in preview, surfacing GPU performance and utilization directly into managed Prometheus and Grafana. GPU telemetry has long been a monitoring blind spot, so bringing it into the same stack teams already use for capacity planning and alerting is a welcome addition.

Agentic container networking adds a web-based interface that translates natural-language queries into read-only diagnostics using live telemetry. I have not tried this yet but the concept of shortening the path from “something is wrong” to “here is what to do about it” through natural language is interesting. It is also in preview.

Multi-Cluster Operations and Local Development

Cross-cluster networking in Azure Kubernetes Fleet Manager is now in preview, providing a managed Cilium cluster mesh for unified connectivity across AKS clusters, a global service registry for cross-cluster service discovery, and intelligent routing managed centrally. For organizations running workloads across multiple clusters, this has historically meant custom plumbing and inconsistent service discovery, so having a managed solution is a step forward.

AKS Desktop is now generally available. It brings a full AKS experience to the desktop, making it straightforward for developers to run, test, and iterate on Kubernetes workloads locally with the same configuration they will use in production.

Safer Upgrades and Faster Recovery

This is where I think the most practical improvements landed. Blue-green agent pool upgrades (preview) create a parallel pool with the new configuration rather than applying changes in place. Teams can validate behaviour before shifting traffic and maintain a clear rollback path if something goes wrong. Agent pool rollback complements this by allowing teams to revert a node pool to its previous Kubernetes version and node image when problems surface after an upgrade, without a full rebuild.

Together, these give operators meaningful control over the upgrade lifecycle rather than a choice between “upgrade and hope” or “stay behind.” This has been a real pain point for teams running production workloads on AKS.

Prepared image specification lets teams define custom node images with preloaded containers, operating system settings, and initialisation scripts, reducing startup time and improving consistency for environments that need rapid, repeatable provisioning during scale-out events.

Key Takeaways

  • DRA graduating to GA is the headline upstream contribution, making GPU workloads properly schedulable in Kubernetes
  • Identity-aware networking through Azure Kubernetes Application Network and meshless Istio gives teams a realistic alternative to full service mesh adoption
  • Container network metrics filtering and network logs reaching GA improves day-two operations for teams already running AKS
  • Blue-green agent pool upgrades and rollback address one of the most common operational anxieties around AKS cluster management
  • AKS Desktop reaching GA gives developers a local AKS experience matching production configuration

Closing Thoughts

There is a lot to digest here. I want to spend some time with the networking improvements in particular, as the shift from IP-based to identity-aware controls feels like it could materially change how I think about cluster security posture.

Further Reading