Why is Kubernetes being called the 'operating system' for AI?

According to CNCF data, 66% of generative AI workloads now run on Kubernetes. KubeCon 2026 showcased K8s' transformation from a container orchestrator into a full-stack AI platform: fine-grained GPU scheduling (DRA driver, KAI scheduler), native LLM inference orchestration (llm-d framework), AI Agent lifecycle management (MCP integration), and cloud-native AI security. Together, these capabilities make K8s a unified control plane for AI workloads, spanning training, inference, deployment, and monitoring.

What are the differences between GPU Time-Slicing and MIG, and when should each be used?

GPU Time-Slicing is a software-level GPU sharing solution in which multiple workloads take turns on the same GPU. It requires no special hardware but provides no memory isolation between workloads, so it suits inference tasks that can tolerate higher latency. NVIDIA MIG partitions a physical GPU into independent instances at the hardware level, each with dedicated compute, memory, and bandwidth. This hardware-level isolation makes MIG the better choice for production environments that require performance guarantees. A100 and H100 GPUs support up to 7 MIG instances each.
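
The trade-off above can be written down as a small decision helper. This is only a sketch of one possible policy, not an NVIDIA or Kubernetes API: the `Workload` fields, the function names, and the "dedicated" fallback (a whole GPU when MIG is unavailable) are assumptions made for illustration. The limit of 7 instances is taken from the answer above.

```python
from dataclasses import dataclass

# From the answer above: A100/H100 GPUs support up to 7 MIG instances.
MAX_MIG_INSTANCES = 7


@dataclass
class Workload:
    name: str
    needs_isolation: bool   # requires dedicated memory/compute guarantees
    latency_tolerant: bool  # can absorb delays caused by sharing the GPU in time


def choose_sharing_mode(workload: Workload, gpu_supports_mig: bool) -> str:
    """Return 'mig', 'time-slicing', or 'dedicated' (hypothetical policy)."""
    if workload.needs_isolation:
        # Only hardware partitioning gives a shared GPU real isolation.
        return "mig" if gpu_supports_mig else "dedicated"
    if workload.latency_tolerant:
        # No isolation needed and delays are acceptable: share the GPU in time.
        return "time-slicing"
    # Latency-sensitive but without strict isolation needs: prefer MIG if present.
    return "mig" if gpu_supports_mig else "dedicated"


def mig_gpus_needed(instance_count: int) -> int:
    """Physical GPUs needed if each workload gets one MIG instance
    and every GPU is split into the maximum of 7 instances."""
    return -(-instance_count // MAX_MIG_INSTANCES)  # ceiling division


if __name__ == "__main__":
    batch = Workload("batch-embeddings", needs_isolation=False, latency_tolerant=True)
    prod = Workload("prod-chat-api", needs_isolation=True, latency_tolerant=False)
    print(batch.name, "->", choose_sharing_mode(batch, gpu_supports_mig=True))
    print(prod.name, "->", choose_sharing_mode(prod, gpu_supports_mig=True))
    print("GPUs for 10 MIG instances:", mig_gpus_needed(10))
```

In a real cluster the choice is expressed through device plugin or GPU operator configuration rather than application code. The helper only makes the selection criteria explicit.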

What pain points does the llm-d framework address in LLM deployment?

llm-d addresses three core challenges: inference-aware traffic management that routes each request to the most suitable node based on KV cache state; native orchestration of large models that span multiple nodes, with automatic management of tensor and pipeline parallelism; and a hardware-agnostic design that supports NVIDIA, AMD, Intel, and other platforms. It also redefines Service Level Indicators for LLM inference, introducing AI-specific metrics such as Time to First Token (TTFT) and Time Per Output Token (TPOT).
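
To make two of these ideas concrete, here is a minimal sketch of KV-cache-aware replica selection and of computing TTFT and TPOT from token timestamps. It is not llm-d's actual code or API. The `Replica` class, the block size, the tie-break on queue depth, and the function names are all hypothetical, and TPOT is computed with one common definition (average gap between output tokens after the first).

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # hypothetical number of tokens per cached KV block


@dataclass
class Replica:
    name: str
    # Block-aligned token prefixes whose KV cache this replica holds (assumed model).
    cached_prefixes: set = field(default_factory=set)
    queue_depth: int = 0


def cached_prefix_len(replica: Replica, tokens: list) -> int:
    """Length of the longest block-aligned prefix of `tokens` cached on `replica`."""
    best = 0
    for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
        if tuple(tokens[:end]) not in replica.cached_prefixes:
            break
        best = end
    return best


def pick_replica(replicas: list, tokens: list) -> Replica:
    """Prefer the replica with the most reusable KV cache, then the shortest queue."""
    return max(replicas, key=lambda r: (cached_prefix_len(r, tokens), -r.queue_depth))


def ttft_tpot(request_time: float, token_times: list) -> tuple:
    """Return (TTFT, TPOT) in the same time unit as the inputs.

    TTFT: delay from the request to the first output token.
    TPOT: average gap between consecutive output tokens after the first.
    """
    if not token_times:
        raise ValueError("at least one output token timestamp is required")
    ttft = token_times[0] - request_time
    gaps = len(token_times) - 1
    tpot = (token_times[-1] - token_times[0]) / gaps if gaps else 0.0
    return ttft, tpot


if __name__ == "__main__":
    prompt = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    warm = Replica("warm", {(1, 2, 3, 4), (1, 2, 3, 4, 5, 6, 7, 8)}, queue_depth=3)
    cold = Replica("cold", set(), queue_depth=0)
    print("routed to:", pick_replica([cold, warm], prompt).name)
    print("TTFT, TPOT:", ttft_tpot(0.0, [0.5, 0.75, 1.0, 1.25]))
```

The sketch routes to the warm replica even though its queue is longer, because reusing cached prefix blocks avoids recomputing the prompt. A production router would weigh cache reuse against load rather than ranking strictly by cache hits.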