Fractional GPUs in Kubernetes
Overview
The GenAI revolution has led to a surge in GPU demand across the industry. Companies want to train, fine-tune, and deploy LLMs at massive scale. This has reduced availability and driven up prices for the latest GPUs, and companies running workloads on public cloud have been hit by high prices and growing uncertainty around GPU availability.
These new realities make squeezing the most out of every available GPU absolutely critical. Partitioning or sharing a single GPU between multiple processes helps with this, and implementing it on top of Kubernetes is a winning combination: we get autoscaling and a sophisticated scheduler to help optimize GPU utilization.
Options for sharing GPUs
To share a single GPU between multiple workloads in Kubernetes, these are the options we have -
MIG
Multi-Instance GPU (MIG) allows GPUs based on the NVIDIA Ampere architecture (such as the NVIDIA A100) to be securely partitioned into separate GPU instances for CUDA applications. Each partition is fully memory- and compute-isolated and can provide predictable throughput and latency.
A single NVIDIA A100 GPU can be partitioned into up to 7 isolated GPU instances. Each partition appears as a separate GPU to the software running on a partitioned node. Other MIG-supported GPUs and the number of partitions they support are listed in NVIDIA's MIG documentation.
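As an illustration of what this looks like in practice (a minimal sketch, not part of any official guide): assuming the GPU operator is running with its MIG manager enabled and the MIG strategy set to mixed, a MIG-capable node can be partitioned by applying a profile label, and each resulting slice can then be requested by a pod like a regular GPU. The node name is a placeholder -
# Partition the GPU on a MIG-capable node into 1g.5gb instances
# (assumes the GPU operator's MIG manager and the "mixed" MIG strategy)
$ kubectl label node <gpu-node-name> nvidia.com/mig.config=all-1g.5gb --overwrite

# Each MIG instance is advertised as its own resource that a pod can request
$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  containers:
  - name: cuda-sample-vector-add
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1
EOF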
Pros
- Full compute and memory isolation that can support predictable latency and throughput
- The nvidia-device-plugin for Kubernetes has native support for MIG
Cons
- Only supported on recent GPUs such as the A100, H100, and A30, which limits the hardware options available
- The number of partitions has a hard limit of 7 for most architectures. This is quite restrictive if we are running many small workloads with limited memory and compute requirements
Time slicing
Time slicing enables multiple workloads to be scheduled on the same GPU. Compute time is shared between the processes, which are interleaved in time. A cluster administrator can configure a cluster or node to advertise a certain number of replicas per GPU, and the nodes are reconfigured accordingly.
Pros
- No upper limit to the number of pods that can share a single GPU
- Works with older generations of NVIDIA GPUs
Cons
- No memory or fault isolation. There is no built-in way to ensure a workload doesn't overrun the memory assigned to it.
- Time slicing gives equal time to all running processes, so a pod running multiple processes can hog the GPU much more than intended
Time slicing Demo
Let's walk through a short demo of how to use time slicing on Azure Kubernetes Service (AKS). We start with an existing Kubernetes cluster.
- Add a GPU-enabled node pool to the cluster -
$ az aks nodepool add \
    --name <nodepool-name> \
    --resource-group <resource-group-name> \
    --cluster-name <cluster-name> \
    --node-vm-size Standard_NC4as_T4_v3 \
    --node-count 1
This adds a new node pool with a single node to the existing AKS cluster; the Standard_NC4as_T4_v3 VM size comes with a single NVIDIA T4 GPU. This can be verified by running the following, which should report 1 allocatable GPU -
$ kubectl get nodes <gpu-node-name> -o 'jsonpath={.status.allocatable.nvidia\.com\/gpu}'
- Install the NVIDIA GPU Operator. AKS GPU node pools come with the NVIDIA driver and container runtime preinstalled, so the operator is told not to manage them -
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update
$ helm install gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --set driver.enabled=false \
    --set toolkit.enabled=false \
    --set operator.runtimeClass=nvidia-container-runtime
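Before moving on, it is worth confirming that the operator components have come up (a quick sanity check, not strictly required) -
# All pods in the gpu-operator namespace should eventually be Running or Completed
$ kubectl get pods -n gpu-operator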
- Once the operator is installed, we create a time-slicing configuration and configure the whole cluster to slice GPU resources wherever they are available -
# The config map must live in the GPU operator's namespace
$ kubectl apply -n gpu-operator -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 10
EOF

# Reconfigure the GPU operator to pick up the config map
$ kubectl patch clusterpolicy/cluster-policy \
    -n gpu-operator --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
- Verify that the existing node has been successfully reconfigured -
$ kubectl get nodes <gpu-node-name> -o 'jsonpath={.status.allocatable.nvidia\.com\/gpu}'
10
- We can verify the configuration by creating a deployment with 4 replicas, each requesting one nvidia.com/gpu resource -
$ kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: time-slicing-verification
  labels:
    app: time-slicing-verification
spec:
  replicas: 4
  selector:
    matchLabels:
      app: time-slicing-verification
  template:
    metadata:
      labels:
        app: time-slicing-verification
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      hostPID: true
      containers:
        - name: cuda-sample-vector-add
          image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
          command: ["/bin/bash", "-c", "--"]
          args:
            - while true; do /cuda-samples/vectorAdd; done
          resources:
            limits:
              nvidia.com/gpu: 1
EOF
Verify that all four pods of this deployment are running on the single GPU node created earlier, which would not have been possible without time slicing since the node advertises only one physical GPU.
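One way to check is to list the pods together with the node they were scheduled on; the label selector matches the deployment above -
# All 4 pods should be Running, and the NODE column should show the same GPU node
$ kubectl get pods -l app=time-slicing-verification -o wide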
Conclusion
The GenAI revolution has changed the landscape of GPU requirements and made responsible resource utilization more critical than ever. Both approaches outlined here have their shortcomings, but in the current climate there is no way around being disciplined about GPU costs.