Use GPUs in Docker containers and Kubernetes/OpenShift

Xin Cheng
7 min read · Feb 18, 2021

Using a GPU on a managed machine learning platform (e.g. Azure Machine Learning, Amazon SageMaker, Google Cloud AI Platform) is easy, as lots of details are abstracted away from you. For example, when you use their machine learning SDK, your machine learning code is usually packaged into a docker container with the supporting machine learning framework and executed on the platform, and usually you don't need to build the container yourself, since the platform does all of this for you. However, if you don't have access to these platforms but still want to leverage the modern container approach in your machine learning development lifecycle, you need to know a bit more about how code inside a container can leverage the GPU installed on the host machine.

NVIDIA GPUs and CUDA are the most popular in this domain, so we will focus on them here. The following article explains the process on Docker very well; read it before continuing. Below I add my own understanding as a complement.

Run on Docker

1. You don't need to install the NVIDIA CUDA driver in the docker image. A GPU is specific hardware, and you usually need to install a driver for it (like a printer driver). You can do this inside the image (it is called the brute-force approach in the above article). However, the challenge is that there are lots of different GPUs; installing a specific GPU driver binds the docker image to the GPU that driver targets, which increases management burden and defeats the purpose of portability. In addition, when I tried to build such an NVIDIA CUDA image myself on a server without a GPU, it simply failed. As the following diagram shows, the CUDA driver is installed on the host OS, while the docker image only contains the CUDA toolkit. This keeps the docker image CUDA-driver agnostic, making it more portable and stable.

2. However, when your container runs, it still needs the CUDA driver to properly access the GPU on the host server. To solve this, NVIDIA provides a container runtime library and NVIDIA Docker: when you use NVIDIA Docker to launch a container image, it automatically configures the container to leverage the NVIDIA GPUs installed on the host OS. The only difference from a plain docker command is adding "--gpus <number of GPUs or all>" (a quick sanity check appears after the nvidia-smi example below).

3. Usually you would use a pytorch or tensorflow image from Docker Hub as the base for your custom image. Make sure the version is compiled against the NVIDIA CUDA version on the host OS; otherwise, code in the container won't recognize the GPUs on the host OS. To check the NVIDIA CUDA version, run nvidia-smi on the host OS and look for "CUDA Version"; then in Docker Hub, look for a tag containing cuda<CUDA version on host OS>.

For example, the host OS nvidia-smi result:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.165.02    Driver Version: 418.165.02    CUDA Version: 10.1   |
|-------------------------------+----------------------+----------------------+

In Docker Hub, look for a tag containing cuda10.1, e.g. 1.6.0-cuda10.1-cudnn7-runtime or 1.6.0-cuda10.1-cudnn7-devel.
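
Putting points 2 and 3 together, here is a minimal sanity check from the host (a sketch; it assumes the NVIDIA Container Toolkit is installed and that the host reports CUDA 10.1 as above):

# Run a matching-tag image and ask pytorch whether it can see the GPU
docker run --rm --gpus all pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime \
  python -c "import torch; print(torch.cuda.is_available())"
# Prints True when the driver is exposed to the container and the versions match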

Run on Kubernetes/OpenShift

Kubernetes/OpenShift adds another layer: the scheduling layer. The cluster may contain both GPU nodes and non-GPU nodes; if you want to run GPU code, you want your Pod to be scheduled on a GPU node. So on the cluster side, there is something called a "device plugin" that lets Pods access specialized hardware features such as GPUs. On OpenShift, the NVIDIA GPU Operator further automates the setup.
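
A quick way to verify that the device plugin (or GPU Operator) is advertising GPUs, assuming you have kubectl access (the node name is a placeholder):

# Capacity/Allocatable lines show how many GPUs the device plugin exposes
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu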

How a docker container requests a GPU is pretty simple and consistent across OpenShift, AKS, EKS, and GKE:

resources:
  limits:
    nvidia.com/gpu: 1 # requesting 1 GPU
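
For context, here is a minimal Pod sketch (the names and image tag are illustrative, not from the original; it assumes the device plugin or GPU Operator is installed on the cluster):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-check                 # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-check
    image: pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime   # example tag
    command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
    resources:
      limits:
        nvidia.com/gpu: 1         # the scheduler only places this Pod on a node with a free GPU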

Reference

In case you want to build NVIDIA CUDA, pytorch, or tensorflow docker images yourself, use the following as a start.

NVIDIA CUDA Dockerfile

some version hack

Make sure LD_LIBRARY_PATH includes the cuda and cudnn libraries (otherwise pytorch or tensorflow may not be able to detect the GPU within the container). Also, different versions of tensorflow expect version-specific CUDA file names; the symlink trick is not recommended but is a quick workaround (although I hope the industry can come up with something better than this symlink hack or recompilation).
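
As a hypothetical illustration of both tricks (the paths and versions are assumptions matching the tensorflow 1.15 / CUDA 10.2 mismatch discussed below; adjust to your image):

# Make sure the dynamic loader can find the cuda and cudnn libraries
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Symlink hack: tensorflow-gpu 1.15 looks for libcudart.so.10.0,
# but only the 10.2 toolkit is installed in this scenario
ln -s /usr/local/cuda/lib64/libcudart.so.10.2 /usr/local/cuda/lib64/libcudart.so.10.0

In a Dockerfile, the export becomes an ENV LD_LIBRARY_PATH=... instruction.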

Pytorch Dockerfile

Tensorflow Dockerfile

Also, if you are in a constrained environment and cannot access those public Docker Hub images, you may need to build the image yourself. In this situation, pay attention to the compatibility between the CUDA version and the deep learning framework version. I find pytorch's version compatibility to be better and more explicit.

Tensorflow 1.x seems to be pickier, but 2.x seems to have improved.

For example, on the page above, tensorflow-gpu-1.15.0 is tested against CUDA 10.0. When the CUDA version is 10.2, you have to compile from source (otherwise, it won't find the necessary CUDA files). Here is one good article:

However, when I use tensorflow 2.3.1, it works with CUDA 10.2, although the page says it was only tested on 10.1 (which suggests CUDA 10.1 builds are compatible with 10.2?). If you are starting a new tensorflow project, start with 2.x; it could save you lots of pain.
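
A quick way to see which CUDA version a tensorflow wheel was built against (this API exists in newer 2.x releases; it may be missing in older ones):

python -c "import tensorflow as tf; print(tf.sysconfig.get_build_info()['cuda_version'])"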

Host OS CUDA version and Docker image CUDA version compatibility

CUDA claims backward compatibility, meaning that applications compiled against a particular version of CUDA will continue to work on subsequent (later) driver releases.

Here is a quick test (host driver CUDA version 11.0):

nvidia-smi always shows CUDA 11.0 inside the container (it reports the host driver's CUDA version, not the container's toolkit)

pytorch
pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime: host cuda 11.0, container 11.1, torch.cuda.is_available() is True
pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime: host cuda 11.0, container 10.1, torch.cuda.is_available() is True

tensorflow
tensorflow/tensorflow:2.4.1-gpu: tf.test.is_gpu_available() returns True, cuda 11.0 in /usr/local
tensorflow/tensorflow:2.3.2-gpu: tf.test.is_gpu_available() returns True, cuda 10.1 in /usr/local
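
The checks above can be reproduced with one-liners like these (a sketch using the tags from the list; assumes the NVIDIA Container Toolkit on the host):

docker run --rm --gpus all pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime \
  python -c "import torch; print(torch.cuda.is_available())"

docker run --rm --gpus all tensorflow/tensorflow:2.4.1-gpu \
  python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"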

However, when the CUDA driver is 11.2, torch.cuda.is_available() returns False in a docker image with the 10.2 CUDA toolkit and pytorch 1.6.0 (update: after the LD_LIBRARY_PATH trick to include the cuda and cudnn libraries, it returns True; further update: a simpletransformers model encounters an "Unable to write to file </torch_>" error; pytorch seems to be more shared-memory-hungry now. There is a shared-memory trick on OpenShift, but the Kubernetes side is still open).
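
For the Kubernetes side, a commonly cited workaround (not something this article verified) is to mount an in-memory emptyDir over /dev/shm, sketched below as a Pod spec fragment:

spec:
  containers:
  - name: trainer                 # hypothetical name
    image: pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm         # give pytorch DataLoader workers more shared memory
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory              # tmpfs-backed shared memory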

Also, the article below mentions no GPU improvement for tesseract, while easyOCR (which uses a pytorch backend) sees an amazing improvement. In our test, the GPU speeds things up about 7x on average. So there are multiple OCR choices, with tradeoffs depending on the use case.

Best GPU for deep learning

