Distributed Python application with Ray

Xin Cheng
May 22, 2021 · 5 min read


In the Python world, the multiprocessing library has been widely used to parallelize computation on a single machine. However, distributed multi-node, multi-process applications are still hard to write. This article shows how you can easily turn a multiprocessing Python application into a distributed, containerized application with Ray and Kubernetes. It was inspired by several related articles.

If you are familiar with Python's native multiprocessing.Pool for parallel processing, you will find migrating to ray.util.multiprocessing.Pool quite easy. If you are not, here is the basic pattern.

from multiprocessing import Pool
import time

# tasks to be parallelized
work = (["A", 5], ["B", 2], ["C", 1], ["D", 3])

# worker function
def work_log(work_data):
    print(" Process %s waiting %s seconds" % (work_data[0], work_data[1]))
    time.sleep(int(work_data[1]))
    print(" Process %s Finished." % work_data[0])

# driver
def pool_handler():
    p = Pool(2)
    # magic here: pass a list of tasks, and Python multiprocessing takes care of parallelization and distribution
    p.map(work_log, work)

if __name__ == '__main__':
    pool_handler()

It turns out that to convert single-machine Python multiprocessing into a distributed Ray Pool, you only need the changes marked with comments below:

# from multiprocessing import Pool
from ray.util.multiprocessing import Pool
import ray
import time

# tasks to be parallelized
work = (["A", 5], ["B", 2], ["C", 1], ["D", 3])

# worker function
def work_log(work_data):
    print(" Process %s waiting %s seconds" % (work_data[0], work_data[1]))
    time.sleep(int(work_data[1]))
    print(" Process %s Finished." % work_data[0])

# driver
def pool_handler():
    # only needed when connecting to a remote Ray cluster; note that passing
    # pool = Pool(ray_address="<ip_address>:<port>") directly raised a redis.exceptions Protocol error for me
    ray.util.connect("<ray_cluster_address>")
    # p = Pool(2)
    pool = Pool(ray_address="auto")
    # magic here: pass a list of tasks, and Ray takes care of parallelization and distribution
    pool.map(work_log, work)

if __name__ == '__main__':
    pool_handler()

I used AWS EKS to run EasyOCR powered by Ray. Here are the high-level steps:

  1. Create an AWS Elastic Container Registry (ECR) repository for hosting our images
  2. Build a Docker image containing our code and model, and push it to ECR
  3. Create an EKS cluster
  4. Deploy a Ray cluster on EKS and verify it
  5. Run the distributed OCR app on Ray

Create ECR

For example, if you want to push a hello-world image, create a repository named hello-world:
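A minimal example with the AWS CLI (the repository name and region here are placeholders, not from the original article):

# create a private ECR repository named hello-world
aws ecr create-repository --repository-name hello-world --region us-east-1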

Get credentials so that we can push Docker images to ECR:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.region.amazonaws.com

For a public registry:

aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws

Build the Docker image and push to ECR

I baked the EasyOCR models into the image so they do not need to be downloaded at runtime. The model download locations are:

mkdir -p models
cd models && wget https://github.com/JaidedAI/EasyOCR/releases/download/pre-v1.1.6/craft_mlt_25k.zip
wget https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/english_g2.zip
unzip craft_mlt_25k.zip
unzip english_g2.zip
mkdir -p unzipped
mv craft_mlt_25k.pth unzipped/
mv english_g2.pth unzipped/
# get test images
mkdir -p images && wget -O images/english.png https://github.com/JaidedAI/EasyOCR/raw/master/examples/english.png
cd images && for i in {1..50}; do cp english.png "english$i.png"; done

Here is my Dockerfile
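The original Dockerfile embed is not reproduced here; below is only a rough sketch of what such an image might look like (the base image, dependency installation, and paths are my assumptions, not the author's exact file):

# sketch: bake the code, EasyOCR, and the pre-downloaded models into the image
FROM rayproject/ray:nightly
RUN pip install easyocr
COPY models/unzipped /models
COPY images /images
COPY app.py /app.py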

app.py
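The app.py embed is not shown here either; the following is a rough sketch consistent with the Pool pattern above (the /models and /images paths, language list, and reader options are assumptions, not the author's exact code):

# sketch: run OCR over every image through a Ray multiprocessing Pool
import glob
from ray.util.multiprocessing import Pool
import easyocr

def ocr_image(image_path):
    # load the models baked into the image (path is an assumption)
    reader = easyocr.Reader(['en'], gpu=False,
                            model_storage_directory='/models',
                            download_enabled=False)
    return image_path, reader.readtext(image_path, detail=0)

def main():
    pool = Pool(ray_address="auto")  # connect to the running Ray cluster
    for path, text in pool.map(ocr_image, glob.glob("/images/*.png")):
        print(path, text)

if __name__ == '__main__':
    main()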

Build and push

docker build -t aws_account_id.dkr.ecr.region.amazonaws.com/rayocr:0.1 -f Dockerfile .
docker push aws_account_id.dkr.ecr.region.amazonaws.com/rayocr:0.1

Install related tools (terraform, kubectl, aws) and create EKS

I chose t2.2xlarge instances to provide enough resources.
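The Terraform configuration embed is not shown here; as a rough alternative sketch, an eksctl command like the following (cluster name, region, and node count are placeholders) creates a comparable cluster:

# example: 2-node EKS cluster with t2.2xlarge instances
eksctl create cluster --name ray-demo --region us-east-1 --node-type t2.2xlarge --nodes 2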

Deploy ray cluster on Kubernetes

Notice that ray/values.yaml sets operatorImage: rayproject/ray:nightly. With rayproject/ray:latest or rayproject/ray:1.3.0, I encountered “ValueError: invalid literal for int() with base 10”.

I took the shortcut of baking the EasyOCR dependency into both the head and worker nodes, so I changed the image setting in the same values.yaml to my image.

I also changed the default head and worker resources to be large enough (head: 1 CPU, 8Gi memory; workers: 4 CPUs, 16Gi memory). Otherwise, I encountered “A worker died or was killed while executing task” due to lack of resources.
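For reference, the relevant part of values.yaml looked roughly like this in the Ray 1.x Helm chart (field names can differ between chart versions; the image and resource values simply reflect the settings described above):

image: aws_account_id.dkr.ecr.region.amazonaws.com/rayocr:0.1
operatorImage: rayproject/ray:nightly
podTypes:
  rayHeadType:
    CPU: 1
    memory: 8Gi
  rayWorkerType:
    minWorkers: 2
    maxWorkers: 2
    CPU: 4
    memory: 16Gi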

Deploy with helm

cd ray/deploy/charts
helm -n ray install example-cluster --create-namespace ./ray

Verify

kubectl -n ray get rayclusters
kubectl -n ray get pods
kubectl -n ray get service
kubectl get deployment ray-operator
kubectl get pod -l cluster.ray.io/component=operator
kubectl get crd rayclusters.cluster.ray.io

Execute sample job

kubectl -n ray create -f https://raw.githubusercontent.com/ray-project/ray/master/doc/kubernetes/job-example.yaml

Grant EKS access to ECR (if using private registry)
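The detailed steps are not embedded here; one common approach (an assumption about the setup, not necessarily the author's exact steps) is to attach the managed read-only ECR policy to the EKS node instance role:

# <NodeInstanceRole> is the IAM role attached to the EKS worker nodes
aws iam attach-role-policy --role-name <NodeInstanceRole> --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly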

Test

Finally, you can run the Ray EasyOCR job.

kubectl -n ray create -f job.yaml

I saw that the head node got a few tasks, and each of the 2 worker nodes got 4 processes (matching the number of CPUs).
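The job.yaml referenced above is not reproduced here; a minimal sketch, assuming the image pushed earlier and an app.py that connects to the cluster itself, might look like this:

# sketch: run app.py from the custom image as a Kubernetes Job in the ray namespace
apiVersion: batch/v1
kind: Job
metadata:
  name: ray-ocr-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ray-ocr
          image: aws_account_id.dkr.ecr.region.amazonaws.com/rayocr:0.1
          command: ["python", "/app.py"]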

AKS steps

Azure Cloud Shell already has the az CLI, docker, kubectl, and terraform installed, so use it for a quick start.

Create Azure Cloud Shell

Create service principal

You can think of a service principal as a service account; Azure AD uses it to identify an application.

az ad sp create-for-rbac --skip-assignment

Then copy the corresponding appId and password from the command output.

Change aks-cluster.tf to the following code to provision AKS and ACR.
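The Terraform embed is not shown here; below is only a rough sketch of provisioning ACR and AKS together (resource names, sizes, and the appId/password variables are placeholders, not the author's exact configuration):

provider "azurerm" {
  features {}
}

# service principal credentials from az ad sp create-for-rbac
variable "appId" {}
variable "password" {}

resource "azurerm_resource_group" "rg" {
  name     = "ray-demo-rg"
  location = "eastus"
}

resource "azurerm_container_registry" "acr" {
  name                = "testacr1011"
  resource_group_name = azurerm_resource_group.rg.name
  location            = azurerm_resource_group.rg.location
  sku                 = "Basic"
}

resource "azurerm_kubernetes_cluster" "aks" {
  name                = "ray-aks"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  dns_prefix          = "rayaks"

  default_node_pool {
    name       = "default"
    node_count = 2
    vm_size    = "Standard_D4s_v3"
  }

  service_principal {
    client_id     = var.appId
    client_secret = var.password
  }
}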

You can use the following commands to verify that AKS was created:

kubectl get nodes
kubectl get ns

Build docker image for ACR

echo FROM mcr.microsoft.com/hello-world > Dockerfile
az acr build --image sample/hello-world:v1 \
  --registry testacr1011 \
  --file Dockerfile .
kubectl create deployment --image=testacr1011.azurecr.io/sample/hello-world:v1 hello-app

# ray easyocr
az acr build --image rayeasyocr/rayeasyocr:0.1 --registry testacr1011 --file Dockerfile .

If AKS cannot pull images from ACR, attach AKS to your ACR:

az aks update -n <AKS cluster name> -g myResourceGroup --attach-acr <acr-name>

More info
