In the Python world, the multiprocessing library has been widely used to parallelize computation on a single machine. However, distributed multi-node, multi-process applications are still hard to write. This article shows how you can easily turn a multiprocessing Python application into a distributed, containerized application with Ray and Kubernetes. It was inspired by the following articles.
If you are familiar with Python's native multiprocessing.Pool for parallel processing, you will find migrating to ray.util.multiprocessing.Pool quite easy. If you are not, here is the basic pattern.
from multiprocessing import Pool
import time

# tasks to be parallelized
work = (["A", 5], ["B", 2], ["C", 1], ["D", 3])

# worker function
def work_log(work_data):
    print(" Process %s waiting %s seconds" % (work_data[0], work_data[1]))
    time.sleep(int(work_data[1]))
    print(" Process %s Finished." % work_data[0])

# driver
def pool_handler():
    p = Pool(2)
    # magic here: pass the list of tasks; multiprocessing takes care of
    # parallelization and distribution
    p.map(work_log, work)

if __name__ == '__main__':
    pool_handler()
It turns out that to convert single-machine Python multiprocessing into a distributed Ray Pool, you only need the changes marked in the comments below:
# from multiprocessing import Pool
from ray.util.multiprocessing import Pool
import ray
import time

# tasks to be parallelized
work = (["A", 5], ["B", 2], ["C", 1], ["D", 3])

# worker function
def work_log(work_data):
    print(" Process %s waiting %s seconds" % (work_data[0], work_data[1]))
    time.sleep(int(work_data[1]))
    print(" Process %s Finished." % work_data[0])

# driver
def pool_handler():
    # If not connecting to a remote Ray cluster, the following line is not
    # needed. By the way, pool = Pool(ray_address="<ip_address>:<port>")
    # raised a redis.exceptions "Protocol error" for me.
    ray.util.connect("<ray_cluster_address>")
    # p = Pool(2)
    pool = Pool(ray_address="auto")
    # same magic here: pass the list of tasks; Ray takes care of
    # parallelization and distribution across the cluster
    pool.map(work_log, work)

if __name__ == '__main__':
    pool_handler()
I used AWS EKS to run EasyOCR powered by Ray. Here are the high-level steps:
- Create an AWS Elastic Container Registry (ECR) repository for hosting our images
- Build a Docker image containing our code and model, and push it to ECR
- Create an EKS cluster
- Deploy a Ray cluster on EKS and verify it
- Run the distributed OCR app on Ray
Create ECR
e.g., if you want to push a hello-world image, create a repository named hello-world.
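The repository itself can be created with the AWS CLI; a sketch, assuming region us-east-1 and the repository name hello-world:

```shell
# create a private ECR repository named hello-world
aws ecr create-repository --repository-name hello-world --region us-east-1
```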
Get credentials so that we can push Docker images to ECR:
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.region.amazonaws.com
For a public registry:
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws
Build the Docker image and push it to ECR
I baked the EasyOCR models into the image so that it does not need to download them at runtime. I found the download locations:
mkdir -p models
cd models && wget https://github.com/JaidedAI/EasyOCR/releases/download/pre-v1.1.6/craft_mlt_25k.zip
wget https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/english_g2.zip
unzip craft_mlt_25k.zip
unzip english_g2.zip
mkdir -p unzipped
mv craft_mlt_25k.pth unzipped/
mv english_g2.pth unzipped/
# get test images
mkdir -p images && wget -O images/english.png https://github.com/JaidedAI/EasyOCR/raw/master/examples/english.png
cd images && for i in {1..50}; do cp english.png "english$i.png"; done
Here is my Dockerfile:
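The original post embedded the Dockerfile as a gist that is not reproduced here. A minimal sketch of what such a Dockerfile could look like, assuming a rayproject/ray base image and the models/ and images/ directories prepared above (the image tag and file paths are assumptions):

```dockerfile
# assumption: same Ray image family as the cluster's operatorImage
FROM rayproject/ray:nightly

# EasyOCR pulls in torch/torchvision as dependencies
RUN pip install easyocr

# bake the pre-downloaded models and test images into the image
COPY models/unzipped /app/models
COPY images /app/images
COPY app.py /app/app.py

WORKDIR /app
```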
And here is app.py:
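app.py was likewise embedded as a gist. A sketch of the driver, based on the Pool pattern earlier, assuming the model and image paths were baked into the image at /app/models and /app/images (those paths are assumptions; model_storage_directory, download_enabled, and readtext's detail parameter are real EasyOCR options):

```python
import glob
from ray.util.multiprocessing import Pool
import easyocr

def ocr_image(path):
    # each worker loads the baked-in models (no runtime download) and OCRs one image
    reader = easyocr.Reader(['en'],
                            model_storage_directory='/app/models',
                            download_enabled=False)
    return path, reader.readtext(path, detail=0)

def pool_handler():
    # connect to the Ray cluster the driver pod is running in
    pool = Pool(ray_address="auto")
    for path, text in pool.map(ocr_image, sorted(glob.glob('/app/images/*.png'))):
        print(path, text)

if __name__ == '__main__':
    pool_handler()
```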
Build and push
docker build -t aws_account_id.dkr.ecr.region.amazonaws.com/rayocr:0.1 -f Dockerfile .
docker push aws_account_id.dkr.ecr.region.amazonaws.com/rayocr:0.1
Install the related tools (Terraform, kubectl, AWS CLI) and create the EKS cluster
I chose t2.2xlarge instances to provide enough resources.
Deploy the Ray cluster on Kubernetes
Notice that ray/values.yaml sets operatorImage: rayproject/ray:nightly. With rayproject/ray:latest or rayproject/ray:1.3.0, I encountered "Python ValueError: invalid literal for int() with base 10".
I took the shortcut of including the EasyOCR dependency in both the head and worker nodes, so I modified the image in the following file to point to my image.
I also changed the default head and worker resources to be sufficient (head: 1 CPU, 8Gi memory; worker: 4 CPUs, 16Gi memory). Otherwise I encountered "A worker died or was killed while executing task" due to lack of resources.
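In the Helm chart these limits live under the pod type definitions in ray/values.yaml; a sketch of the relevant fields with the values above (field names follow the ray-operator chart of that era, so verify against your chart version):

```yaml
podTypes:
  rayHeadType:
    CPU: 1
    memory: 8Gi
  rayWorkerType:
    minWorkers: 2
    maxWorkers: 2
    CPU: 4
    memory: 16Gi
```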
Deploy with helm
cd ray/deploy/charts
helm -n ray install example-cluster --create-namespace ./ray
Verify
kubectl -n ray get rayclusters
kubectl -n ray get pods
kubectl -n ray get service
kubectl get deployment ray-operator
kubectl get pod -l cluster.ray.io/component=operator
kubectl get crd rayclusters.cluster.ray.io
Execute sample job
kubectl -n ray create -f https://raw.githubusercontent.com/ray-project/ray/master/doc/kubernetes/job-example.yaml
Grant EKS access to ECR (if using a private registry)
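With a private registry, the EKS worker nodes' IAM role needs pull access to ECR. One way, assuming <node-instance-role> is your cluster's node IAM role name (AmazonEC2ContainerRegistryReadOnly is AWS's managed read-only ECR policy; clusters created with eksctl typically attach it already):

```shell
# allow the worker nodes to pull images from private ECR repositories
aws iam attach-role-policy \
  --role-name <node-instance-role> \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
```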
Test
Finally, you can run Ray EasyOCR:
kubectl -n ray create -f job.yaml
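job.yaml itself was embedded as a gist. A sketch of an equivalent Kubernetes Job, modeled on Ray's job-example.yaml, assuming the image pushed earlier and the head service name created by the Helm chart (the service name, job name, and /app/app.py path are assumptions; check kubectl -n ray get service for the actual head service):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ray-ocr-job
  namespace: ray
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: ray-ocr
        image: aws_account_id.dkr.ecr.region.amazonaws.com/rayocr:0.1
        command: ["python", "/app/app.py"]
        env:
        # address of the Ray head service inside the cluster
        - name: RAY_ADDRESS
          value: example-cluster-ray-head:10001
```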
I saw that the head node got a few tasks, and each of the two worker nodes got 4 processes (matching their number of CPUs).
AKS steps
Azure Cloud Shell already has the az CLI, docker, kubectl, and terraform installed, so use it for a quick start.
Create an Azure Cloud Shell
Create service principal
You can think of a service principal as a service account: Azure AD uses it to identify an application.
az ad sp create-for-rbac --skip-assignment
Then copy the corresponding appId and password, as described in the article below.
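The command prints the credentials as JSON along these lines (values replaced with placeholders):

```json
{
  "appId": "<appId>",
  "displayName": "azure-cli-...",
  "password": "<password>",
  "tenant": "<tenant-id>"
}
```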
Change aks-cluster.tf to the following code to provision AKS and ACR.
You can use the following commands to verify that AKS was created:
kubectl get nodes
kubectl get ns
Build a Docker image for ACR
echo FROM mcr.microsoft.com/hello-world > Dockerfile
az acr build --image sample/hello-world:v1 \
  --registry testacr1011 \
  --file Dockerfile .

kubectl create deployment --image=testacr1011.azurecr.io/sample/hello-world:v1 hello-app

# ray easyocr
az acr build --image rayeasyocr/rayeasyocr:0.1 --registry testacr1011 --file Dockerfile .
If AKS cannot pull images from ACR, attach AKS to your ACR:
az aks update -n <AKS cluster name> -g myResourceGroup --attach-acr <acr-name>