Huggingface Chat UI — your own ChatGPT part 2

Large language model serving and chatbot

Xin Cheng
3 min read · Sep 18, 2023

In the previous article, we deployed Huggingface Chat UI with models hosted on Huggingface. But what if the model you want to deploy is not on Huggingface, e.g. a locally fine-tuned model, or you want to apply custom logic to the model input and output? Fortunately, Chat UI also supports models served with Huggingface text generation inference, or even models behind a custom API.
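
For orientation, Chat UI is pointed at such an endpoint through the MODELS variable in its .env.local file. The sketch below assumes text generation inference is listening locally on port 8080 and uses an arbitrary display name; treat the field names as illustrative, since the exact endpoint schema (for example whether a "type" field is expected) varies between Chat UI versions, so cross-check against the configuration you used in part 1.

# .env.local (sketch): point Chat UI at a local text generation inference endpoint
MODELS=`[
  {
    "name": "my-local-model",
    "endpoints": [{ "url": "http://127.0.0.1:8080" }]
  }
]`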

Text generation inference

Text generation inference is a Rust, Python and gRPC server for text generation inference, used in production at Huggingface to power Hugging Chat, the Inference API and Inference Endpoints. It can serve a model hosted on Huggingface or a model stored on a local path, and it offers other features (most importantly quantization and token streaming) for a range of supported models.

Model on Huggingface

Below is the script to serve the “tiiuae/falcon-7b-instruct” model. You need to provide a volume path where the downloaded weights are stored locally (the same volume can also be used to serve a local model).

model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model
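
If the model you want to serve is gated on Huggingface (for example the meta-llama models used later in this article), the container also needs your Huggingface access token. A minimal sketch, with the token value as a placeholder you must fill in:

# pass a Huggingface access token for gated models
token=<your Huggingface access token>
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data -e HUGGING_FACE_HUB_TOKEN=$token ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model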

Client

curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
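
Token streaming, one of the features mentioned above, is exposed on the generate_stream route, which returns tokens as server-sent events. A minimal sketch with the same payload:

curl 127.0.0.1:8080/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'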

Local model

Below is the script to serve a local model. We copy the downloaded Llama-2-13b-chat model to a local folder, map the local folder “~/serving/text-generation-inference/data” to “/data” in the container, and the container can then access the model through the “/data/models/Llama-2-13b-chat-hf” folder. This way you can also serve a locally fine-tuned model (you need to provide the complete model; LoRA adapters are not supported natively, so you have to merge the adapter into the base model first).

mkdir -p ~/serving/text-generation-inference/data/models/Llama-2-13b-chat-hf
cp -rL ~/.cache/huggingface/hub/models--meta-llama--Llama-2-13b-chat-hf/snapshots/<snapshot id>/*.* ~/serving/text-generation-inference/data/models/Llama-2-13b-chat-hf
cd ~/serving/text-generation-inference
# model=meta-llama/Llama-2-13b-chat-hf
model=/data/models/Llama-2-13b-chat-hf
volume=~/serving/text-generation-inference/data # share a volume with the Docker container to avoid downloading weights every run
# token=

docker run --gpus all --shm-size 1g --rm -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model --quantize bitsandbytes-nf4 --trust-remote-code
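
Once the container is up, you can check that it picked up the local weights by querying the info route, which reports the loaded model id among other details (the exact response fields depend on the text generation inference version):

curl 127.0.0.1:8080/info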

Protect text generation inference with a reverse proxy

Text generation inference does not provide authentication by default. However, you can put a reverse proxy in front of it to add basic authentication or OAuth authentication. A well-known reverse proxy is nginx.

Basic authentication

The idea is that text generation inference is hidden behind a boundary that outside users cannot reach directly (e.g. a port or IP that is not publicly reachable; see the localhost-only binding variant after the script below), and nginx is then used to forward authenticated requests to text generation inference.

Run text generation inference on a protected port (e.g. 9080)

model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 9080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model
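
One simple way to keep port 9080 unreachable from outside the host (so that only the local nginx can talk to it) is to publish the container port on the loopback interface only, as in this variant of the command above; alternatively, block the port with a firewall rule.

# publish the container port on localhost only
docker run --gpus all --shm-size 1g -p 127.0.0.1:9080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model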

Install nginx and create a user

sudo apt update -y && sudo apt install nginx -y
sudo apt install apache2-utils -y
sudo htpasswd -c /etc/nginx/.htpasswd <username>

Create a forwarding rule (listening on port 8080 and forwarding to port 9080)

sudo vi /etc/nginx/conf.d/mycontainer.conf
server {
    listen 8080;
    server_name <server name>;

    location / {
        proxy_pass http://localhost:9080/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Basic authentication with username and password
        auth_basic "Restricted Access";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }
}
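
After saving the configuration, validate it and reload nginx so the rule takes effect:

sudo nginx -t
sudo systemctl reload nginx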

Client

curl <ip>:8080/generate \
-X POST \
-u <username>:<password> \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
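
curl’s -u flag simply sends an “Authorization: Basic” header containing the base64-encoded username:password pair. A client that cannot take a username and password directly (for example a Chat UI endpoint configuration with an authorization field, if your Chat UI version supports one) can send the same header, which you can construct by hand:

# build the Basic auth header value (equivalent to curl -u <username>:<password>)
auth=$(echo -n '<username>:<password>' | base64)

curl <ip>:8080/generate \
-X POST \
-H "Authorization: Basic $auth" \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'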

Appendix

Nginx OAuth proxy
