Huggingface Chat UI — your own ChatGPT part 2
In the previous article, we deployed Hugging Face Chat UI with models hosted on the Hugging Face Hub. But what if the model you want to deploy is not on the Hub (e.g. a locally fine-tuned model), or you want to apply custom logic to the model's input and output? Fortunately, Chat UI also supports models served with Hugging Face text generation inference, or even models behind a custom API.
Text generation inference
Text generation inference (TGI) is a Rust, Python and gRPC server for text generation inference, used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints. It can serve a model from the Hugging Face Hub or a model stored on a local path, and it offers features such as quantization and token streaming (see its documentation for the full feature list and the supported models).
Model on Huggingface
Below is the script to serve the “tiiuae/falcon-7b-instruct” model. You need to provide a volume path where the downloaded weights are stored locally (the same volume can also be used to serve a local model):
model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model
Client
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
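TGI also exposes a streaming endpoint that returns tokens as server-sent events, which is what Chat UI relies on for token streaming. The same payload works against it; only the path changes:
curl 127.0.0.1:8080/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'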
Local model
Below is the script to serve a local model. We copy the downloaded Llama-2-13b-chat model into a local folder, map the local folder “~/serving/text-generation-inference/data” to “/data” inside the container, and then, from the container, the model is reachable under “/data/models/Llama-2-13b-chat-hf”. In the same way you can serve a locally fine-tuned model, but you need to provide the complete model: TGI does not load a LoRA adapter natively, so you have to merge the adapter into the base model first (e.g. with PEFT's merge_and_unload).
mkdir -p ~/serving/text-generation-inference/data/models/Llama-2-13b-chat-hf
cp -rL ~/.cache/huggingface/hub/models--meta-llama--Llama-2-13b-chat-hf/snapshots/<snapshot id>/*.* ~/serving/text-generation-inference/data/models/Llama-2-13b-chat-hf
cd ~/serving/text-generation-inference
# to serve this gated model directly from the Hub instead, uncomment the line below and set a token
# model=meta-llama/Llama-2-13b-chat-hf
model=/data/models/Llama-2-13b-chat-hf
volume=~/serving/text-generation-inference/data # share a volume with the Docker container to avoid downloading weights every run
# token= # only needed when pulling a gated model from the Hub; pass it to the container with -e HUGGING_FACE_HUB_TOKEN=$token
docker run --gpus all --shm-size 1g --rm -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model --quantize bitsandbytes-nf4 --trust-remote-code
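Once the container is up, you can check that TGI loaded the local copy rather than downloading from the Hub by querying its info endpoint; the response should report the local path as the model id:
curl 127.0.0.1:8080/info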
Protect text generation inference with a reverse proxy
Text generation inference does not provide authentication by itself. However, you can put a reverse proxy in front of it to add basic authentication or OAuth authentication. A popular choice of reverse proxy is nginx.
Basic authentication
The idea is to hide text generation inference behind a boundary that outside users cannot reach directly (e.g. a port bound to localhost or blocked by a firewall), and to let nginx authenticate requests and forward them to text generation inference.
Run text generation inference on a protected port (e.g. 9080)
model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
# binding to 127.0.0.1 keeps port 9080 reachable only from the host itself
docker run --gpus all --shm-size 1g -p 127.0.0.1:9080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model
Install nginx and create a user
sudo apt update -y && sudo apt install nginx -y
sudo apt install apache2-utils -y # provides the htpasswd utility
sudo htpasswd -c /etc/nginx/.htpasswd <username> # -c creates the file; omit it when adding further users
Create a forwarding rule (expose port 8080 and forward it to port 9080)
sudo vi /etc/nginx/conf.d/mycontainer.conf
server {
    listen 8080;
    server_name <server name>;

    location / {
        proxy_pass http://localhost:9080/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Basic authentication with username and password
        auth_basic "Restricted Access";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }
}
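After saving the file, validate the configuration and reload nginx (assuming the systemd service installed by the apt package above):
sudo nginx -t
sudo systemctl reload nginx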
Client
curl <ip>:8080/generate \
-X POST \
-u <username>:<password> \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
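To make Chat UI talk to this protected endpoint, the model entry in .env.local (set up in part 1) can point to the proxy and carry the basic-auth credential. The snippet below is a sketch assuming the “endpoints” key of the MODELS variable; double-check the exact schema against the Chat UI version you deployed. First encode the credential:
echo -n '<username>:<password>' | base64
Then add the endpoint with its authorization value to the model entry:
"endpoints": [
  {
    "url": "http://<ip>:8080",
    "authorization": "Basic <output of the base64 command above>"
  }
]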
Appendix
Nginx oauth proxy
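One common way to add OAuth is to run oauth2-proxy next to nginx and gate the proxied location with nginx's auth_request directive. Below is a minimal sketch, not a drop-in config: it assumes oauth2-proxy is installed and registered as a GitHub OAuth app, and the client id, client secret and cookie secret are placeholders you must fill in.
# run oauth2-proxy on a local port
oauth2-proxy \
  --provider=github \
  --client-id=<client id> \
  --client-secret=<client secret> \
  --cookie-secret=<cookie secret> \
  --email-domain='*' \
  --http-address=127.0.0.1:4180 \
  --upstream=http://127.0.0.1:9080 \
  --reverse-proxy=true
In the nginx server block, replace the auth_basic lines with an auth_request check against oauth2-proxy:
location /oauth2/ {
    proxy_pass http://127.0.0.1:4180;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}

location = /oauth2/auth {
    proxy_pass http://127.0.0.1:4180;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    # auth_request subrequests must not carry a body
    proxy_set_header Content-Length "";
    proxy_pass_request_body off;
}

location / {
    auth_request /oauth2/auth;
    error_page 401 = /oauth2/sign_in;

    proxy_pass http://localhost:9080/;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}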