Huggingface Chat UI — your own ChatGPT part 3

Large language model serving and chatbot

Xin Cheng
3 min read · Oct 8, 2023

This is the third article in the series. In the first and second articles, we discussed how to deploy Huggingface Chat UI with a Huggingface-hosted model and how to serve custom models with Huggingface text generation inference. With those pieces in place, we can put Huggingface Chat UI in front of any custom LLM.

Huggingface Chat UI natively supports Huggingface text generation inference, which is exposed as an HTTP API endpoint. Below is what you need to run everything in a containerized environment.

Chat UI with text generation inference hosting a local LLM

.env.local

The endpoints field indicates the reachable URL of the Huggingface text generation inference server (here we assume it is hosted on a host called textgen).

The model deployed is Llama 2 13B chat, so the message tokens and preprompt need to be customized to its prompt format (a rough sketch of the resulting prompt follows the configuration below).

MONGODB_URL=mongodb://mongo-chatui:27017
MODELS=`[
  {
    "name": "local LLAMA 2 13b chat model",
    "endpoints": [{"url": "http://textgen:80"}],
    "description": "A good alternative to ChatGPT",
    "userMessageToken": "[INST]",
    "assistantMessageToken": "[/INST]",
    "messageEndToken": "</s>",
    "preprompt": "<s>[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.<</SYS>>\n\n[/INST]",
    "parameters": {
      "best_of": 1,
      "decoder_input_details": false,
      "details": false,
      "do_sample": true,
      "return_full_text": false,
      "seed": null,
      "stop": ["photographer"],
      "temperature": 0.1,
      "top_p": 0.95,
      "repetition_penalty": 1.2,
      "typical_p": 0.95,
      "watermark": true,
      "top_k": 10,
      "truncate": 1000,
      "max_new_tokens": 1024
    }
  }
]`
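For intuition, here is a rough sketch (not the actual Chat UI code) of how these tokens could combine into a Llama 2 style prompt. The exact string assembly happens inside Chat UI, so treat this only as an illustration of the format:

# Illustrative only: approximate prompt built from the preprompt,
# userMessageToken, assistantMessageToken and messageEndToken above.
# The real assembly is done by Chat UI and may differ in detail.
preprompt = "<s>[INST]<<SYS>>\nYou are a helpful, respectful and honest assistant...<</SYS>>\n\n[/INST]"
user_message = "What is the capital of France?"
prompt = preprompt + "[INST]" + user_message + "[/INST]"
# The model then generates the assistant reply and stops at the "</s>" end token.
print(prompt)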

docker-compose.yml

The Huggingface Chat UI repo contains a Dockerfile for building the container image. You can build it with the command below:

docker build -t chatui:0.1 -f Dockerfile .

The following docker-compose.yml exposes MongoDB on the mongo-chatui host and Huggingface text generation inference on the textgen host (the containers use these hostnames to communicate, so the same hostnames must be used consistently across the docker compose file and .env.local).

services:
  mongo-chatui:
    image: mongo:latest
    ports:
      - "27017:27017"
  chatui:
    image: chatui:0.1
    ports:
      - "<host port for chatui>:3000"
  textgen:
    image: ghcr.io/huggingface/text-generation-inference:1.0.3
    ports:
      - "<host port for textgen>:80"
    command: ["--model-id", "/data/models/Llama-2-13b-chat-hf", "--quantize", "bitsandbytes-nf4", "--trust-remote-code"]
    privileged: true
    shm_size: 1g
    volumes:
      - /home/azureuser/serving/text-generation-inference/data:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
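After bringing the stack up with docker compose up -d, you can sanity-check the text generation inference endpoint directly before pointing Chat UI at it. The sketch below assumes port 80 of the textgen container is published on host port 8080 (substitute your own <host port for textgen>) and uses the /generate route of the text generation inference REST API:

# Quick sanity check of the text generation inference endpoint.
# Assumes the textgen container's port 80 is mapped to host port 8080;
# other containers on the compose network would use http://textgen:80 instead.
import requests

payload = {
    "inputs": "[INST] What is the capital of France? [/INST]",
    "parameters": {"max_new_tokens": 64, "temperature": 0.1},
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["generated_text"])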

We also need to solve the cross-origin issue described in https://github.com/huggingface/chat-ui/issues/364.
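One possible workaround is to tell the SvelteKit server behind Chat UI its public URL (for example by setting the ORIGIN environment variable on the chatui container) so that requests from the browser pass the origin check; see the issue thread for the currently recommended fix.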

Result

Websearch

Wish

While it is helpful, I wish it more natively supported retrieval-augmented generation over private data (web-search-style retrieval over private documents, with citations), e.g. through integration with LangChain.

Appendix

The following explains, intuitively, the important sampling parameters for language models:

Temperature: language models break text down into tokens, predict the next token in the sequence, and mix in some randomness; repeating this process generates text. Temperature increases diversity but decreases quality by adding randomness to the model’s outputs (by not always picking the “best” token).
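To make this concrete, here is a tiny sketch (with hypothetical logits for three candidate tokens) of how temperature reshapes the next-token distribution before sampling:

# Illustrative only: temperature rescales next-token logits before sampling.
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]          # hypothetical scores for three candidate tokens
print(softmax(logits, 1.0))       # moderate spread
print(softmax(logits, 0.1))       # low temperature: almost always the top token
print(softmax(logits, 2.0))       # high temperature: more diverse, lower quality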

Top-k: with top-k sampling, the model filters out truly bad picks and only considers the k best options.

Top-p: in contrast to top-k, it only considers the top-ranked tokens whose combined probability reaches the threshold p, throwing out the rest. In practice, top-p sampling tends to give better results than top-k sampling.
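Below is a small sketch, with hypothetical token probabilities, of how top-k and top-p narrow the candidate pool before sampling:

# Illustrative only: filter a next-token distribution with top-k and top-p.
def top_k_filter(probs, k):
    # keep only the k most likely tokens
    return dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])

def top_p_filter(probs, p):
    # keep the smallest set of top tokens whose cumulative probability reaches p
    kept, cum = {}, 0.0
    for tok, pr in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = pr
        cum += pr
        if cum >= p:
            break
    return kept

probs = {"Paris": 0.55, "Lyon": 0.20, "London": 0.15, "banana": 0.07, "zzz": 0.03}
print(top_k_filter(probs, 2))    # {'Paris': 0.55, 'Lyon': 0.20}
print(top_p_filter(probs, 0.9))  # Paris + Lyon + London (cumulative 0.90)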

Frequency penalty: adds a penalty to a token for each time it has occurred in the text. This discourages repeated use of the same tokens/words/phrases and also has the side effect of causing the model to discuss more diverse subject matter and change topics more often.

Presence penalty: a flat penalty applied once a token has already occurred in the text; it discourages repetition less strongly than the frequency penalty.
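The sketch below contrasts the two, following the common OpenAI-style formulation (hypothetical numbers; note that the text generation inference parameters above use repetition_penalty rather than these two):

# Illustrative only: frequency penalty scales with how often a token appeared,
# presence penalty is a flat, one-time reduction once it has appeared at all.
def penalized_logit(logit, count, frequency_penalty=0.0, presence_penalty=0.0):
    return logit - frequency_penalty * count - presence_penalty * (1 if count > 0 else 0)

logit = 3.0
print(penalized_logit(logit, count=0, frequency_penalty=0.5, presence_penalty=0.5))  # 3.0
print(penalized_logit(logit, count=1, frequency_penalty=0.5, presence_penalty=0.5))  # 2.0
print(penalized_logit(logit, count=3, frequency_penalty=0.5, presence_penalty=0.5))  # 1.0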

