Hello world to Falcon 180B

4-bit quantization and multi-GPU

Xin Cheng
2 min read · Sep 13, 2023

TLDR

TII’s Falcon 180B has been released on Hugging Face with open access for research and commercial use. The pretrained model already performs better than Llama 2 70B, so it will be interesting to see how it does after fine-tuning. Loading it with 4-bit quantization takes two 80GB A100s, which raises the hardware bar; here we use an Azure Standard_NC48ads_A100_v4 VM (interestingly, AWS only offers 8×A100 instances, which is inflexible, while GCP offers more flexible configurations). Inference is 10+ times slower than Llama-2-13B, so performance and cost could be an issue for enterprise adoption.

Loading the model with 4-bit quantization requires about 95GB of GPU memory, so you need two 80GB A100s. Besides PyTorch and the transformers library, you need bitsandbytes and accelerate. You also need to accept the model’s license on Hugging Face to get access to the gated weights.
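As a rough setup sketch (the exact package versions and the login flow here are assumptions; adjust them to your environment):

# assumed environment setup:
#   pip install torch transformers accelerate bitsandbytes
from huggingface_hub import login

# log in with a Hugging Face token for an account that has accepted
# the Falcon 180B license, so the gated weights can be downloaded
login()

With access granted and the dependencies installed, below is the code for inference: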

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch

model_name = "tiiuae/falcon-180B-chat"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# use 4bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    # use accelerate to spread model across multiple GPUs
    device_map="auto",
    torch_dtype=torch.float16,
)
model.config.use_cache = False

pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipe(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    temperature=0.5,
    num_return_sequences=1,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=50,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

The community has also quantized the model with GPTQ, e.g. TheBloke/Falcon-180B-Chat-GPTQ. With this model you don’t need to pass an explicit 4-bit config to load it, but you do need transformers 4.33, optimum and auto-gptq installed. Below is the code (note that there is no explicit BitsAndBytesConfig; inference is also a bit faster than the bitsandbytes version above):

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "TheBloke/Falcon-180B-Chat-GPTQ"

# To use a different branch, change revision
# For example: revision="gptq-3bit-128g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    revision="main",
)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

prompt = "Tell me about AI"
prompt_template = f'''User: {prompt}
Assistant: '''

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=50,
    temperature=0.7,
    do_sample=True,
    top_p=0.95,
    repetition_penalty=1.15,
    device_map="auto",
)

print(pipe(prompt_template)[0]['generated_text'])
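To check the "a bit faster" claim on your own hardware, you can time both pipelines with plain wall-clock measurements. This is only a sketch, not a rigorous benchmark: it reuses the pipe and prompt_template from the GPTQ example above, and the number of runs is an arbitrary choice.

import time

def time_generation(pipe, prompt, runs=3):
    pipe(prompt)  # warm-up run so CUDA kernels and caches are initialized
    start = time.perf_counter()
    for _ in range(runs):
        pipe(prompt)
    return (time.perf_counter() - start) / runs

# time the GPTQ pipeline defined above (max_new_tokens=50 from its config)
print(f"~{time_generation(pipe, prompt_template):.1f} s per generation")

Running the same helper against the bitsandbytes pipeline from the first example gives a direct comparison on the same VM.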

Appendix

Falcon 180B

Top GPU capability by cloud providers

The max GPU memory on Azure is 320GB, versus 640GB on AWS and GCP.
