Hello world to Falcon 180B
TLDR
TII’s Falcon 180B has been released on Hugging Face (open access for research and commercial use; the pretrained model performs better than Llama 2 70B, so it will be interesting to see its performance after fine-tuning). It takes two 80GB A100 GPUs to load with 4-bit quantization (heavier hardware requirements; here we use Azure Standard_NC48ads_A100_v4. Interestingly, AWS only offers 8-A100 instances, which is inflexible, while GCP offers more flexible configurations). Inference is 10+ times slower than Llama-2-13B (performance/cost could be an issue for enterprise adoption).
Loading the model requires about 95GB of GPU memory, so you need two 80GB A100s. Besides PyTorch and the transformers library, you need bitsandbytes and accelerate. You also need to agree to the model’s terms on Hugging Face to get access first. Below is the code for inference:
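If you are starting from a fresh VM, the snippet below is a minimal setup sketch for the prerequisites (package names are the standard ones; the token string is a placeholder for the Hugging Face account that accepted the Falcon 180B license):
# Assumes the libraries are already installed, e.g.:
#   pip install torch transformers accelerate bitsandbytes
from huggingface_hub import login
import torch

# Authenticate so from_pretrained can download the gated Falcon 180B weights
# (replace the placeholder with your own token).
login(token="hf_your_token_here")

# Sanity check: both 80GB A100s should be visible before loading the model.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")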
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch
model_name = "tiiuae/falcon-180B-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# use 4bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    # use accelerate to spread model across multiple GPUs
    device_map="auto",
    torch_dtype=torch.float16,
)
# disable the KV cache (not required for inference; leaving it enabled speeds up generation)
model.config.use_cache = False
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, torch_dtype=torch.float16, device_map="auto")
# sample one completion of up to 50 new tokens
sequences = pipe(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    temperature=0.5,
    num_return_sequences=1,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=50,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
The community has also quantized the model with GPTQ, e.g. TheBloke/Falcon-180B-Chat-GPTQ. With that checkpoint, you don’t need to explicitly load it in 4-bit yourself. However, you need transformers 4.33+, optimum and auto-gptq installed. Below is the code (note that we no longer need BitsAndBytesConfig for 4-bit; inference is also a bit faster than the version above):
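Before running it, you can quickly confirm the extra dependencies are present; a minimal check using importlib.metadata (install anything missing with pip):
import importlib.metadata as md

# The GPTQ path relies on transformers >= 4.33 plus optimum and auto-gptq.
for pkg in ("transformers", "optimum", "auto-gptq"):
    try:
        print(f"{pkg}: {md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed (pip install {pkg})")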
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_name = "TheBloke/Falcon-180B-Chat-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-3bit-128g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    revision="main",
)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
prompt = "Tell me about AI"
prompt_template=f'''User: {prompt}
Assistant: '''
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=50,
    temperature=0.7,
    do_sample=True,
    top_p=0.95,
    repetition_penalty=1.15,
    device_map="auto",
)
print(pipe(prompt_template)[0]['generated_text'])
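To put a number behind “a bit faster” (and the 10+x gap versus Llama-2-13B mentioned in the TLDR), you can time generation directly; a rough sketch using wall-clock time on a single prompt, so treat the result as indicative rather than a benchmark:
import time

# Time one end-to-end generation call with the pipeline defined above.
start = time.perf_counter()
_ = pipe(prompt_template)[0]["generated_text"]
elapsed = time.perf_counter() - start

# max_new_tokens is 50 above, so this gives a rough tokens-per-second figure.
print(f"Generated in {elapsed:.1f}s (~{50 / elapsed:.1f} tokens/s)")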
Appendix
Falcon 180B
Top GPU capability by cloud providers
The maximum GPU memory on Azure is 320GB, while AWS and GCP go up to 640GB.