Hello world to Falcon 180B

4-bit quantization and multi-GPU

Xin Cheng
2 min read · Sep 13, 2023

TLDR

TII’s Falcon 180B has been released on Hugging Face with open access for research and commercial use. The pretrained model already performs better than Llama 2 70B, so it will be interesting to see how it does after fine-tuning. Loading it with 4-bit quantization takes two 80GB A100s, which raises the hardware bar; here we use an Azure Standard_NC48ads_A100_v4 VM (interestingly, AWS only offers 8×A100 instances, which is inflexible, while GCP offers more flexible configurations). Inference is 10+ times slower than Llama-2-13B, so performance and cost could be an issue for enterprise adoption.

Loading the model with 4-bit quantization requires about 95GB of GPU memory, so you need two 80GB A100s. Besides PyTorch and the transformers library, you need bitsandbytes and accelerate. You also need to accept the model’s license on Hugging Face to get access to the gated weights.
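As a rough setup sketch (the exact package versions and the login flow here are assumptions; adjust them to your environment):

# assumed environment setup:
#   pip install torch transformers accelerate bitsandbytes
from huggingface_hub import login

# log in with a Hugging Face token for an account that has accepted
# the Falcon 180B license, so the gated weights can be downloaded
login()

With access granted and the dependencies installed, below is the code for inference: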

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch

model_name = "tiiuae/falcon-180B-chat"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# use 4bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    # use accelerate to spread model across multiple GPUs
    device_map="auto",
    torch_dtype=torch.float16,
)
model.config.use_cache = False

pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipe(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    temperature=0.5,
    num_return_sequences=1,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=50,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

The community has also quantized the model with GPTQ, e.g. TheBloke/Falcon-180B-Chat-GPTQ. With this model you don’t need to pass an explicit 4-bit config to load it, but you do need transformers 4.33, optimum and auto-gptq installed. Below is the code (note that there is no explicit BitsAndBytesConfig; inference is also a bit faster than the bitsandbytes version above):

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "TheBloke/Falcon-180B-Chat-GPTQ"

# To use a different branch, change revision
# For example: revision="gptq-3bit-128g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    revision="main",
)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

prompt = "Tell me about AI"
prompt_template = f'''User: {prompt}
Assistant: '''

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=50,
    temperature=0.7,
    do_sample=True,
    top_p=0.95,
    repetition_penalty=1.15,
    device_map="auto",
)

print(pipe(prompt_template)[0]['generated_text'])
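To check the "a bit faster" claim on your own hardware, you can time both pipelines with plain wall-clock measurements. This is only a sketch, not a rigorous benchmark: it reuses the pipe and prompt_template from the GPTQ example above, and the number of runs is an arbitrary choice.

import time

def time_generation(pipe, prompt, runs=3):
    pipe(prompt)  # warm-up run so CUDA kernels and caches are initialized
    start = time.perf_counter()
    for _ in range(runs):
        pipe(prompt)
    return (time.perf_counter() - start) / runs

# time the GPTQ pipeline defined above (max_new_tokens=50 from its config)
print(f"~{time_generation(pipe, prompt_template):.1f} s per generation")

Running the same helper against the bitsandbytes pipeline from the first example gives a direct comparison on the same VM.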

Appendix

Falcon 180B

Top GPU capability by cloud providers

The max GPU memory on Azure is 320GB, versus 640GB on AWS and GCP.
