LLaMA-2 Fine-tuning
If you are in the Generative AI field like me, you must have heard of LLaMA, Alpaca, and Vicuna. They all derive from Facebook's LLaMA model, enabling you to create your own "ChatGPT". However, they all share one issue: the original LLaMA model was not released for commercial use, and neither were its descendants. The community has released lots of similar open-source LLM models over the last several months, but now LLaMA 2 finally arrives with a license that allows commercial use.
Naturally, I like to try new things when they come out (and these days there are too many things to try them all). There are lots of blogs and videos about LLaMA 2, but I still hit roadblocks during my journey, so I would like to document them so you can avoid them.
The most popular Transformer models are hosted on Hugging Face, so I decided to start from the article below.
This article has two sections you can try out quickly: using the Hugging Face transformers library for inference, and fine-tuning the base model.
Base model inference
Currently, llama-2 is not publicly downloadable from Hugging Face. You need to submit an access request for Meta's approval: after you log in to the Hugging Face portal, find the model and submit the access request. Then you can do as the article describes, log in with your access token using huggingface-cli, and run inference. It is straightforward.
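For reference, here is a minimal inference sketch, assuming you have been granted access to meta-llama/Llama-2-7b-hf and have already run huggingface-cli login with your access token:

```python
# Minimal sketch: run `huggingface-cli login` first with your access token.
# Assumes access to meta-llama/Llama-2-7b-hf has been approved.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 so the 7B model fits on a single GPU
    device_map="auto",
)

prompt = "Tell me who won the most medals in the Olympics."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```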
Fine-tuning
I encountered issues here. The article describes using the TRL (Transformer Reinforcement Learning) library's SFT (supervised fine-tuning) trainer, which is the first step of training an LLM. However, on my side I needed to fix a few things:
1. I needed to add lora_dropout to PEFT's LoraConfig in trl/examples/scripts/sft_trainer.py (see the sketch after this list).
2. Here is the environment that worked for me without problems:
- OS: Ubuntu 22.04 LTS
- CUDA driver: 12.1
- Python: 3.10
- PyTorch: there is no CUDA 12.1 package yet, so use the CUDA 11.8 build: pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
- Python packages: transformers==4.31.0, bitsandbytes==0.39.0 (version 0.41.0 has an issue with CUDA 12.1), and peft installed from source at https://github.com/huggingface/peft (the PyPI release caused a matrix multiplication issue)
3. The script assumes that the dataset has a column called "text" containing the full instruction text. If you write your own training script, you can do more preprocessing, tokenization, etc.
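Putting points 1 and 3 together, here is a minimal sketch of what the fixed training setup looks like. It is not the full example script; the dataset is the one the TRL example uses, and the LoRA hyperparameters (r, lora_alpha, the dropout value) are just illustrative:

```python
# Minimal sketch, assuming transformers==4.31.0, trl, peft installed from source,
# and a dataset that already has a "text" column.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "meta-llama/Llama-2-7b-hf"
# Example dataset with a ready-made "text" column
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,   # bitsandbytes 4-bit quantization to fit on one GPU
    device_map="auto",
)

peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,   # the field I had to add to the example script
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",   # the column the SFT trainer reads
    peft_config=peft_config,
    tokenizer=tokenizer,
    max_seq_length=512,
    args=TrainingArguments(
        output_dir="./llama2-sft",
        per_device_train_batch_size=4,
        num_train_epochs=1,
    ),
)
trainer.train()
```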
Dataset preparation
For simple instruction-tuning training text, there are generally three pieces:
- Instruction (e.g. tell me who won the most medals in the Olympics; summarize the following text)
- Context (e.g. the text you want the model to summarize)
- Response (the ideal response you want the model to generate)
It is easier for the model to learn, and easier for us to extract the response afterwards, if we separate the pieces with clear separators. Generally I use
### Instruction: <instruction> ### Context: <context> ### Response: <response> ### End
However, the following should also work
### Instruction: <instruction>\n\n### Context: <context>\n\n### Response: <response>\n\n### End
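To turn a dataset into the single "text" column the SFT trainer expects, you can map each row into this format. A minimal sketch, assuming hypothetical column names instruction, context, and response:

```python
# Sketch only: instruction/context/response are hypothetical column names;
# adjust to whatever your dataset actually uses.
def format_example(example):
    return {
        "text": (
            f"### Instruction: {example['instruction']} "
            f"### Context: {example['context']} "
            f"### Response: {example['response']} ### End"
        )
    }

# dataset = dataset.map(format_example)  # produces the "text" column used above
```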
As the article below notes, you can use whatever prompt format you like with the base model. For the chat model, however, it is better to follow the prompt format below:
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>
{{ user_message }} [/INST]
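Filling the template in code is just string formatting. A minimal sketch, with placeholder system and user messages:

```python
# Sketch of building a Llama-2 chat prompt from the template above.
system_prompt = "You are a helpful assistant."
user_message = "Summarize the following text: ..."

prompt = (
    "<s>[INST] <<SYS>>\n"
    f"{system_prompt}\n"
    "<</SYS>>\n"
    f"{user_message} [/INST]"
)
```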
Appendix
LLaMA 2 resources: playground, benchmark, prompting llama-2 chat, fine-tuning, deployment
Llama 2 is available in Azure Machine Learning and AWS SageMaker
Other YouTubers' videos on tuning LLaMA-2 using your own GPU