CodeLlama Fine-tuning
CodeLlama is a Llama 2-based model trained on code.
Inference
7b foundation model
Code completion
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
model_id = "codellama/CodeLlama-7b-hf"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
prompt = 'def remove_non_ascii(s: str) -> str:\n    """ '
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(
inputs["input_ids"],
max_new_tokens=200,
do_sample=True,
top_p=0.9,
temperature=0.1,
)
output = output[0].to("cpu")
print(tokenizer.decode(output))
Result
<s> def remove_non_ascii(s: str) -> str:
    """
    Remove non-ASCII characters from a string.
    """
    return "".join(i for i in s if ord(i) < 128)

def remove_non_ascii_and_punctuation(s: str) -> str:
    """
    Remove non-ASCII characters and punctuation from a string.
    """
    return "".join(i for i in s if ord(i) < 128 and not i in string.punctuation)

def remove_non_ascii_and_punctuation_and_whitespace(s: str) -> str:
    """
    Remove non-ASCII characters, punctuation, and whitespace from a string.
    """
    return "".join(i for i in s if ord(i) < 128 and not
Conversation (doesn't work; needs the instruction-tuned model)
Bash task
prompt = 'In Bash, how do I list all text files in the current directory (excluding subdirectories) that have been modified in the last month?'
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(
inputs["input_ids"],
max_new_tokens=200,
do_sample=True,
top_p=0.9,
temperature=0.1,
)
output = output[0].to("cpu")
print(tokenizer.decode(output))
Result
<s> In Bash, how do I list all text files in the current directory (excluding subdirectories) that have been modified in the last month?
Posted by Bill Karwin (bkarwin) on 2007-08-22T19:05:00.000+0000
Assuming you have GNU find, you can use the -mtime option to find files modified in the last month.
Posted by Bill Karwin (bkarwin) on 2007-08-22T19:06:15.000+0000
I'm not sure what you mean by "list all text files". Do you mean list the names of the files?
Posted by Bill Karwin (bkarwin) on 2007-08-22T19:07:00.000+0000
Python task
system = "Provide answers in Python"
user = "Write a function that computes the set of sums of all contiguous sublists of a given list."
prompt = f"<s>[INST] <<SYS>>\\n{system}\\n<</SYS>>\\n\\n{user}[/INST]"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
output = model.generate(
inputs["input_ids"],
max_new_tokens=200,
do_sample=True,
top_p=0.9,
temperature=0.1,
)
output = output[0].to("cpu")
print(tokenizer.decode(output))
Result
<s> [INST] <<SYS>>\nProvide answers in Python\n<</SYS>>\n\nWrite a function that computes the set of sums of all contiguous sublists of a given list.[/INST]
[INST]<<SYS>>\nProvide answers in Python\n<</SYS>>\n\nWrite a function that computes the set of sums of all contiguous sublists of a given list.[/INST]
[INST]<<SYS>>\nProvide answers in Python\n<</SYS>>\n\nWrite a function that computes the set of sums of all contiguous sublists of a given list.[/INST]
[INST]<<SYS>>\nProvide answers in Python\n<</SYS>>\n\nWrite a function that computes the set of sums of all contiguous sublists of a given list.[/INST]
[INST]<<SYS>>\nProvide answers in Python\n<</SYS>>\n\nWrite a function that computes the set of sums of all cont
7b instruction-tuned model
Code completion
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
model_id = "codellama/CodeLlama-7b-Instruct-hf"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
prompt = 'def remove_non_ascii(s: str) -> str:\n    """ '
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(
inputs["input_ids"],
max_new_tokens=200,
do_sample=True,
top_p=0.9,
temperature=0.1,
)
output = output[0].to("cpu")
print(tokenizer.decode(output))
Result
<s> def remove_non_ascii(s: str) -> str:
    """
    Remove non-ASCII characters from a string.
    Args:
        s (str): The string to remove non-ASCII characters from.
    Returns:
        str: The string with non-ASCII characters removed.
    """
    return "".join(c for c in s if ord(c) < 128)

def remove_non_ascii_from_list(l: list) -> list:
    """
    Remove non-ASCII characters from a list of strings.
    Args:
        l (list): The list of strings to remove non-ASCII characters from.
    Returns:
        list: The list of strings with non-ASCII characters removed.
    """
    return [remove_non_ascii(s) for s in l]

def remove_non_ascii_from_
Python task
system = "Provide answers in Python"
user = "Write a function that computes the set of sums of all contiguous sublists of a given list."
prompt = f"<s>[INST] <<SYS>>\\n{system}\\n<</SYS>>\\n\\n{user}[/INST]"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
output = model.generate(
inputs["input_ids"],
max_new_tokens=200,
do_sample=True,
top_p=0.9,
temperature=0.1,
)
output = output[0].to("cpu")
print(tokenizer.decode(output))
Result
<s> [INST] <<SYS>>\nProvide answers in Python\n<</SYS>>\n\nWrite a function that computes the set of sums of all contiguous sublists of a given list.[/INST] ```
def compute_sums(my_list):
    return [sum(my_list[i:j]) for i in range(len(my_list)) for j in range(i+1, len(my_list)+1)]
```
This function uses list comprehension to iterate over the indices of the input list, and for each index `i`, it computes the sum of all sublists of length `j` starting from index `i`. The resulting list of sums is returned by the function.
For example, if the input list is `[1, 2, 3, 4, 5]`, the function will return `[1, 3, 6, 10, 15]`.
Note that this function assumes that the input list is a flat list, i.e. it does not contain any nested lists. If the input list contains nested lists, the function may not
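As a quick sanity check, the generated compute_sums works: for [1, 2, 3] the contiguous sublists are [1], [1, 2], [1, 2, 3], [2], [2, 3], [3]. It returns a list rather than the set the task asked for, so wrap it in set():

# Sanity check of the generated function above.
print(compute_sums([1, 2, 3]))       # [1, 3, 6, 2, 5, 3]
print(set(compute_sums([1, 2, 3])))  # {1, 2, 3, 5, 6}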
Python Fibonacci task
system = "Provide answers in Python"
user = "Write a function that computes fibonacci series"
prompt = f"<s>[INST] <<SYS>>\\n{system}\\n<</SYS>>\\n\\n{user}[/INST]"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
output = model.generate(
inputs["input_ids"],
max_new_tokens=200,
do_sample=True,
top_p=0.9,
temperature=0.1,
)
output = output[0].to("cpu")
print(tokenizer.decode(output))
Result
<s> [INST] <<SYS>>\nProvide answers in Python\n<</SYS>>\n\nWrite a function that computes fibonacci series[/INST] Here is a function that computes the Fibonacci series:
```
def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)
```
This function uses a recursive approach to compute the Fibonacci series. It takes an integer `n` as input and returns the `n`-th Fibonacci number. The function first checks if `n` is less than or equal to 1. If it is, the function returns `n`. Otherwise, it calls itself twice with `n-1` and `n-2` as arguments, and then adds the two results together to get the final answer.
For example, if we call the function with `n=5`, it will compute the Fibonacci series as follows:
```
fibonacci(
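The recursive answer is correct but takes exponential time. For reference, an iterative version (our addition, not model output) computes the same series in O(n):

def fibonacci_iter(n):
    # Iterative Fibonacci: same values as the recursive version above, O(n) time.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

assert [fibonacci_iter(i) for i in range(7)] == [0, 1, 1, 2, 3, 5, 8]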
Javascript task
system = "Provide answers in Javascript"
user = "Write a function that computes the set of sums of all contiguous sublists of a given list."
prompt = f"<s>[INST] <<SYS>>\\n{system}\\n<</SYS>>\\n\\n{user}[/INST]"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
output = model.generate(
inputs["input_ids"],
max_new_tokens=200,
do_sample=True,
top_p=0.9,
temperature=0.1,
)
output = output[0].to("cpu")
print(tokenizer.decode(output))
Result
<s> [INST] <<SYS>>\nProvide answers in Javascript\n<</SYS>>\n\nWrite a function that computes the set of sums of all contiguous sublists of a given list.[/INST] ```
function computeSums(list) {
  let sums = [];
  for (let i = 0; i < list.length; i++) {
    let sum = 0;
    for (let j = i; j < list.length; j++) {
      sum += list[j];
    }
    sums.push(sum);
  }
  return sums;
}
```
This function takes a list as input and returns a list of all the sums of contiguous sublists of the input list.
For example, if the input list is `[1, 2, 3, 4, 5]`, the output list would be `[15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3
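Unlike the Python answer earlier, this JavaScript version only sums the suffix starting at each index i, so it misses most contiguous sublists (and the example output it quotes does not even match its own code). A quick Python transcription (ours) makes that visible:

# Python transcription of the generated JavaScript: for each i it sums
# list[i:], i.e. only suffixes, not every contiguous sublist.
def compute_sums_js(lst):
    return [sum(lst[i:]) for i in range(len(lst))]

print(compute_sums_js([1, 2, 3, 4, 5]))  # [15, 14, 12, 9, 5]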
Fine-tuning
This notebook fine-tunes codellama-7b on the "b-mc2/sql-create-context" dataset, which maps a question plus a context to SQL code. Example below:
question
How many heads of the departments are older than 56 ?
context
CREATE TABLE head (age INTEGER)
answer
SELECT COUNT(*) FROM head WHERE age > 56
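The dataset can be pulled straight from the Hugging Face Hub with the datasets library; a minimal sketch to load it and inspect one row:

from datasets import load_dataset

# Each row has question, context, and answer fields as in the example above.
dataset = load_dataset("b-mc2/sql-create-context", split="train")
print(dataset[0])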
We need to install the latest version of peft, so change the following lines:
!pip install -U git+https://github.com/huggingface/peft.git
# import locale # colab workaround
# locale.getpreferredencoding = lambda: "UTF-8" # colab workaround
The notebook prepares the data in the following format:
### Input:
{data_point["question"]}
### Context:
{data_point["context"]}
### Response:
{data_point["answer"]}
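A minimal sketch of that formatting step (the format_prompt name is ours and the exact whitespace may differ from the notebook's):

def format_prompt(data_point):
    # Render one dataset row into the training-prompt template above.
    return (
        "### Input:\n"
        f"{data_point['question']}\n"
        "### Context:\n"
        f"{data_point['context']}\n"
        "### Response:\n"
        f"{data_point['answer']}"
    )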
Use 4-bit quantization to load the model (instead of the 8-bit used by the article):
base_model = "codellama/CodeLlama-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    # load_in_8bit=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model)
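From here the notebook attaches a LoRA adapter with peft and trains only the adapter weights. A minimal sketch, assuming common r/alpha/target_modules choices for Llama-style models (the notebook's exact values may differ):

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit model for training and attach a small LoRA adapter.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train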