QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA, or Quantized Low Rank Adapters, is a new approach to fine-tuning large language models (LLMs) that uses less memory while maintaining speed. QLoRA works by first quantizing the LLM to 4-bits, reducing the model’s memory footprint significantly. The quantized LLM is then fine-tuned using the Low Rank Adapters (LoRA) approach. LoRA enables the refined model to preserve the majority of the accuracy of the original LLM while being significantly smaller and quicker.
QLoRA is based on the observation that most of the information in a large language model is contained in its weights, and that those weights can be approximated at lower precision without affecting the model's accuracy much. QLoRA quantizes the LLM weights to 4 bits, reducing the model's memory footprint by roughly 8x compared with 32-bit weights. The quantized LLM is then fine-tuned using Low Rank Adapters (LoRA), so only a small set of adapter parameters is trained while the quantized base weights remain frozen.
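To make the memory savings concrete, here is a quick back-of-envelope sketch in Python (illustrative only; the 7-billion-parameter figure is just an example, not a specific model from the paper):

params = 7e9  # a hypothetical 7B-parameter model
print(f"fp32 : {params * 4 / 1e9:.0f} GB")    # 32-bit weights, 4 bytes per parameter
print(f"fp16 : {params * 2 / 1e9:.0f} GB")    # 16-bit weights, 2 bytes per parameter
print(f"4-bit: {params * 0.5 / 1e9:.1f} GB")  # quantized weights, 0.5 bytes per parameter

For a 7B model this works out to roughly 28 GB in fp32, 14 GB in fp16, and about 3.5 GB once quantized to 4 bits, before counting activations and optimizer state.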
QLoRA has been demonstrated to be effective on a wide range of tasks, including text classification, question answering, and natural language generation. It is an exciting new way of fine-tuning LLMs that has the potential to make LLMs more accessible to a wider range of users and applications.
Fine-tuning a GPT model with QLoRA
Hardware Requirements for QLoRA:
- GPU: For models with fewer than 20 billion parameters, such as GPT-J, a GPU with at least 12 GB of VRAM is suggested. An RTX 3060 12 GB GPU, for example, can be utilized. If you have a bigger GPU with 24 GB of VRAM, you can use a model with 20 billion parameters, such as GPT-NeoX-20b.
- RAM: At least 6 GB of RAM is suggested. Most current computers meet this criterion.
- Hard Drive: Because GPT-J and GPT-NeoX-20b are large models, you should have at least 80 GB of free space on your hard drive.
If your system does not satisfy these criteria, you can utilize Google Colab’s free instance instead.
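If you want to verify your machine quickly, a small check such as the following can help (a minimal sketch using PyTorch and the standard library):

import shutil
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA GPU detected")
print(f"Free disk space: {shutil.disk_usage('.').free / 1e9:.1f} GB")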
Software Requirements for QLoRA:
- CUDA: Make sure CUDA is installed on your machine.
- Dependencies:
- bitsandbytes: This library contains all the necessary tools for quantizing a large language model (LLM).
- Hugging Face Transformers and Accelerate: These standard libraries are used to load models from the Hugging Face Hub and train them efficiently.
- PEFT: This library provides implementations of various methods for fine-tuning a small number of extra model parameters. It is required for LoRA.
- Datasets: Although not mandatory, the Datasets library can be used to obtain a dataset for fine-tuning. Alternatively, you can provide your own dataset.
Ensure that all the required software dependencies are installed before proceeding with QLoRA-based fine-tuning of GPT models.
QLoRA Demo
Guanaco is a system intended solely for research purposes and may produce problematic outputs.
- You may see a live demo here. Please keep in mind that this is the 33B model; the 65B model demo will follow later.
- With this notebook, you may also host your own Guanaco gradio demo. For 7B and 13B models, it works with free GPUs.
- Can you tell the difference between ChatGPT and Guanaco? Give it a go! You can access the Colab here, which compares model responses from ChatGPT and Guanaco 65B on Vicuna prompts.
How to install QLoRA
To load models in 4-bit with transformers and bitsandbytes, you must install accelerate and transformers from source and have the latest version of the bitsandbytes library (0.39.0) installed. You can achieve this with the following commands:
pip install -q -U bitsandbytes
pip install -q -U git+https://github.com/huggingface/transformers.git
pip install -q -U git+https://github.com/huggingface/peft.git
pip install -q -U git+https://github.com/huggingface/accelerate.git
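After installation, a quick sanity check confirms the libraries import correctly (a minimal sketch; it assumes each of these packages exposes a __version__ attribute):

import bitsandbytes, transformers, peft, accelerate

print("bitsandbytes:", bitsandbytes.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("accelerate:", accelerate.__version__)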
Getting Started
The qlora.py script can be used for fine-tuning and inference on various datasets. The following is a basic command for fine-tuning a baseline model on the Alpaca dataset:
python qlora.py --model_name_or_path <path_or_name>
For models larger than 13B, we recommend adjusting the learning rate:
python qlora.py --learning_rate 0.0001 --model_name_or_path <path_or_name>
Quantization
Quantization parameters are controlled through the BitsAndBytesConfig as follows:
- Loading in 4 bits is enabled via load_in_4bit.
- The datatype used for the linear layer computations is set via bnb_4bit_compute_dtype.
- Nested (double) quantization is enabled via bnb_4bit_use_double_quant.
- bnb_4bit_quant_type specifies the datatype used for quantization. Two quantization datatypes are supported: fp4 (four-bit float) and nf4 (normal four-bit float). We recommend nf4, since it is theoretically optimal for normally distributed weights.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# max_memory maps each device to the memory it may use, e.g. {0: '12GB', 'cpu': '30GB'}
model = AutoModelForCausalLM.from_pretrained(
    '/name/or/path/to/your/model',
    load_in_4bit=True,
    device_map='auto',
    max_memory=max_memory,
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                      # quantize the weights to 4 bits
        bnb_4bit_compute_dtype=torch.bfloat16,  # datatype for linear layer computations
        bnb_4bit_use_double_quant=True,         # nested (double) quantization
        bnb_4bit_quant_type='nf4'               # NormalFloat4 quantization datatype
    ),
)
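The quantized model is then wrapped with LoRA adapters before training. The snippet below is a minimal sketch using the PEFT library (assuming a recent version that provides prepare_model_for_kbit_training); the rank, alpha, and target module names (q_proj, v_proj) are illustrative values for a LLaMA-style model, not settings prescribed by QLoRA:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)  # prepare the 4-bit model for training
lora_config = LoraConfig(
    r=64,                                 # adapter rank (illustrative)
    lora_alpha=16,                        # LoRA scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

With this setup, only the low-rank adapter matrices receive gradients, which is what keeps the memory cost of fine-tuning so low.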
Paged Optimizer
You can access the paged optimizer with the following argument:
--optim paged_adamw_32bit
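If you use the Hugging Face Trainer directly instead of qlora.py, the same optimizer can be selected through TrainingArguments. The values below are an illustrative sketch, not recommended settings:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='qlora-output',      # hypothetical output directory
    optim='paged_adamw_32bit',      # paged AdamW to smooth out GPU memory spikes
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    bf16=True,
)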
Limitations
- Inference using 4 bits is slow. The 4-bit inference path is not yet integrated with 4-bit matrix multiplication.
- Resuming a LoRA training run using the Trainer presently fails.
- Using bnb_4bit_compute_dtype='fp16' can currently cause instability. For 7B LLaMA, only 80% of fine-tuning runs complete without issues. We have solutions, but they have not yet been integrated into bitsandbytes.
- To avoid generation issues, make sure tokenizer.bos_token_id is set to 1.
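For the last point, a minimal sketch of the workaround (the model path is a placeholder, as above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('/name/or/path/to/your/model')
tokenizer.bos_token_id = 1  # avoid the generation issues noted in the limitations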
This article was written to help you learn about QLoRA: Efficient Finetuning of Quantized LLMs. We hope it has been helpful. Please feel free to share your thoughts and feedback in the comments section below.