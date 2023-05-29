QLoRA is a new approach to fine-tuning large language models (LLMs) that cuts memory use without sacrificing speed. It first quantizes the LLM to 4 bits, which greatly reduces the model’s memory footprint, and then fine-tunes the quantized LLM using the Low Rank Adapters (LoRA) approach. LoRA lets the fine-tuned model preserve most of the original LLM’s accuracy while needing far less memory.

QLoRA: Efficient Finetuning of Quantized LLMs

QLoRA, or Quantized Low Rank Adapters, is a new approach to fine-tuning large language models (LLMs) that uses less memory while maintaining speed. It first quantizes the LLM to 4 bits, which significantly reduces the model’s memory footprint. The quantized LLM, whose weights stay frozen, is then fine-tuned using the Low Rank Adapters (LoRA) approach: only a small set of added adapter weights is trained. This lets the fine-tuned model preserve most of the accuracy of the original LLM while requiring far less memory to train.
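To see why training only adapters is cheap, it helps to count parameters. The sketch below assumes the standard LoRA formulation, in which a frozen weight matrix W of shape d_out x d_in is augmented by a trainable product B·A, with A of shape rank x d_in and B of shape d_out x rank. The layer size and rank are hypothetical values chosen for illustration, not settings taken from QLoRA.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple:
    """Return (full, adapter) trainable-parameter counts for one linear layer."""
    full = d_out * d_in                # training the whole weight matrix W
    adapter = rank * (d_in + d_out)    # A is rank x d_in, B is d_out x rank
    return full, adapter


# Hypothetical layer size and rank, for illustration only.
full, adapter = lora_param_counts(d_out=4096, d_in=4096, rank=64)
print(f"full: {full:,}  adapter: {adapter:,}  adapter share: {adapter / full:.2%}")
```

For this layer the adapter is a small fraction of the full matrix, which is where the savings in gradients and optimizer state come from.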

QLoRA rests on the assumption that most of the information in a large language model’s weights survives at coarse precision, so the remaining detail can be approximated without affecting accuracy much. Quantizing the LLM weights to 4 bits reduces their memory footprint by about 8x relative to 32-bit weights (about 4x relative to 16-bit weights). The quantized LLM is then fine-tuned with Low Rank Adapters (LoRA), which lets the fine-tuned model preserve most of the accuracy of the original LLM.
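The memory figures are easy to check. The sketch below estimates the memory needed for the weights alone at 32, 16, and 4 bits per weight. It uses nominal parameter counts (7B, 13B, 33B, 65B) as an assumption and ignores quantization constants, adapters, activations, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# Nominal model sizes in billions of parameters (assumed round numbers).
for size_b in (7, 13, 33, 65):
    n = size_b * 1e9
    fp32, fp16, four_bit = (weight_memory_gib(n, b) for b in (32, 16, 4))
    print(f"{size_b}B: 32-bit {fp32:.1f} GiB, 16-bit {fp16:.1f} GiB, 4-bit {four_bit:.1f} GiB")
```

The 8x figure is the ratio between the 32-bit and 4-bit columns; against 16-bit weights the ratio is 4x.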

QLoRA has been demonstrated to be effective on a wide range of tasks, including text classification, question answering, and natural language generation. It is an exciting new way to fine-tune LLMs that has the potential to make them accessible to a wider range of users and applications.

QLoRA Demo

Guanaco is a system intended solely for research purposes, and it may produce problematic outputs.

You can try a live demo here. Please keep in mind that this is the 33B model; a demo of the 65B model will follow later. With this notebook, you can also host your own Guanaco gradio demo. For the 7B and 13B models, it works on free GPUs. Can you tell the difference between ChatGPT and Guanaco? Give it a go! The model response Colab here compares ChatGPT and Guanaco 65B on Vicuna prompts.

How to install QLoRA

To load models in 4 bits with transformers and bitsandbytes, you must install accelerate and transformers from source and have the latest version of the bitsandbytes library (0.39.0 at the time of writing) installed. The following commands do this and also install peft from source:

pip install -q -U bitsandbytes
pip install -q -U git+https://github.com/huggingface/transformers.git
pip install -q -U git+https://github.com/huggingface/peft.git
pip install -q -U git+https://github.com/huggingface/accelerate.git
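After installing, it can be worth confirming that the environment meets the bitsandbytes requirement. The following is a minimal sketch using only the standard library; the helper names version_tuple and installed_at_least are hypothetical, and the version parsing is deliberately simple (it keeps the leading numeric components and ignores suffixes such as .dev0).

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '0.39.0' into (0, 39, 0); stops at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_at_least(package: str, minimum: str) -> bool:
    """True if the package is installed and its version is >= minimum."""
    try:
        found = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False
    return version_tuple(found) >= version_tuple(minimum)


# 0.39.0 is the bitsandbytes version named in the text above.
print("bitsandbytes >= 0.39.0:", installed_at_least("bitsandbytes", "0.39.0"))
```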

Getting Started

The qlora.py script can be used for fine-tuning and inference on various datasets. The following basic command fine-tunes a baseline model on the Alpaca dataset:

python qlora.py --model_name_or_path <path_or_name>

For models larger than 13B, we recommend adjusting the learning rate:

python qlora.py --learning_rate 0.0001 --model_name_or_path <path_or_name>
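If you launch runs from a script, the learning-rate recommendation can be encoded once instead of remembered. The helper below is a hypothetical sketch, not part of qlora.py: it only assembles the two commands shown above, using the 13B threshold and the 0.0001 value from the text.

```python
import shlex


def qlora_command(model_name_or_path: str, params_billions: float) -> list:
    """Build the qlora.py command line, lowering the learning rate above 13B."""
    cmd = ["python", "qlora.py"]
    if params_billions > 13:
        # Recommendation from the text for models larger than 13B.
        cmd += ["--learning_rate", "0.0001"]
    cmd += ["--model_name_or_path", model_name_or_path]
    return cmd


# "<path_or_name>" is the placeholder from the commands above; replace it with your model.
print(shlex.join(qlora_command("<path_or_name>", params_billions=33)))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems.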

Quantization

Quantization parameters are controlled from the BitsAndBytesConfig as follows:

Loading in 4 bits is enabled via load_in_4bit.

The datatype used for linear layer computations is set with bnb_4bit_compute_dtype.

Nested quantization is enabled via bnb_4bit_use_double_quant.

The datatype used for quantization is specified with bnb_4bit_quant_type. Two quantization datatypes are supported: fp4 (four-bit float) and nf4 (normal four-bit float). We recommend nf4 because it is theoretically optimal for normally distributed weights.

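import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumed per-device memory cap (GPU 0 only); adjust to your hardware.
max_memory = {0: '24GB'}

model = AutoModelForCausalLM.from_pretrained(
    '/name/or/path/to/your/model',  # passed positionally
    device_map='auto',
    max_memory=max_memory,
    torch_dtype=torch.bfloat16,
    # The 4-bit settings all live in the quantization config.
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
    ),
)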
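To give some intuition for why a quantile-based 4-bit datatype suits normally distributed weights, here is a simplified sketch. It places the 16 levels at evenly spaced quantiles of a standard normal distribution, so each level covers roughly the same share of normally distributed values, and it scales each block of weights by its absolute maximum. This illustrates the idea only: the actual nf4 code table in bitsandbytes is constructed differently, and the function names and example weights below are made up.

```python
from statistics import NormalDist


def normal_quantile_levels(bits: int = 4) -> list:
    """Build 2**bits levels at evenly spaced quantiles of N(0, 1), scaled to [-1, 1]."""
    n = 2 ** bits
    raw = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]


def quantize_block(weights, levels):
    """Scale one block by its absolute maximum, then store the nearest level's index."""
    scale = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / scale))
        for w in weights
    ]
    return codes, scale


def dequantize_block(codes, scale, levels):
    """Map stored indices back to approximate weight values."""
    return [levels[c] * scale for c in codes]


levels = normal_quantile_levels()
block = [0.12, -0.40, 0.03, 0.91, -0.77, 0.25, -0.05, 0.50]  # made-up weights
codes, scale = quantize_block(block, levels)
restored = dequantize_block(codes, scale, levels)
for w, c, r in zip(block, codes, restored):
    print(f"{w:+.2f} -> code {c:2d} -> {r:+.3f}")
```

Each weight is stored as a 4-bit index plus one shared scale per block. The levels are densest near zero, where most normally distributed weights fall.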

Paged Optimizer

You can enable the paged optimizer with the following argument:

--optim paged_adamw_32bit

Limitations

4-bit inference is slow. Our 4-bit inference implementation is not yet integrated with 4-bit matrix multiplication.

Resuming a LoRA training run with the Trainer currently fails with an error.

Using fp16 as the compute datatype (bnb_4bit_compute_dtype) can currently cause instability. For 7B LLaMA, only 80% of fine-tuning runs complete without error. We have solutions, but they have not yet been integrated into bitsandbytes.

To avoid generation issues, make sure that tokenizer.bos_token_id is set to 1.

