# bitsandbytes (QLoRA)
VRAM: ~75% reduction with 4-bit
## TL;DR
Dynamic quantization for training and inference. Enables 4-bit QLoRA fine-tuning on consumer GPUs.
## Use when
- Fine-tuning with limited VRAM
- Quick inference testing
- Using the HuggingFace ecosystem
## Skip when
- Production inference (use GPTQ/AWQ instead)
- Non-NVIDIA hardware
bitsandbytes provides 8-bit and 4-bit quantization for PyTorch models. It is best known as the backbone of QLoRA training, which makes fine-tuning large models feasible on consumer hardware.
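Because quantization happens dynamically at load time, switching a model to 8-bit is a one-line config change. A minimal sketch using the standard Transformers integration (the model name here is just an example):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Weights are quantized to 8-bit on the fly at load time;
# no pre-quantized checkpoint is required.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example model
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs
)
```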
## Features
- **8-bit Optimizers**: Cut optimizer state memory by ~75% (sketched below)
- **4-bit NormalFloat (NF4)**: Optimal 4-bit data type for normally distributed weights
- **Double Quantization**: Quantizes the quantization constants themselves, saving further memory in QLoRA
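The 8-bit optimizers are drop-in replacements for their PyTorch counterparts. A minimal sketch; the model and learning rate are placeholders:

```python
import torch
import bitsandbytes as bnb

# Placeholder model; any nn.Module works the same way.
model = torch.nn.Linear(1024, 1024).cuda()

# Drop-in replacement for torch.optim.AdamW: optimizer states
# (momentum/variance) are kept in 8-bit instead of 32-bit,
# cutting optimizer memory by roughly 75%.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```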
## Key Benefits
- **Training Support**: The go-to method for 4-bit fine-tuning (see the QLoRA sketch after the code examples)
- **Easy Integration**: Works out of the box with HuggingFace Transformers
- **Dynamic Quantization**: Quantizes at load time, so no pre-quantized model is needed
## Code Examples
### Load 4-bit model with bitsandbytes

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization with float16 compute, the standard QLoRA setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.float16,  # dtype used during matmuls
    bnb_4bit_use_double_quant=True,        # double quantization (see Features)
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)
```
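To go from the 4-bit model above to actual QLoRA fine-tuning, LoRA adapters are attached on top of the frozen quantized weights. A minimal sketch using the `peft` library; the rank, alpha, and target modules below are illustrative choices, not settings prescribed by bitsandbytes:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for training (casts norms/embeddings,
# enables gradient checkpointing).
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Llama attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters train
```

The 4-bit base weights stay frozen; only the LoRA matrices are trained in higher precision, which is what keeps QLoRA within a consumer GPU's VRAM budget.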