bitsandbytes (QLoRA)

VRAM: ~75% reduction with 4-bit

TL;DR

Dynamic quantization for training and inference. Enables 4-bit QLoRA fine-tuning on consumer GPUs.

Use when

- Fine-tuning with limited VRAM
- Quick inference testing
- Using the HuggingFace ecosystem

Skip when

- Production inference (use GPTQ/AWQ)
- Non-NVIDIA hardware

bitsandbytes provides 8-bit and 4-bit quantization for PyTorch models. It is best known for enabling QLoRA training, which makes it possible to fine-tune large models on consumer hardware.
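
For plain inference, the simplest entry point is 8-bit loading. A minimal sketch of what this looks like (the model name is illustrative; any causal LM on the Hub works the same way):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize to 8-bit (LLM.int8()) on the fly at load time,
# roughly halving weight memory versus fp16
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",  # requires the accelerate package
)
```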

Features

- **8-bit Optimizers**: Reduce optimizer state memory by ~75% (see the sketch after this list)
- **4-bit NormalFloat (NF4)**: Optimal 4-bit data type for normally distributed weights
- **Double Quantization**: Quantizes the quantization constants themselves for further memory savings in QLoRA
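
The 8-bit optimizers are drop-in replacements for their torch.optim counterparts; the savings come from storing optimizer state (e.g. Adam's moment estimates) in 8 bits. A minimal sketch (the toy linear model is purely illustrative):

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(1024, 1024).cuda()  # toy model, stands in for a real network

# Drop-in replacement for torch.optim.AdamW; state tensors are kept in 8-bit
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()  # dummy loss for demonstration
loss.backward()
optimizer.step()
optimizer.zero_grad()
```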

Key Benefits

- **Training Support**: The standard path for 4-bit (QLoRA) fine-tuning (see the sketch after this list)
- **Easy Integration**: Works out of the box with HuggingFace Transformers
- **Dynamic Quantization**: Quantizes at load time, so no pre-quantized model is needed
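
For QLoRA itself, the pieces combine as: load the base model in 4-bit, then train LoRA adapters on top with the peft library. A minimal sketch, assuming peft is installed (the LoRA hyperparameters and target modules are illustrative, not a tuned recipe):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,      # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # gradient/casting fixes for k-bit training

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative: attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters train
```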

Code Examples

Load 4-bit model with bitsandbytes

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize to 4-bit NF4 on the fly at load time
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # dtype for matmuls and activations
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",  # requires accelerate; places quantized weights on the GPU
)
```