Best GPU for Fine-Tuning LLMs: LoRA, QLoRA, and Full Fine-Tuning Guide

Vishnu Subramanian
Founder @JarvisLabs.ai

For QLoRA fine-tuning of 7B-13B models, an RTX 4090 (24GB, $0.59/hr on JarvisLabs) is the sweet spot. For 70B model QLoRA, an A100 80GB ($1.49/hr) is the minimum. For full fine-tuning, you'll need multiple GPUs with FSDP or DeepSpeed — 8x A100 80GB for 70B models. The right GPU depends entirely on your model size and fine-tuning method.

Fine-Tuning Methods Explained

| Method | Memory Efficiency | Quality | Typical Use Case |
|---|---|---|---|
| Full Fine-Tuning | Low (most VRAM) | Highest | Pre-training, domain adaptation |
| LoRA | Medium | Very Good | Task-specific adaptation |
| QLoRA | High (least VRAM) | Good | Budget fine-tuning, experimentation |
| Prefix Tuning | High | Good | Lightweight task adaptation |

LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices. Memory scales with adapter size, not model size.
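To make that concrete, here is a rough adapter-size count for a Llama-7B-shaped model. The hidden size, layer count, rank, and target projections below are illustrative assumptions, not fixed requirements:

```python
# Rough LoRA adapter size for a Llama-7B-shaped model.
# Assumed: hidden_size=4096, 32 layers, rank 16 applied to the
# four 4096x4096 attention projections (q/k/v/o).

def lora_adapter_params(hidden=4096, layers=32, rank=16, targets=4):
    # Each adapted d x d matrix adds A (rank x d) + B (d x rank) params.
    per_matrix = rank * hidden + hidden * rank
    return per_matrix * targets * layers

params = lora_adapter_params()
print(f"adapter params: {params / 1e6:.1f}M")      # 16.8M
print(f"fraction of 7B: {params / 7e9 * 100:.2f}%")  # 0.24%
```

Under one percent of the base model's parameters are trainable, which is why adapter gradients and optimizer states add so little VRAM.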

QLoRA quantizes the base model to 4-bit and trains LoRA adapters on top. This dramatically reduces VRAM requirements — you can fine-tune a 70B model on a single 80GB GPU.
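A quick back-of-envelope check of the 4-bit base-model footprint, ignoring quantization overhead, adapters, gradients, and activations:

```python
# 4-bit weights use ~0.5 bytes per parameter.

def quantized_weights_gb(n_params, bits=4):
    return n_params * bits / 8 / 1e9

for size in (7e9, 13e9, 70e9):
    print(f"{size / 1e9:.0f}B 4-bit base: ~{quantized_weights_gb(size):.1f} GB")
# 7B -> ~3.5 GB, 13B -> ~6.5 GB, 70B -> ~35.0 GB
# Adapters, gradients, optimizer states, and activations come on top,
# which is why 70B QLoRA still wants an 80GB card.
```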

Full fine-tuning updates all model parameters. Requires memory for weights + optimizer states (2x for Adam) + gradients + activations. The most expensive but most flexible method.
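Using that accounting (fp16 weights at 2 bytes per parameter, AdamW states at 2x the weight size, fp16 gradients), a rough estimator before activation memory:

```python
# Rough full fine-tuning VRAM, before activations:
# fp16 weights (2 B/param) + AdamW states at 2x weight size
# (4 B/param) + fp16 gradients (2 B/param) = 8 B/param.

def full_ft_gb(n_params, bytes_per_param=8):
    return n_params * bytes_per_param / 1e9

for size in (7e9, 13e9, 70e9):
    print(f"{size / 1e9:.0f}B: ~{full_ft_gb(size):.0f} GB")
# 7B -> ~56 GB, 13B -> ~104 GB, 70B -> ~560 GB
```

Note this is one common accounting; keeping an extra fp32 master copy of the weights, or using an 8-bit optimizer, shifts the per-parameter cost up or down.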

VRAM Requirements by Model Size

QLoRA Fine-Tuning

| Model Size | VRAM Needed | Recommended GPU |
|---|---|---|
| 1B-3B | 6-8GB | Any 8GB+ GPU |
| 7B-8B | 10-16GB | RTX 4090 (24GB) |
| 13B | 14-20GB | RTX 4090 (24GB) |
| 34B | 24-32GB | A100 80GB |
| 70B | 40-50GB | A100 80GB |
| 70B (long context) | 50-70GB | A100 80GB or H100 |

LoRA Fine-Tuning (FP16 base model)

| Model Size | VRAM Needed | Recommended GPU |
|---|---|---|
| 1B-3B | 10-14GB | RTX 4090 (24GB) |
| 7B-8B | 18-24GB | RTX 4090 (24GB) or A100 |
| 13B | 30-40GB | A100 80GB |
| 34B | 70-80GB | A100 80GB (tight) or H100 |
| 70B | 140GB+ | 2x A100 80GB or 2x H100 |

Full Fine-Tuning

| Model Size | VRAM Needed | Minimum Configuration |
|---|---|---|
| 1B-3B | 24-40GB | 1x RTX 4090 or A100 |
| 7B-8B | 60-100GB | 1-2x A100 80GB |
| 13B | 100-160GB | 2x A100 80GB |
| 34B | 250-350GB | 4x A100 80GB |
| 70B | 400-600GB | 8x A100 80GB |

Full fine-tuning estimates assume AdamW optimizer (2x model size for optimizer states), FP16 mixed precision, and gradient checkpointing. Without gradient checkpointing, activation memory can double these requirements.

GPU Recommendations by Use Case

Best for 7B-13B QLoRA: NVIDIA RTX 4090 (24GB)

For the most common fine-tuning scenario — customizing a 7B or 13B model with QLoRA:

  • 24GB VRAM — fits 7B-13B QLoRA with room for context
  • Fast training — 4th-gen Tensor Cores handle mixed-precision training well
  • $0.59/hr on JarvisLabs — fine-tuning a 7B model for 1,000 steps costs ~$0.20-0.40
  • Most cost-effective for small-to-medium models

A 7B model QLoRA fine-tune (1,000 steps, batch size 4, 512 context) takes roughly 20-40 minutes on an RTX 4090. At $0.59/hr, that's under $0.40 per training run. Check our pricing page.

Best for 70B QLoRA: NVIDIA A100 80GB

QLoRA fine-tuning of 70B models needs the memory headroom:

  • 80GB VRAM — fits 70B QLoRA (4-bit base model ~35GB + LoRA adapters + KV cache)
  • High bandwidth (2.0 TB/s) — fast access to weights and activations during training
  • $1.49/hr on JarvisLabs — a 70B QLoRA run (2,000 steps) takes 2-4 hours ≈ $3-6
  • Multi-GPU available — use 2x A100 80GB for faster training or longer context

Best for Speed: NVIDIA H100 (80GB)

When training time matters:

  • 4th-gen Tensor Cores — fastest mixed-precision training
  • FP8 training support — via Transformer Engine, further speeds up training with minimal quality impact
  • 80GB HBM3 at 3.35 TB/s — faster data movement than A100
  • $2.69/hr on JarvisLabs — ~1.5-2x faster than A100, often cheaper per training run despite higher hourly rate

For a 7B model QLoRA fine-tune, the H100 typically completes 1.5-2x faster than the A100. If your A100 run takes 3 hours ($4.47), an H100 might finish in 1.5-2 hours ($4.03-5.38) — comparable cost but you get your model sooner.
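The arithmetic behind that comparison, as a small sketch using the hourly rates and rough speedup range quoted above:

```python
# Per-run cost: the H100's higher hourly rate can be offset by
# its ~1.5-2x speedup over the A100 on the same job.

def run_cost(hours, rate_per_hr):
    return hours * rate_per_hr

a100 = run_cost(3.0, 1.49)              # 3 hr run on A100 80GB
h100_fast = run_cost(3.0 / 2.0, 2.69)   # same run at 2x speedup
h100_slow = run_cost(3.0 / 1.5, 2.69)   # same run at 1.5x speedup
print(f"A100: ${a100:.2f}, H100: ${h100_fast:.2f}-${h100_slow:.2f}")
# A100: $4.47, H100: $4.03-$5.38
```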

Best for Full Fine-Tuning: Multi-GPU Configurations

Full fine-tuning requires distributed training:

| Model Size | Configuration | Estimated Cost* |
|---|---|---|
| 7B (full FT) | 2x A100 80GB | ~$6-12 per run |
| 13B (full FT) | 2x A100 80GB | ~$12-24 per run |
| 70B (full FT) | 8x A100 80GB | ~$100-300 per run |
| 70B (full FT) | 8x H100 | ~$100-250 per run (faster) |

*Rough estimates for 1 epoch over a medium dataset (~10K examples). Actual costs depend on dataset size, context length, batch size, and learning rate schedule.

Use FSDP (Fully Sharded Data Parallelism) or DeepSpeed ZeRO-3 to distribute model weights, optimizer states, and gradients across GPUs. Both are well-supported in the Hugging Face Transformers + TRL ecosystem.
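A rough sketch of why sharding makes 70B full fine-tuning feasible. FSDP and ZeRO-3 split weights, gradients, and optimizer states evenly across GPUs (here assumed at roughly 8 bytes per parameter total; activations are extra and are not sharded this way):

```python
# Per-GPU memory for sharded training state under FSDP / ZeRO-3.
# Assumes ~8 B/param total for fp16 weights + grads + AdamW states.

def per_gpu_gb(n_params, n_gpus, bytes_per_param=8):
    return n_params * bytes_per_param / 1e9 / n_gpus

print(f"70B on 8 GPUs: ~{per_gpu_gb(70e9, 8):.0f} GB/GPU")  # ~70 GB
# ~70 GB of sharded state per GPU, plus activations — which is why
# 8x A100 80GB is the floor for 70B full fine-tuning.
```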

Training Framework Guide

Recommended Stack

For most fine-tuning jobs:

  1. Hugging Face TRL — high-level training library for SFT, DPO, RLHF
  2. PEFT — LoRA and QLoRA implementation
  3. bitsandbytes — 4-bit quantization for QLoRA
  4. Flash Attention 2 — memory-efficient attention (2-4x faster, less VRAM)
  5. Weights & Biases or MLflow — experiment tracking
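A minimal QLoRA configuration sketch using this stack. The model id and every hyperparameter here are illustrative assumptions, not prescriptions:

```python
# QLoRA config sketch: 4-bit base model (bitsandbytes) + LoRA adapters (PEFT).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quant constants
)

lora_config = LoraConfig(
    r=16,                                   # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # illustrative model id
    quantization_config=bnb_config,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only adapter params are trainable
```

Pass the resulting `model` plus your dataset to TRL's `SFTTrainer` to run the actual fine-tune.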

Key Optimization Techniques

Gradient checkpointing: Trades compute for memory. Re-computes activations during backward pass instead of storing them. Reduces VRAM by 30-60% with ~20% training time overhead. Almost always worth enabling.

Flash Attention 2: Faster and more memory-efficient attention computation, supported by most modern models. Install it:

```shell
pip install flash-attn --no-build-isolation
```

Then enable it when loading the model, e.g. `AutoModelForCausalLM.from_pretrained(..., attn_implementation="flash_attention_2")`.

Packing / sample packing: Concatenates short training examples into single sequences to maximize GPU utilization. Reduces padding waste, especially when training examples vary in length. Supported in TRL and Axolotl.
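A toy sketch of the idea (real implementations in TRL and Axolotl also handle attention masking across the packed examples):

```python
# Greedy packing sketch: concatenate tokenized examples, separated
# by an EOS token, into fixed-length blocks so almost no compute is
# spent on padding.

def pack_examples(examples, block_size, eos_id=2):
    stream = []
    for tokens in examples:
        stream.extend(tokens + [eos_id])
    # Split the stream into full blocks; drop the ragged tail.
    return [stream[i:i + block_size]
            for i in range(0, len(stream) - block_size + 1, block_size)]

examples = [[1] * 100, [1] * 300, [1] * 50, [1] * 400]
blocks = pack_examples(examples, block_size=512)
print(len(blocks), [len(b) for b in blocks])  # 1 [512]
```

Four variable-length examples become one dense 512-token block instead of four padded sequences.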

Gradient accumulation: Simulates larger batch sizes without more memory. If an effective batch size of 16 is the goal but only batch size 4 fits in VRAM, use batch size 4 with gradient accumulation of 4 for equivalent training dynamics.
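A toy sketch of why accumulation reproduces the larger batch, using a scalar model with a hand-derived gradient (real trainers accumulate by delaying `optimizer.step()`):

```python
# Averaging gradients over 4 micro-batches of size 1 matches the
# gradient of one batch of size 4, for the model y = w * x with
# squared-error loss.

def grad(w, batch):  # d/dw of mean (w*x - y)^2 over the batch
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0)]

full = grad(w, batch)                           # batch size 4
accum = sum(grad(w, [ex]) for ex in batch) / 4  # 4 micro-batches
print(full == accum)  # True
```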

Cost Comparison by Model Size

| Fine-Tuning Scenario | Best GPU | Approx. Time | Approx. Cost |
|---|---|---|---|
| 7B QLoRA (1K steps) | RTX 4090 | 20-40 min | $0.20-0.40 |
| 7B QLoRA (5K steps) | RTX 4090 | 2-3 hrs | $1.20-1.80 |
| 13B QLoRA (1K steps) | RTX 4090 | 30-60 min | $0.30-0.60 |
| 13B QLoRA (5K steps) | A100 80GB | 3-5 hrs | $4.50-7.50 |
| 70B QLoRA (1K steps) | A100 80GB | 1-2 hrs | $1.50-3.00 |
| 70B QLoRA (5K steps) | A100 80GB | 5-10 hrs | $7.50-15.00 |
| 7B Full FT (1 epoch, 10K examples) | 2x A100 80GB | 2-4 hrs | $6.00-12.00 |
| 70B Full FT (1 epoch, 10K examples) | 8x A100 80GB | 8-20 hrs | $95-240 |

Estimates based on typical training setups. Your costs may vary based on context length, dataset size, and optimization settings. Check JarvisLabs pricing for current rates.

Common Mistakes to Avoid

Using too large a GPU for small models. Don't rent an H100 to QLoRA fine-tune a 7B model. An RTX 4090 handles this at 1/4 the cost.

Skipping gradient checkpointing. The 20% training slowdown is almost always worth the 30-60% VRAM savings. Enable it unless you have VRAM to spare.

Ignoring learning rate. This affects results more than GPU choice. Use cosine schedule with warmup. Start with 2e-4 for QLoRA, 2e-5 for full fine-tuning.
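A sketch of that schedule (the warmup length and total step count below are illustrative):

```python
import math

# Cosine LR schedule with linear warmup. Peak LR 2e-4 matches the
# QLoRA starting point suggested above.

def lr_at(step, total=1000, warmup=50, peak=2e-4):
    if step < warmup:
        return peak * step / warmup  # linear warmup to the peak
    progress = (step - warmup) / (total - warmup)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

print(f"{lr_at(0):.1e} {lr_at(50):.1e} {lr_at(1000):.1e}")
# 0.0e+00 2.0e-04 0.0e+00
```

In practice you get the same curve from `transformers.get_cosine_schedule_with_warmup` rather than writing it yourself.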

Not evaluating during training. Run evaluations every 100-500 steps. You'll often find that training for fewer steps is sufficient, saving GPU time.

Training too long. More steps doesn't always mean better results. For QLoRA, 1,000-3,000 steps is often sufficient for 1K-10K training examples. Overfitting wastes both time and money.

FAQ

What's the cheapest way to fine-tune a 7B LLM?

QLoRA on an RTX 4090 ($0.59/hr on JarvisLabs). A typical 7B QLoRA training run (1,000 steps) costs under $0.40. Check our pricing page.

Can I fine-tune Llama 70B on a single GPU?

With QLoRA, yes — on an A100 80GB or H100. The 4-bit quantized model (~35GB) plus LoRA adapters fit in 80GB. For full fine-tuning of 70B, you need 8+ GPUs. See our Llama 70B GPU guide for more details.

LoRA vs QLoRA — which should I use?

QLoRA if you're budget-conscious or VRAM-limited — it uses significantly less memory. LoRA (on FP16 base) if you want slightly better quality and have the VRAM. For most practical applications, QLoRA results are close to LoRA.

How long does fine-tuning take?

Depends heavily on model size, dataset, and steps. Rough guide: 7B QLoRA (1K steps) ≈ 20-40 min on RTX 4090. 70B QLoRA (1K steps) ≈ 1-2 hours on A100 80GB. Full fine-tuning is 5-20x longer than QLoRA for the same model.

Do I need ECC memory for fine-tuning?

For short runs (under a few hours), bit errors are unlikely. For long training runs (24+ hours), ECC memory (available on A100, H100) provides protection against memory corruption. For QLoRA experiments, non-ECC GPUs (RTX 4090) are perfectly fine.

Should I fine-tune or use RAG?

Fine-tune when you need the model to learn a specific style, format, or domain knowledge. Use RAG when you need the model to reference specific documents or data that changes frequently. Many production systems use both — fine-tuned model + RAG retrieval.

How much training data do I need?

For QLoRA fine-tuning: 500-10,000 high-quality examples is typical. More data isn't always better — data quality matters more than quantity. For full fine-tuning or continued pre-training, you generally need 10K+ examples to see meaningful improvement.

Build & Deploy Your AI in Minutes

Get started with JarvisLabs today and experience the power of cloud GPU infrastructure designed specifically for AI development.

← Back to FAQs