Best GPU for Fine-Tuning LLMs: LoRA, QLoRA, and Full Fine-Tuning Guide
For QLoRA fine-tuning of 7B-13B models, an RTX 4090 (24GB, $0.59/hr on JarvisLabs) is the sweet spot. For 70B model QLoRA, an A100 80GB ($1.49/hr) is the minimum. For full fine-tuning, you'll need multiple GPUs with FSDP or DeepSpeed — 8x A100 80GB for 70B models. The right GPU depends entirely on your model size and fine-tuning method.
Fine-Tuning Methods Explained
| Method | Memory Efficiency | Quality | Typical Use Case |
|---|---|---|---|
| Full Fine-Tuning | Low (most VRAM) | Highest | Pre-training, domain adaptation |
| LoRA | Medium | Very Good | Task-specific adaptation |
| QLoRA | High (least VRAM) | Good | Budget fine-tuning, experimentation |
| Prefix Tuning | High | Good | Lightweight task adaptation |
LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices. Memory scales with adapter size, not model size.
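To make "memory scales with adapter size" concrete, here's a back-of-the-envelope trainable-parameter count. The dimensions are illustrative (Llama-2-7B-like: hidden size 4096, 32 layers, adapters on the four attention projections), not any model's exact architecture:

```python
# Rough LoRA parameter-count sketch (illustrative shapes, not an
# exact architecture).
def lora_trainable_params(d_model, n_layers, rank, n_target_matrices=4):
    """Estimate trainable LoRA parameters.

    Each adapted square weight W (d_model x d_model) gets two
    low-rank factors A (rank x d_model) and B (d_model x rank),
    adding 2 * d_model * rank parameters per matrix.
    n_target_matrices is how many projections per layer get
    adapters (e.g. q/k/v/o).
    """
    per_matrix = 2 * d_model * rank
    return n_layers * n_target_matrices * per_matrix

# Llama-2-7B-like shape with rank-16 adapters:
adapters = lora_trainable_params(d_model=4096, n_layers=32, rank=16)
base = 7_000_000_000
print(f"trainable: {adapters:,} ({100 * adapters / base:.2f}% of base)")
```

Well under 1% of the base model's parameters are trained, which is why adapter gradients and optimizer states barely register next to the frozen weights.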
QLoRA quantizes the base model to 4-bit and trains LoRA adapters on top. This dramatically reduces VRAM requirements — you can fine-tune a 70B model on a single 80GB GPU.
Full fine-tuning updates all model parameters. Requires memory for weights + optimizer states (2x for Adam) + gradients + activations. The most expensive but most flexible method.
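The memory breakdowns above can be sketched as a rough calculator. The constants are assumptions, not measurements: FP16 weights and gradients at 2 bytes/param, Adam states at 2x the weights, a 4-bit base at ~0.5 bytes/param, and flat activation budgets that in practice vary with batch size, context length, and gradient checkpointing:

```python
GB = 1024**3

def full_ft_vram_gb(n_params, bytes_per_weight=2, activation_gb=20):
    """Ballpark full fine-tuning VRAM, following the breakdown above:
    FP16 weights + gradients (same size as weights) + Adam optimizer
    states (2x the weights: one momentum and one variance tensor),
    plus a flat activation budget."""
    weights = n_params * bytes_per_weight
    grads = weights
    optimizer = 2 * weights
    return (weights + grads + optimizer) / GB + activation_gb

def qlora_vram_gb(n_params, activation_gb=8):
    """4-bit base model (~0.5 bytes/param) plus a small budget for
    LoRA adapters, their optimizer states, and activations."""
    return (n_params * 0.5) / GB + activation_gb

print(f"70B full FT : ~{full_ft_vram_gb(70e9):.0f} GB")
print(f"70B QLoRA   : ~{qlora_vram_gb(70e9):.0f} GB")
```

For a 70B model this lands around 540 GB for full fine-tuning and ~40 GB for QLoRA, consistent with the tables below.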
VRAM Requirements by Model Size
QLoRA Fine-Tuning
| Model Size | VRAM Needed | Recommended GPU |
|---|---|---|
| 1B-3B | 6-8GB | Any 8GB+ GPU |
| 7B-8B | 10-16GB | RTX 4090 (24GB) |
| 13B | 14-20GB | RTX 4090 (24GB) |
| 34B | 24-32GB | A100 80GB |
| 70B | 40-50GB | A100 80GB |
| 70B (long context) | 50-70GB | A100 80GB or H100 |
LoRA Fine-Tuning (FP16 base model)
| Model Size | VRAM Needed | Recommended GPU |
|---|---|---|
| 1B-3B | 10-14GB | RTX 4090 (24GB) |
| 7B-8B | 18-24GB | RTX 4090 (24GB) or A100 |
| 13B | 30-40GB | A100 80GB |
| 34B | 70-80GB | A100 80GB (tight) or H100 |
| 70B | 140GB+ | 2x A100 80GB or 2x H100 |
Full Fine-Tuning
| Model Size | VRAM Needed | Minimum Configuration |
|---|---|---|
| 1B-3B | 24-40GB | 1x RTX 4090 or A100 |
| 7B-8B | 60-100GB | 1-2x A100 80GB |
| 13B | 100-160GB | 2x A100 80GB |
| 34B | 250-350GB | 4x A100 80GB |
| 70B | 400-600GB | 8x A100 80GB |
Full fine-tuning estimates assume AdamW optimizer (2x model size for optimizer states), FP16 mixed precision, and gradient checkpointing. Without gradient checkpointing, activation memory can double these requirements.
GPU Recommendations by Use Case
Best for 7B-13B QLoRA: NVIDIA RTX 4090 (24GB)
For the most common fine-tuning scenario — customizing a 7B or 13B model with QLoRA:
- 24GB VRAM — fits 7B-13B QLoRA with room for context
- Fast training — 4th-gen Tensor Cores handle mixed-precision training well
- $0.59/hr on JarvisLabs — fine-tuning a 7B model for 1,000 steps costs ~$0.20-0.40
- Most cost-effective for small-to-medium models
A 7B model QLoRA fine-tune (1,000 steps, batch size 4, 512 context) takes roughly 20-40 minutes on an RTX 4090. At $0.59/hr, that's under $0.40 per training run. Check our pricing page.
Best for 70B QLoRA: NVIDIA A100 80GB
QLoRA fine-tuning of 70B models needs the memory headroom:
- 80GB VRAM — fits 70B QLoRA (4-bit base model ~35GB + LoRA adapters + KV cache)
- High memory bandwidth (2.0 TB/s) — keeps the compute units fed during training
- $1.49/hr on JarvisLabs — a 70B QLoRA run (2,000 steps) takes 2-4 hours ≈ $3-6
- Multi-GPU available — use 2x A100 80GB for faster training or longer context
Best for Speed: NVIDIA H100 (80GB)
When training time matters:
- 4th-gen Tensor Cores — fastest mixed-precision training
- FP8 training support — via Transformer Engine, further speeds up training with minimal quality impact
- 80GB HBM3 at 3.35 TB/s — faster data movement than A100
- $2.69/hr on JarvisLabs — ~1.5-2x faster than A100, so cost per run is often comparable or lower despite the higher hourly rate
For a 7B model QLoRA fine-tune, the H100 typically completes 1.5-2x faster than the A100. If your A100 run takes 3 hours ($4.47), an H100 might finish in 1.5-2 hours ($4.03-5.38) — comparable cost but you get your model sooner.
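The break-even arithmetic is easy to check for your own runs. Rates are the JarvisLabs hourly prices quoted above; the speedup factors are the rough 1.5-2x range, not benchmarks:

```python
def cost(rate_per_hr, hours):
    """Total cost of a run at a given hourly rate."""
    return rate_per_hr * hours

a100_rate, h100_rate = 1.49, 2.69  # hourly rates quoted above

a100_hours = 3.0
for speedup in (1.5, 2.0):
    h100_hours = a100_hours / speedup
    print(f"{speedup}x speedup: A100 ${cost(a100_rate, a100_hours):.2f}"
          f" vs H100 ${cost(h100_rate, h100_hours):.2f}")
```

At the 2x end of the range the H100 run comes out slightly cheaper; at 1.5x it's slightly more expensive but finishes an hour sooner.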
Best for Full Fine-Tuning: Multi-GPU Configurations
Full fine-tuning requires distributed training:
| Model Size | Configuration | Estimated Cost* |
|---|---|---|
| 7B (full FT) | 2x A100 80GB | ~$6-12 per run |
| 13B (full FT) | 2x A100 80GB | ~$12-24 per run |
| 70B (full FT) | 8x A100 80GB | ~$100-300 per run |
| 70B (full FT) | 8x H100 | ~$100-250 per run (faster) |
Rough estimates for 1 epoch over a medium dataset (~10K examples). Actual costs depend on dataset size, context length, batch size, and learning rate schedule.
Use FSDP (Fully Sharded Data Parallelism) or DeepSpeed ZeRO-3 to distribute model weights, optimizer states, and gradients across GPUs. Both are well-supported in the Hugging Face Transformers + TRL ecosystem.
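A minimal Accelerate config for the FSDP route might look like the sketch below. Field names vary across Accelerate versions, so treat this as a shape to aim for and generate a current file interactively with `accelerate config`:

```yaml
# Sketch of an 8-GPU FSDP config for Hugging Face Accelerate
# (field names may differ by version; regenerate with `accelerate config`).
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
num_processes: 8          # one process per GPU
mixed_precision: bf16
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD        # shard weights, grads, and optimizer states
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_state_dict_type: SHARDED_STATE_DICT  # checkpoint without gathering to one GPU
```

Launch your training script with `accelerate launch --config_file fsdp.yaml train.py`. DeepSpeed ZeRO-3 is configured analogously through its own JSON config.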
Training Framework Guide
Recommended Stack
For most fine-tuning jobs:
- Hugging Face TRL — high-level training library for SFT, DPO, RLHF
- PEFT — LoRA and QLoRA implementation
- bitsandbytes — 4-bit quantization for QLoRA
- Flash Attention 2 — memory-efficient attention (2-4x faster, less VRAM)
- Weights & Biases or MLflow — experiment tracking
Key Optimization Techniques
Gradient checkpointing: Trades compute for memory. Re-computes activations during backward pass instead of storing them. Reduces VRAM by 30-60% with ~20% training time overhead. Almost always worth enabling.
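The trade can be illustrated without any framework: store only every k-th activation on the forward pass and recompute the rest from the nearest checkpoint when needed. This is a toy sketch of the idea, not how PyTorch implements it internally:

```python
# Toy illustration of gradient checkpointing: keep activations only
# at segment boundaries, recompute the rest on demand (in real
# training, during the backward pass).
def forward_full(x, layers):
    acts = [x]
    for f in layers:
        acts.append(f(acts[-1]))
    return acts  # one stored activation per layer

def forward_checkpointed(x, layers, every=2):
    saved = {0: x}
    h = x
    for i, f in enumerate(layers, start=1):
        h = f(h)
        if i % every == 0:
            saved[i] = h  # only every 2nd activation kept
    return saved

def recompute(saved, layers, i, every=2):
    """Rebuild activation i from the nearest saved checkpoint."""
    start = (i // every) * every
    h = saved[start]
    for f in layers[start:i]:
        h = f(h)
    return h

layers = [lambda v: 2 * v, lambda v: v + 3, lambda v: v * v, lambda v: v - 1]
full = forward_full(5, layers)
saved = forward_checkpointed(5, layers)
# Recomputed activations match the fully stored ones:
assert all(recompute(saved, layers, i) == full[i] for i in range(len(full)))
print(f"stored full: {len(full)}  stored checkpointed: {len(saved)}")
```

Fewer stored activations, at the price of re-running part of the forward pass — exactly the 30-60% VRAM vs ~20% time trade described above.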
Flash Attention 2: Faster and more memory-efficient attention computation. Supports most modern models. Install via `pip install flash-attn`.
Packing / sample packing: Concatenates short training examples into single sequences to maximize GPU utilization. Reduces padding waste, especially when training examples vary in length. Supported in TRL and Axolotl.
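A minimal packing sketch in plain Python. Real implementations (such as TRL's) also handle cross-example attention masking and the ragged tail more carefully; here short tokenized examples are simply joined with an EOS token and sliced into fixed-length blocks:

```python
# Minimal sample-packing sketch: concatenate tokenized examples and
# slice into fixed-length blocks so short examples don't waste
# padded positions.
def pack(examples, block_size, eos_id=0):
    stream = []
    for toks in examples:
        stream.extend(toks + [eos_id])  # separate examples with EOS
    n_blocks = len(stream) // block_size
    # drop the ragged tail; real code may pad or carry it over
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
blocks = pack(examples, block_size=4)
print(blocks)
```

Padding each of the four examples to length 4 would cost 16 token slots; packing fits the same data into 3 full blocks with no padding inside them.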
Gradient accumulation: Simulates larger batch sizes without more memory. If batch size 4 doesn't fit in VRAM, use batch size 1 with gradient accumulation over 4 steps for equivalent training dynamics.
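The equivalence is easy to verify on a toy loss. For a mean-reduced loss, summing per-sample gradients over 4 micro-batches of size 1 and dividing by 4 gives the same gradient as one batch of 4 (toy 1-D "model": loss = (w*x - y)^2):

```python
# Gradient of the toy per-sample loss (w*x - y)^2 with respect to w.
def grad(w, x, y):
    return 2 * (w * x - y) * x

w = 0.5
batch = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (0.5, 0.0)]

# One batch of 4 with mean-reduced loss:
full_grad = sum(grad(w, x, y) for x, y in batch) / len(batch)

# Four micro-batches of 1, accumulated then averaged:
accum = 0.0
for x, y in batch:
    accum += grad(w, x, y)  # in a framework: loss.backward() adds into .grad
accum /= len(batch)         # scale before the optimizer step

assert abs(full_grad - accum) < 1e-12
print(f"batch grad {full_grad} == accumulated grad {accum}")
```

The only practical caveat: layers whose statistics depend on batch composition (e.g. batch norm, rare in LLMs) see the smaller micro-batch, not the effective batch.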
Cost Comparison by Model Size
| Fine-Tuning Scenario | Best GPU | Approx. Time | Approx. Cost |
|---|---|---|---|
| 7B QLoRA (1K steps) | RTX 4090 | 20-40 min | $0.20-0.40 |
| 7B QLoRA (5K steps) | RTX 4090 | 2-3 hrs | $1.20-1.80 |
| 13B QLoRA (1K steps) | RTX 4090 | 30-60 min | $0.30-0.60 |
| 13B QLoRA (5K steps) | A100 80GB | 3-5 hrs | $4.50-7.50 |
| 70B QLoRA (1K steps) | A100 80GB | 1-2 hrs | $1.50-3.00 |
| 70B QLoRA (5K steps) | A100 80GB | 5-10 hrs | $7.50-15.00 |
| 7B Full FT (1 epoch, 10K examples) | 2x A100 80GB | 2-4 hrs | $6.00-12.00 |
| 70B Full FT (1 epoch, 10K examples) | 8x A100 80GB | 8-20 hrs | $95-240 |
Estimates based on typical training setups. Your costs may vary based on context length, dataset size, and optimization settings. Check JarvisLabs pricing for current rates.
Common Mistakes to Avoid
Using too large a GPU for small models. Don't rent an H100 to QLoRA fine-tune a 7B model. An RTX 4090 handles this at roughly a quarter of the hourly cost.
Skipping gradient checkpointing. The 20% training slowdown is almost always worth the 30-60% VRAM savings. Enable it unless you have VRAM to spare.
Ignoring learning rate. This affects results more than GPU choice. Use cosine schedule with warmup. Start with 2e-4 for QLoRA, 2e-5 for full fine-tuning.
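The warmup-then-cosine schedule is a few lines to sketch. This mirrors the common shape used by fine-tuning stacks; the peak values are the starting points suggested above, not tuned results:

```python
import math

# Cosine learning-rate schedule with linear warmup.
def lr_at_step(step, total_steps, peak_lr, warmup_steps):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup to the peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # decay to 0

peak = 2e-4  # QLoRA starting point from above; try ~2e-5 for full fine-tuning
for s in (0, 50, 100, 500, 1000):
    print(f"step {s:4d}: lr = {lr_at_step(s, 1000, peak, warmup_steps=100):.2e}")
```

In practice you'd get this from your framework's scheduler rather than hand-rolling it; the sketch just shows what "cosine with warmup" does to the learning rate over a run.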
Not evaluating during training. Run evaluations every 100-500 steps. You'll often find that training for fewer steps is sufficient, saving GPU time.
Training too long. More steps doesn't always mean better results. For QLoRA, 1,000-3,000 steps is often sufficient for 1K-10K training examples. Overfitting wastes both time and money.
FAQ
What's the cheapest way to fine-tune a 7B LLM?
QLoRA on an RTX 4090 ($0.59/hr on JarvisLabs). A typical 7B QLoRA training run (1,000 steps) costs under $0.40. Check our pricing page.
Can I fine-tune Llama 70B on a single GPU?
With QLoRA, yes — on an A100 80GB or H100. The 4-bit quantized model (~35GB) plus LoRA adapters fit in 80GB. For full fine-tuning of 70B, you need 8+ GPUs. See our Llama 70B GPU guide for more details.
LoRA vs QLoRA — which should I use?
QLoRA if you're budget-conscious or VRAM-limited — it uses significantly less memory. LoRA (on FP16 base) if you want slightly better quality and have the VRAM. For most practical applications, QLoRA results are close to LoRA.
How long does fine-tuning take?
Depends heavily on model size, dataset, and steps. Rough guide: 7B QLoRA (1K steps) ≈ 20-40 min on RTX 4090. 70B QLoRA (1K steps) ≈ 1-2 hours on A100 80GB. Full fine-tuning is 5-20x longer than QLoRA for the same model.
Do I need ECC memory for fine-tuning?
For short runs (under a few hours), bit errors are unlikely. For long training runs (24+ hours), ECC memory (available on A100, H100) provides protection against memory corruption. For QLoRA experiments, non-ECC GPUs (RTX 4090) are perfectly fine.
Should I fine-tune or use RAG?
Fine-tune when you need the model to learn a specific style, format, or domain knowledge. Use RAG when you need the model to reference specific documents or data that changes frequently. Many production systems use both — fine-tuned model + RAG retrieval.
How much training data do I need?
For QLoRA fine-tuning: 500-10,000 high-quality examples is typical. More data isn't always better — data quality matters more than quantity. For full fine-tuning or continued pre-training, you generally need 10K+ examples to see meaningful improvement.
Build & Deploy Your AI in Minutes
Get started with JarvisLabs today and experience the power of cloud GPU infrastructure designed specifically for AI development.