Best GPU for Stable Diffusion: SDXL, SD 1.5, and FLUX (2026 Guide)
For Stable Diffusion XL inference, an RTX 4090 (24GB, $0.59/hr on JarvisLabs) is the sweet spot — fast generation, enough VRAM for high-resolution outputs, and great value. For SD 1.5, even an L4 (24GB, $0.44/hr) works well. For FLUX, see our dedicated FLUX GPU guide. For training or fine-tuning custom models, an A100 80GB gives you room for larger batch sizes and LoRA training at higher resolutions.
VRAM Requirements by Model
| Model | Minimum VRAM | Recommended VRAM | Notes |
|---|---|---|---|
| SD 1.5 | 4GB | 8-12GB | Runs on almost anything |
| SDXL | 8GB | 12-24GB | 1024×1024 base resolution |
| SDXL + Refiner | 12GB | 24GB | Two models loaded simultaneously |
| SD 3 Medium | 8GB | 16-24GB | MMDiT architecture |
| FLUX.1 Dev | 12GB | 24GB+ | See FLUX GPU guide |
| FLUX.1 Schnell | 10GB | 16-24GB | Faster, lighter FLUX variant |
These are for inference with default settings. Higher resolutions, larger batch sizes, ControlNet, IP-Adapter, and other extensions increase VRAM requirements.
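To make these numbers concrete, here is a minimal load sketch using Hugging Face diffusers, assuming the stabilityai/stable-diffusion-xl-base-1.0 checkpoint; the commented lines show the usual VRAM-saving fallbacks for smaller cards:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,  # FP16 roughly halves weight memory vs FP32
    variant="fp16",
)
pipe.to("cuda")  # on 24GB cards, keep everything resident on the GPU

# On smaller cards (8-12GB), trade speed for VRAM instead of .to("cuda"):
# pipe.enable_model_cpu_offload()  # parks idle submodules in CPU RAM
# pipe.enable_vae_tiling()         # tiles VAE decode for high resolutions
```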
GPU Recommendations
Best Overall: NVIDIA RTX 4090 (24GB)
The RTX 4090 hits the sweet spot for Stable Diffusion:
- 24GB VRAM — handles SDXL at full resolution with room for ControlNet, LoRAs, and upscaling
- Fast generation — SDXL 1024×1024 in ~3-5 seconds per image (20 steps, Euler sampler)
- Good tensor performance — 4th-gen Tensor Cores accelerate both inference and training
- $0.59/hr on JarvisLabs — best price-to-performance for image generation
For most Stable Diffusion users — whether generating images, running ComfyUI workflows, or fine-tuning LoRAs — the RTX 4090 is the GPU to pick. Check our pricing page.
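To sanity-check the ~3-5s figure above on your own instance, here is a timing sketch at the same settings (1024×1024, 20 steps, Euler sampler), assuming the `pipe` object from the previous snippet:

```python
import time
from diffusers import EulerDiscreteScheduler

pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
prompt = "a photo of an astronaut riding a horse"

pipe(prompt=prompt, num_inference_steps=20)  # warmup run: the first call is slower

start = time.perf_counter()
image = pipe(prompt=prompt, width=1024, height=1024, num_inference_steps=20).images[0]
print(f"{time.perf_counter() - start:.2f}s per image")
image.save("astronaut.png")
```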
Best Budget: NVIDIA L4 (24GB)
The L4 is surprisingly capable for image generation:
- 24GB VRAM — same memory as RTX 4090, so it runs SDXL and FLUX without issues
- Slower generation — roughly 2-3x slower than RTX 4090 for inference
- $0.44/hr on JarvisLabs — cheapest option that comfortably runs SDXL
- Low power — 72W TDP, efficient for batch generation
Best for: batch generation jobs where speed per image matters less than cost per image. If you're generating thousands of images overnight, the L4's lower hourly rate can be more cost-effective despite slower generation.
Best for Training: NVIDIA A100 80GB
For fine-tuning Stable Diffusion models or training LoRAs:
- 80GB VRAM — large batch sizes, high-resolution training (1024×1024+), multiple models in memory
- High memory bandwidth (2.0 TB/s) — keeps the compute units fed when moving large activations and weights during training
- $1.49/hr on JarvisLabs — reasonable for training sessions that take hours
- Multi-GPU support — scale to 2-4 A100s for faster training
Fine-tuning SDXL LoRAs works on an RTX 4090 (24GB) for most cases, but full model fine-tuning and DreamBooth training at high resolutions benefit from the A100's extra memory.
Best for Production: NVIDIA H100 (80GB)
For production image generation APIs serving many users:
- Highest throughput — fastest single-GPU generation speeds
- 80GB VRAM — serve multiple models simultaneously (SD, SDXL, FLUX)
- FP8 support — run quantized models with minimal quality loss for higher throughput
- $2.69/hr on JarvisLabs — worth it when throughput per dollar matters
The H100 makes sense when you're running a production service and need maximum images-per-second. For individual generation, the RTX 4090 is more cost-effective.
GPU Comparison for SDXL
| GPU | VRAM | SDXL Speed (approx) | Price/hr | Cost per 1000 images |
|---|---|---|---|---|
| H100 | 80GB | ~1.5-2s/image | $2.69 | ~$1.50-2.00 |
| A100 80GB | 80GB | ~3-4s/image | $1.49 | ~$1.50-2.00 |
| RTX 4090 | 24GB | ~3-5s/image | $0.59 | ~$0.80-1.20 |
| RTX 6000 Ada | 48GB | ~3-4s/image | $0.99 | ~$1.00-1.50 |
| A6000 | 48GB | ~5-7s/image | $0.79 | ~$1.20-1.80 |
| L4 | 24GB | ~8-12s/image | $0.44 | ~$1.20-1.80 |
| A5000 | 24GB | ~8-10s/image | $0.49 | ~$1.40-1.80 |
| RTX 3090 | 24GB | ~5-7s/image | $0.29 | ~$0.60-0.90 |
Speeds are approximate for SDXL 1024×1024, 20 steps, Euler sampler, batch size 1. Actual performance varies by sampler, step count, extensions, and software stack.
Key insight: The RTX 3090 ($0.29/hr) offers the lowest cost per image for batch workloads where you don't need the latest architecture features. The RTX 4090 offers the best speed-per-dollar for interactive generation.
Optimizing Stable Diffusion Performance
Software Stack Matters
The right software stack can speed up generation by 2-5x regardless of GPU (a sketch enabling the key options follows this list):
- xformers — memory-efficient attention, reduces VRAM usage and speeds up generation
- torch.compile — PyTorch 2.0+ compilation can significantly speed up repeated inference
- TensorRT — NVIDIA's inference optimizer, provides the fastest generation but requires model conversion
- SDPA (Scaled Dot Product Attention) — built into PyTorch 2.0+, automatic optimization
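A minimal sketch of enabling these on a diffusers pipeline (again assuming the `pipe` object from earlier; TensorRT is omitted since it requires a separate model-conversion step):

```python
import torch

# SDPA is the default attention path on PyTorch 2.0+, so it needs no code.

# Option A: torch.compile. One-time compile cost on the first call, then
# faster inference for every subsequent call with the same shapes.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# Option B: xformers attention (`pip install xformers`), mostly useful on
# older PyTorch versions where SDPA is unavailable:
# pipe.enable_xformers_memory_efficient_attention()
```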
ComfyUI vs Automatic1111
ComfyUI is generally more memory-efficient than Automatic1111 WebUI, especially for complex workflows with multiple models. If you're hitting VRAM limits, switching to ComfyUI can help.
Batch Generation Tips
For generating large batches of images (a minimal loop follows this list):
- Use the largest batch size that fits in VRAM — GPU utilization improves with larger batches
- Keep models loaded — loading/unloading models between generations wastes time
- Use FP16 — half precision is the standard for inference, with negligible quality loss vs FP32
- Consider L4 or RTX 3090 for overnight batches — lower hourly cost matters more than speed for batch jobs
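A sketch putting these tips together, with placeholder prompts — the pipeline is loaded once and stays on the GPU, and `num_images_per_prompt` sets the on-GPU batch size:

```python
prompts = ["a watercolor fox", "a cyberpunk alley at night"]  # placeholder prompts

for j, prompt in enumerate(prompts):
    # Raise num_images_per_prompt until you hit an out-of-memory error,
    # then back off one step: larger batches improve GPU utilization.
    images = pipe(prompt=prompt, num_inference_steps=20, num_images_per_prompt=4).images
    for i, img in enumerate(images):
        img.save(f"prompt{j}_{i}.png")
```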
Fine-Tuning and Training
LoRA Training
LoRA fine-tuning is the most common way to customize Stable Diffusion. VRAM requirements:
| Training Task | Minimum GPU | Recommended GPU |
|---|---|---|
| SD 1.5 LoRA (512×512) | 8GB+ GPU | RTX 4090 (24GB) |
| SDXL LoRA (1024×1024) | 16GB+ GPU | RTX 4090 (24GB) or A100 |
| DreamBooth (SD 1.5) | 16GB+ GPU | RTX 4090 (24GB) |
| DreamBooth (SDXL) | 24GB+ GPU | A100 80GB |
| Full fine-tune (SDXL) | 48GB+ GPU | A100 80GB or 2x RTX 4090 |
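As a sketch of what LoRA training sets up, this is roughly the adapter configuration used by diffusers' official SDXL LoRA training examples (requires peft; the rank and target modules here are typical choices, not fixed requirements):

```python
from peft import LoraConfig

unet = pipe.unet
unet.requires_grad_(False)  # the base weights stay frozen

lora_config = LoraConfig(
    r=16,            # rank: 4-32 is typical; higher means more capacity and VRAM
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],  # attention projections
)
unet.add_adapter(lora_config)  # only the adapter weights are trainable from here on
```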
Training Time Estimates
LoRA training for 1,000 steps on SDXL:
| GPU | Approximate Time |
|---|---|
| H100 | 5-10 minutes |
| A100 80GB | 10-15 minutes |
| RTX 4090 | 10-20 minutes |
| A6000 | 15-25 minutes |
| L4 | 25-40 minutes |
Times vary significantly based on resolution, batch size, optimizer, and whether gradient checkpointing is enabled.
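A sketch of the two VRAM knobs mentioned above, assuming the LoRA-wrapped `unet` from the previous snippet (requires bitsandbytes; the slowdown figure is approximate):

```python
import bitsandbytes as bnb

# Gradient checkpointing recomputes activations during the backward pass:
# saves substantial VRAM at a cost of roughly 20-30% per training step.
unet.enable_gradient_checkpointing()

params = [p for p in unet.parameters() if p.requires_grad]  # LoRA weights only
optimizer = bnb.optim.AdamW8bit(params, lr=1e-4)  # 8-bit optimizer states
```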
Which Resolution Needs Which GPU?
| Resolution | VRAM Needed (SDXL) | Minimum GPU |
|---|---|---|
| 512×512 | ~6-8GB | Any 8GB+ GPU |
| 768×768 | ~8-10GB | 12GB+ GPU |
| 1024×1024 (native SDXL) | ~10-12GB | 12GB+ GPU |
| 1024×1024 + ControlNet | ~14-18GB | 24GB GPU |
| 1536×1536 | ~18-22GB | 24GB GPU |
| 2048×2048 | ~28-35GB | 48GB+ GPU (A6000, RTX 6000 Ada) |
For resolutions above 1024×1024, consider using SDXL's native 1024×1024 generation followed by an upscaler (Real-ESRGAN, 4x-UltraSharp) rather than generating at high resolution directly.
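One way to implement that two-stage pattern without leaving diffusers (recent versions) is img2img refinement at low strength; a dedicated upscaler like Real-ESRGAN would replace the plain resize step in this sketch:

```python
from diffusers import StableDiffusionXLImg2ImgPipeline

prompt = "a misty mountain lake at dawn"
base = pipe(prompt=prompt, width=1024, height=1024).images[0]  # native SDXL pass

img2img = StableDiffusionXLImg2ImgPipeline.from_pipe(pipe)  # reuses loaded weights
final = img2img(
    prompt=prompt,
    image=base.resize((2048, 2048)),  # a dedicated upscaler would slot in here
    strength=0.3,  # low strength adds detail without changing composition
).images[0]
```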
FAQ
What is the minimum GPU for Stable Diffusion?
SD 1.5 runs on 4GB+ GPUs. SDXL needs at least 8GB (tight) and is comfortable with 12-24GB. For a good experience with SDXL and extensions, 24GB (RTX 4090, L4) is recommended.
Is RTX 4090 or A100 better for Stable Diffusion?
RTX 4090 for inference — it's faster per dollar for image generation. A100 for training — the 80GB VRAM enables larger batch sizes and higher-resolution training. For most users generating images, RTX 4090 is the pick.
Can I run SDXL and ControlNet together?
Yes, with 24GB VRAM (RTX 4090, L4, or better). SDXL base model (~6GB) plus ControlNet (~2-4GB) plus generation buffers fit comfortably in 24GB. Running SDXL + refiner + ControlNet simultaneously may need more.
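A minimal sketch of that combination with diffusers, assuming the diffusers/controlnet-canny-sdxl-1.0 checkpoint and a precomputed edge map:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe_cn = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = pipe_cn(
    prompt="a modern glass house in a forest",
    image=canny_edge_map,  # placeholder: a PIL edge map from a canny preprocessor
    controlnet_conditioning_scale=0.5,
).images[0]
```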
How much does it cost to generate 10,000 images?
On JarvisLabs with an RTX 4090 ($0.59/hr) generating SDXL at ~4 seconds per image: 10,000 images ≈ 11 hours ≈ $6.50. With an RTX 3090 ($0.29/hr) at ~6 seconds per image: ~17 hours ≈ $4.90. Check our pricing page.
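The arithmetic behind those estimates, as a small helper you can adapt to any GPU and batch size:

```python
def batch_cost(num_images: int, seconds_per_image: float, price_per_hour: float) -> float:
    """Total cost of a batch job: generation time in hours times the hourly rate."""
    return num_images * seconds_per_image / 3600 * price_per_hour

print(f"${batch_cost(10_000, 4, 0.59):.2f}")  # RTX 4090: ~$6.56
print(f"${batch_cost(10_000, 6, 0.29):.2f}")  # RTX 3090: ~$4.83
```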
Should I buy a GPU or rent cloud GPUs for Stable Diffusion?
Buying an RTX 4090 ($1,599 MSRP) breaks even against $0.59/hr cloud rental after roughly 2,700 GPU-hours, which is a few months of near-continuous use. If you generate intermittently or need multiple GPUs for batch jobs, cloud rental is more flexible and avoids the upfront cost.
What GPU do I need for FLUX?
FLUX requires more VRAM than SDXL — 12GB minimum, 24GB recommended. See our complete FLUX GPU guide for detailed recommendations.