Best GPU for Stable Diffusion: SDXL, SD 1.5, and FLUX (2026 Guide)
For Stable Diffusion XL inference, an RTX 4090 (24GB, $0.59/hr on JarvisLabs) is the sweet spot — fast generation, enough VRAM for high-resolution outputs, and great value. For SD 1.5, even an L4 (24GB, $0.44/hr) works well. For FLUX, see our dedicated FLUX GPU guide. For training or fine-tuning custom models, an A100 80GB gives you room for larger batch sizes and LoRA training at higher resolutions.
VRAM Requirements by Model
| Model | Minimum VRAM | Recommended VRAM | Notes |
|---|---|---|---|
| SD 1.5 | 4GB | 8-12GB | Runs on almost anything |
| SDXL | 8GB | 12-24GB | 1024×1024 base resolution |
| SDXL + Refiner | 12GB | 24GB | Two models loaded simultaneously |
| SD 3 Medium | 8GB | 16-24GB | MMDiT architecture |
| FLUX.1 Dev | 12GB | 24GB+ | See FLUX GPU guide |
| FLUX.1 Schnell | 10GB | 16-24GB | Faster, lighter FLUX variant |
These are for inference with default settings. Higher resolutions, larger batch sizes, ControlNet, IP-Adapter, and other extensions increase VRAM requirements.
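To make these numbers concrete, here is a minimal load sketch using Hugging Face diffusers, assuming the stabilityai/stable-diffusion-xl-base-1.0 checkpoint; the commented lines show the usual VRAM-saving fallbacks for smaller cards:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,  # FP16 roughly halves weight memory vs FP32
    variant="fp16",
)
pipe.to("cuda")  # on 24GB cards, keep everything resident on the GPU

# On smaller cards (8-12GB), trade speed for VRAM instead of .to("cuda"):
# pipe.enable_model_cpu_offload()  # parks idle submodules in CPU RAM
# pipe.enable_vae_tiling()         # tiles VAE decode for high resolutions
```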
GPU Recommendations
Best Overall: NVIDIA RTX 4090 (24GB)
The RTX 4090 hits the sweet spot for Stable Diffusion:
- 24GB VRAM — handles SDXL at full resolution with room for ControlNet, LoRAs, and upscaling
- Fast generation — SDXL 1024×1024 in ~3-5 seconds per image (20 steps, Euler sampler)
- Good tensor performance — 4th-gen Tensor Cores accelerate both inference and training
- $0.59/hr on JarvisLabs — best price-to-performance for image generation
For most Stable Diffusion users — whether generating images, running ComfyUI workflows, or fine-tuning LoRAs — the RTX 4090 is the GPU to pick. Check our pricing page.
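To sanity-check the ~3-5s figure above on your own instance, here is a timing sketch at the same settings (1024×1024, 20 steps, Euler sampler), assuming the `pipe` object from the previous snippet:

```python
import time
from diffusers import EulerDiscreteScheduler

pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
prompt = "a photo of an astronaut riding a horse"

pipe(prompt=prompt, num_inference_steps=20)  # warmup run: the first call is slower

start = time.perf_counter()
image = pipe(prompt=prompt, width=1024, height=1024, num_inference_steps=20).images[0]
print(f"{time.perf_counter() - start:.2f}s per image")
image.save("astronaut.png")
```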
Best Budget: NVIDIA L4 (24GB)
The L4 is surprisingly capable for image generation:
- 24GB VRAM — same memory as RTX 4090, so it runs SDXL and FLUX without issues
- Slower generation — roughly 2-3x slower than RTX 4090 for inference
- $0.44/hr on JarvisLabs — cheapest option that comfortably runs SDXL
- Low power — 72W TDP, efficient for batch generation
Best for: batch generation jobs where speed per image matters less than cost per image. If you're generating thousands of images overnight, the L4's lower hourly rate can be more cost-effective despite slower generation.
Best for Training: NVIDIA A100 80GB
For fine-tuning Stable Diffusion models or training LoRAs:
- 80GB VRAM — large batch sizes, high-resolution training (1024×1024+), multiple models in memory
- High memory bandwidth (2.0 TB/s) — keeps the compute units fed when moving large activations and weights during training
- $1.49/hr on JarvisLabs — reasonable for training sessions that take hours
- Multi-GPU support — scale to 2-4 A100s for faster training
Fine-tuning SDXL LoRAs works on an RTX 4090 (24GB) for most cases, but full model fine-tuning and DreamBooth training at high resolutions benefit from the A100's extra memory.
Best for Production: NVIDIA H100 (80GB)
For production image generation APIs serving many users:
- Highest throughput — fastest single-GPU generation speeds
- 80GB VRAM — serve multiple models simultaneously (SD, SDXL, FLUX)
- FP8 support — run quantized models with minimal quality loss for higher throughput
- $2.69/hr on JarvisLabs — worth it when throughput per dollar matters
The H100 makes sense when you're running a production service and need maximum images-per-second. For individual generation, the RTX 4090 is more cost-effective.
GPU Comparison for SDXL
| GPU | VRAM | SDXL Speed (approx) | Price/hr | Cost per 1000 images |
|---|---|---|---|---|
| H100 | 80GB | ~1.5-2s/image | $2.69 | ~$1.50-2.00 |
| A100 80GB | 80GB | ~3-4s/image | $1.49 | ~$1.50-2.00 |
| RTX 4090 | 24GB | ~3-5s/image | $0.59 | ~$0.80-1.20 |
| RTX 6000 Ada | 48GB | ~3-4s/image | $0.99 | ~$1.00-1.50 |
| A6000 | 48GB | ~5-7s/image | $0.79 | ~$1.20-1.80 |
| L4 | 24GB | ~8-12s/image | $0.44 | ~$1.20-1.80 |
| A5000 | 24GB | ~8-10s/image | $0.49 | ~$1.40-1.80 |
| RTX 3090 | 24GB | ~5-7s/image | $0.29 | ~$0.60-0.90 |
Speeds are approximate for SDXL 1024×1024, 20 steps, Euler sampler, batch size 1. Actual performance varies by sampler, step count, extensions, and software stack.
Key insight: The RTX 3090 ($0.29/hr) offers the lowest cost per image for batch workloads where you don't need the latest architecture features. The RTX 4090 offers the best speed-per-dollar for interactive generation.
Optimizing Stable Diffusion Performance
Software Stack Matters
The right software stack can speed up generation by 2-5x regardless of GPU (a sketch enabling the key options follows this list):
- xformers — memory-efficient attention, reduces VRAM usage and speeds up generation
- torch.compile — PyTorch 2.0+ compilation can significantly speed up repeated inference
- TensorRT — NVIDIA's inference optimizer, provides the fastest generation but requires model conversion
- SDPA (Scaled Dot Product Attention) — built into PyTorch 2.0+, automatic optimization
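A minimal sketch of enabling these on a diffusers pipeline (again assuming the `pipe` object from earlier; TensorRT is omitted since it requires a separate model-conversion step):

```python
import torch

# SDPA is the default attention path on PyTorch 2.0+, so it needs no code.

# Option A: torch.compile. One-time compile cost on the first call, then
# faster inference for every subsequent call with the same shapes.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# Option B: xformers attention (`pip install xformers`), mostly useful on
# older PyTorch versions where SDPA is unavailable:
# pipe.enable_xformers_memory_efficient_attention()
```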
ComfyUI vs Automatic1111
ComfyUI is generally more memory-efficient than Automatic1111 WebUI, especially for complex workflows with multiple models. If you're hitting VRAM limits, switching to ComfyUI can help.
Batch Generation Tips
For generating large batches of images (a minimal loop follows this list):
- Use the largest batch size that fits in VRAM — GPU utilization improves with larger batches
- Keep models loaded — loading/unloading models between generations wastes time
- Use FP16 — half precision is the standard for inference, with negligible quality loss vs FP32
- Consider L4 or RTX 3090 for overnight batches — lower hourly cost matters more than speed for batch jobs
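A sketch putting these tips together, with placeholder prompts — the pipeline is loaded once and stays on the GPU, and `num_images_per_prompt` sets the on-GPU batch size:

```python
prompts = ["a watercolor fox", "a cyberpunk alley at night"]  # placeholder prompts

for j, prompt in enumerate(prompts):
    # Raise num_images_per_prompt until you hit an out-of-memory error,
    # then back off one step: larger batches improve GPU utilization.
    images = pipe(prompt=prompt, num_inference_steps=20, num_images_per_prompt=4).images
    for i, img in enumerate(images):
        img.save(f"prompt{j}_{i}.png")
```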
Fine-Tuning and Training
LoRA Training
LoRA fine-tuning is the most common way to customize Stable Diffusion. VRAM requirements:
| Training Task | Minimum GPU | Recommended GPU |
|---|---|---|
| SD 1.5 LoRA (512×512) | 8GB+ GPU | RTX 4090 (24GB) |
| SDXL LoRA (1024×1024) | 16GB+ GPU | RTX 4090 (24GB) or A100 |
| DreamBooth (SD 1.5) | 16GB+ GPU | RTX 4090 (24GB) |
| DreamBooth (SDXL) | 24GB+ GPU | A100 80GB |
| Full fine-tune (SDXL) | 48GB+ GPU | A100 80GB or 2x RTX 4090 |
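As a sketch of what LoRA training sets up, this is roughly the adapter configuration used by diffusers' official SDXL LoRA training examples (requires peft; the rank and target modules here are typical choices, not fixed requirements):

```python
from peft import LoraConfig

unet = pipe.unet
unet.requires_grad_(False)  # the base weights stay frozen

lora_config = LoraConfig(
    r=16,            # rank: 4-32 is typical; higher means more capacity and VRAM
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],  # attention projections
)
unet.add_adapter(lora_config)  # only the adapter weights are trainable from here on
```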
Training Time Estimates
LoRA training for 1,000 steps on SDXL:
| GPU | Approximate Time |
|---|---|
| H100 | 5-10 minutes |
| A100 80GB | 10-15 minutes |
| RTX 4090 | 10-20 minutes |
| A6000 | 15-25 minutes |
| L4 | 25-40 minutes |
Times vary significantly based on resolution, batch size, optimizer, and whether gradient checkpointing is enabled.
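A sketch of the two VRAM knobs mentioned above, assuming the LoRA-wrapped `unet` from the previous snippet (requires bitsandbytes; the slowdown figure is approximate):

```python
import bitsandbytes as bnb

# Gradient checkpointing recomputes activations during the backward pass:
# saves substantial VRAM at a cost of roughly 20-30% per training step.
unet.enable_gradient_checkpointing()

params = [p for p in unet.parameters() if p.requires_grad]  # LoRA weights only
optimizer = bnb.optim.AdamW8bit(params, lr=1e-4)  # 8-bit optimizer states
```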
Which Resolution Needs Which GPU?
| Resolution | VRAM Needed (SDXL) | Minimum GPU |
|---|---|---|
| 512×512 | ~6-8GB | Any 8GB+ GPU |
| 768×768 | ~8-10GB | 12GB+ GPU |
| 1024×1024 (native SDXL) | ~10-12GB | 12GB+ GPU |
| 1024×1024 + ControlNet | ~14-18GB | 24GB GPU |
| 1536×1536 | ~18-22GB | 24GB GPU |
| 2048×2048 | ~28-35GB | 48GB+ GPU (A6000, RTX 6000 Ada) |
For resolutions above 1024×1024, consider using SDXL's native 1024×1024 generation followed by an upscaler (Real-ESRGAN, 4x-UltraSharp) rather than generating at high resolution directly.
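One way to implement that two-stage pattern without leaving diffusers (recent versions) is img2img refinement at low strength; a dedicated upscaler like Real-ESRGAN would replace the plain resize step in this sketch:

```python
from diffusers import StableDiffusionXLImg2ImgPipeline

prompt = "a misty mountain lake at dawn"
base = pipe(prompt=prompt, width=1024, height=1024).images[0]  # native SDXL pass

img2img = StableDiffusionXLImg2ImgPipeline.from_pipe(pipe)  # reuses loaded weights
final = img2img(
    prompt=prompt,
    image=base.resize((2048, 2048)),  # a dedicated upscaler would slot in here
    strength=0.3,  # low strength adds detail without changing composition
).images[0]
```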
FAQ
What is the minimum GPU for Stable Diffusion?
SD 1.5 runs on 4GB+ GPUs. SDXL needs at least 8GB (tight) and is comfortable with 12-24GB. For a good experience with SDXL and extensions, 24GB (RTX 4090, L4) is recommended.
Is RTX 4090 or A100 better for Stable Diffusion?
RTX 4090 for inference — it's faster per dollar for image generation. A100 for training — the 80GB VRAM enables larger batch sizes and higher-resolution training. For most users generating images, RTX 4090 is the pick.
Can I run SDXL and ControlNet together?
Yes, with 24GB VRAM (RTX 4090, L4, or better). SDXL base model (~6GB) plus ControlNet (~2-4GB) plus generation buffers fit comfortably in 24GB. Running SDXL + refiner + ControlNet simultaneously may need more.
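A minimal sketch of that combination with diffusers, assuming the diffusers/controlnet-canny-sdxl-1.0 checkpoint and a precomputed edge map:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe_cn = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = pipe_cn(
    prompt="a modern glass house in a forest",
    image=canny_edge_map,  # placeholder: a PIL edge map from a canny preprocessor
    controlnet_conditioning_scale=0.5,
).images[0]
```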
How much does it cost to generate 10,000 images?
On JarvisLabs with an RTX 4090 ($0.59/hr) generating SDXL at ~4 seconds per image: 10,000 images ≈ 11 hours ≈ $6.50. With an RTX 3090 ($0.29/hr) at ~6 seconds per image: ~17 hours ≈ $4.90. Check our pricing page.
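The arithmetic behind those estimates, as a small helper you can adapt to any GPU and batch size:

```python
def batch_cost(num_images: int, seconds_per_image: float, price_per_hour: float) -> float:
    """Total cost of a batch job: generation time in hours times the hourly rate."""
    return num_images * seconds_per_image / 3600 * price_per_hour

print(f"${batch_cost(10_000, 4, 0.59):.2f}")  # RTX 4090: ~$6.56
print(f"${batch_cost(10_000, 6, 0.29):.2f}")  # RTX 3090: ~$4.83
```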
Should I buy a GPU or rent cloud GPUs for Stable Diffusion?
Buying an RTX 4090 ($1,599 MSRP) breaks even against $0.59/hr cloud rental after roughly 2,700 GPU-hours, which is a few months of near-continuous use. If you generate intermittently or need multiple GPUs for batch jobs, cloud rental is more flexible and avoids the upfront cost.
What GPU do I need for FLUX?
FLUX requires more VRAM than SDXL — 12GB minimum, 24GB recommended. See our complete FLUX GPU guide for detailed recommendations.