NVIDIA H200 GPU
From $3.80/hr — billed by the minute
The highest-memory Hopper GPU. 141GB HBM3e at 4.8 TB/s bandwidth. Run Llama 70B in full FP16 on a single GPU. No quantization compromises.
Powering teams that push boundaries
Trusted by companies including: Tesla, Hugging Face, Kaggle, Zoho, Weights & Biases, upGrad, Saama
Maximum memory meets maximum bandwidth
The H200 delivers the memory and bandwidth needed for the largest AI inference and training workloads — without compromises.
141GB HBM3e Memory
76% more than H100. Fit 70B models in FP16 without quantization. Run multiple models simultaneously.
4.8 TB/s Memory Bandwidth
43% faster than H100. LLM decode is memory-bandwidth bound, so token generation scales with bandwidth; see the back-of-envelope sketch after these highlights. Faster inference for every LLM.
Zero-Compromise Inference
No quantization needed for 70B models. Full FP16 precision means maximum output quality.
Same Hopper Ecosystem
Identical CUDA/software stack as H100. Same Transformer Engine, same NVLink. Migration is seamless.
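Why bandwidth sets the ceiling: at batch size 1, generating each token streams every model weight through the GPU once, so decode throughput tops out near bandwidth divided by model size. A back-of-envelope sketch in Python, using nominal figures rather than measured benchmarks:

```python
# Roofline estimate: single-stream decode reads all weights per token,
# so tokens/s is capped at memory bandwidth / bytes of weights.

def decode_ceiling_tokens_per_sec(params_billions: float,
                                  bytes_per_param: int,
                                  bandwidth_gb_s: float) -> float:
    model_gb = params_billions * bytes_per_param  # weight footprint in GB
    return bandwidth_gb_s / model_gb

# 70B-class model in FP16 (2 bytes per parameter)
print(f"H200: {decode_ceiling_tokens_per_sec(70, 2, 4800):.0f} tok/s ceiling")
print(f"H100: {decode_ceiling_tokens_per_sec(70, 2, 3350):.0f} tok/s ceiling")
```

The ~43% gap between the two ceilings mirrors the bandwidth gap; real throughput also depends on batch size, kernels, and KV-cache traffic.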
Key specs at a glance
141GB of VRAM — rent a single GPU, not eight
Transparent, per-minute billing with no hidden fees. Pause anytime — only pay for active minutes.
| Provider | H200 $/hr | Billing | Min. GPUs | Pre-configured |
|---|---|---|---|---|
| Jarvislabs | $3.80 | Per minute | 1 | ✓ |
| AWS | ~$15–20/GPU | Per second | 8 (bundled) | — |
| Google Cloud | Varies | Per second | Varies | — |
| RunPod | $4.49–5.49 | Per second | 1 | — |
| Lambda | Varies | Per hour | 1 | — |
H200's 141GB eliminates the need for model quantization on 70B LLMs. Run at full FP16 precision with memory to spare for large KV caches and batch sizes.
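The arithmetic behind that claim, sketched in Python (parameter counts are nominal; real checkpoints vary by a few percent):

```python
# FP16 stores 2 bytes per parameter, so the weights-only footprint
# is simply params x 2.
def fp16_weight_gb(params_billions: float) -> float:
    return params_billions * 2  # GB of weights at 2 bytes/param

print(fp16_weight_gb(70))         # ~140 GB of weights alone
print(fp16_weight_gb(70) <= 141)  # True: within the H200's 141GB
print(fp16_weight_gb(70) <= 80)   # False: an 80GB H100 forces quantization or sharding
```

The VRAM left after weights is what holds activations and the KV cache, so exact headroom depends on the checkpoint's true parameter count.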
What developers build on the H200
From full-precision LLM serving to large-scale training without compromises.
Full-Precision LLM Serving
Serve Llama 70B in FP16 on a single GPU. 141GB means no quantization. Maximum quality output for production; see the serving sketch after these use cases.
Long-Context Inference
141GB handles massive KV caches. Serve 100K+ token contexts without memory pressure.
Multi-Model Serving
Load multiple models simultaneously. Run a 70B model + embedding model + reranker on one GPU.
Large Model Training
Train 70B+ models with larger batch sizes. 8x H200 = 1,128GB for massive training runs.
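A minimal single-GPU serving sketch using Hugging Face Transformers (one possible stack; the model ID is illustrative, and it assumes your chosen checkpoint's FP16 weights fit alongside runtime overhead):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-chat-hf"  # illustrative 70B checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # full FP16, no quantization
    device_map="auto",          # place all weights on the single H200
)

inputs = tok("Explain HBM3e in one sentence.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```

For production throughput you would typically front this with a dedicated serving engine such as vLLM or TGI; the memory math stays the same.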
Technical specifications
Complete hardware specifications for the NVIDIA H200 data center GPU.
| Specification | Value | Great for |
|---|---|---|
| Architecture | NVIDIA Hopper (enhanced) | Next-gen AI performance |
| CUDA Cores | 16,896 | General-purpose GPU compute |
| Tensor Cores | 528 (4th gen) | FP8/FP16/BF16/TF32/INT8/FP64 |
| VRAM | 141 GB HBM3e (ECC) | 70B models in FP16 on single GPU |
| Memory Bandwidth | 4,800 GB/s | 43% faster than H100 |
| FP16 Tensor | 989 TFLOPS | Mixed-precision training & inference |
| BF16 Tensor | 989 TFLOPS | LLM training (preferred) |
| TF32 Tensor | 989 TFLOPS | Auto mixed-precision training |
| FP8 Tensor | 1,979 TFLOPS | Transformer Engine optimized |
| INT8 Tensor | 1,979 TOPS | Quantized inference |
| NVLink | 900 GB/s bidirectional | Multi-GPU scaling |
| PCIe | Gen5 x16 | Host data transfer |
| TDP | 700W (SXM) | Maximum performance |
| Manufacturing | TSMC 4N | Advanced process node |
| Multi-GPU | Up to 8x per instance | 1,128GB combined HBM3e |
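Once an instance is running, a short PyTorch check (standard API, nothing Jarvislabs-specific) confirms the hardware matches the table above:

```python
import torch

props = torch.cuda.get_device_properties(0)
print(props.name)                                   # e.g. "NVIDIA H200"
print(f"{props.total_memory / 1024**3:.0f} GiB")    # ~141 GB of HBM3e
print(torch.cuda.device_count(), "GPU(s) visible")  # 1-8 depending on config
```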
Launch your H200 instance in seconds
Three simple steps from sign-up to a running GPU instance.
Choose Template
PyTorch 2.x (with Transformers, DeepSpeed, PEFT, Accelerate), TensorFlow, JAX, or clean CUDA.
Configure & Launch
Select H200, 1–8 GPUs, allocate storage. Templates ready in seconds, VMs in under a minute.
Train at Scale
DeepSpeed and PyTorch DDP pre-configured for multi-GPU. Pause when idle, resume from checkpoint.
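A minimal DDP skeleton of the kind the templates are wired for; the model and training loop are placeholders. On an 8x H200 instance you would launch it with, for example, torchrun --nproc_per_node=8 train.py:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")  # NCCL rides NVLink between GPUs
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(4096, 4096).cuda(rank)  # stand-in for a real model
    model = DDP(model, device_ids=[rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                        # stand-in training loop
        x = torch.randn(32, 4096, device=rank)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                        # gradients all-reduced across GPUs
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```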
Manage via CLI
Create and manage H200 instances from your terminal.
```
jl create --gpu H200
```

Frequently asked questions
Everything you need to know about renting the NVIDIA H200 on Jarvislabs.
How does the H200 compare to the H100?
76% more memory (141GB vs 80GB) and 43% more bandwidth (4.8 vs 3.35 TB/s). Same compute throughput. Choose H200 when memory or bandwidth is your bottleneck.
Start running on the NVIDIA H200 in seconds
$3.80/hr with per-minute billing. 141GB HBM3e. Up to 8 GPUs. No commitments.