What is the FLOPS Performance of the NVIDIA H100 GPU?

Vishnu Subramanian
Founder @JarvisLabs.ai

The NVIDIA H100 delivers exceptional FLOPS performance across different precisions: up to 989 TFLOPS for FP8 Tensor operations, 495 TFLOPS for FP16, and 60 TFLOPS for FP64 computing. This represents roughly a 3x improvement in double-precision Tensor Core performance over the A100.

Understanding H100 FLOPS Performance

FLOPS (floating-point operations per second) is the key metric for measuring GPU compute performance. The H100 triples the double-precision Tensor Core FLOPS of the A100, delivering 60 teraflops of FP64 computing for HPC.

The H100's theoretical peak approaches 1 petaFLOP of AI compute using FP8 precision, designed for next-generation, large-scale AI applications. That is roughly 1,000 TFLOPS, a staggering number that showcases the H100's capability for the most demanding AI workloads.
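Where do headline figures like 989 TFLOPS come from? Peak tensor throughput is essentially tensor cores × clock speed × FLOPs per tensor core per clock. The sketch below is illustrative only: the boost clock and per-core rates are assumptions back-solved from the figures above, not official per-core specs.

```python
def peak_tflops(tensor_cores: int, boost_clock_ghz: float, flops_per_core_per_clock: int) -> float:
    """Theoretical peak throughput in TFLOPS for dense tensor-core math."""
    return tensor_cores * boost_clock_ghz * 1e9 * flops_per_core_per_clock / 1e12

# H100 SXM5: 528 tensor cores. The ~1.83 GHz boost clock and the per-core FLOP
# rates below are illustrative assumptions chosen to reproduce the headline figures.
print(f"FP16 tensor peak: ~{peak_tflops(528, 1.83, 512):.0f} TFLOPS")
print(f"FP8 tensor peak:  ~{peak_tflops(528, 1.83, 1024):.0f} TFLOPS")
```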

H100 FLOPS Breakdown by Precision

| Precision | H100 SXM5 Performance | H100 PCIe Performance | Use Case |
| --- | --- | --- | --- |
| FP8 Tensor | ~989 TFLOPS | ~850 TFLOPS | Large Language Models, Transformers |
| FP16 Tensor | ~495 TFLOPS | ~204.9 TFLOPS | AI Training, Mixed Precision |
| BF16 | ~495 TFLOPS | ~204.9 TFLOPS | AI Training, Stable Training |
| TF32 | ~1 petaflop for single-precision matrix-multiply operations | ~83 TFLOPS | Zero-code AI acceleration |
| FP32 | ~83 TFLOPS | ~51.22 TFLOPS | General compute, legacy workloads |
| FP64 | 60 teraflops | ~25.61 TFLOPS | HPC, Scientific computing |

The standout figure is the FP64 Tensor Core rate: up to approximately 67 TFLOPS on the SXM5 variant, which significantly outpaces the standard (non-Tensor Core) FP64 rate.
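How much of these peaks you actually see is easy to measure yourself: time a large matrix multiply and convert to TFLOPS. A minimal PyTorch sketch, assuming a CUDA build of PyTorch 2.x on an H100; achieved numbers vary with matrix size, clocks, and thermals.

```python
import time
import torch

def measured_tflops(dtype: torch.dtype, n: int = 8192, iters: int = 50) -> float:
    """Time an (n x n) @ (n x n) matmul and convert to achieved TFLOPS."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):                      # warm-up iterations
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    flops = 2 * n**3 * iters                # 2*n^3 FLOPs per dense matmul
    return flops / elapsed / 1e12

torch.backends.cuda.matmul.allow_tf32 = True   # route FP32 matmuls through TF32 tensor cores
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    print(f"{str(dtype):>16}: {measured_tflops(dtype):7.1f} TFLOPS")
```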

SXM5 vs PCIe Performance Comparison

The form factor dramatically affects FLOPS performance:

H100 SXM5 (High-End)

  • 16896 shading units, 528 tensor cores
  • 700W TDP enables maximum performance
  • Over 3 TB/s of memory bandwidth
  • Optimized for multi-GPU deployments

H100 PCIe (Mainstream)

  • 14592 shading units, 456 tensor cores
  • 350W TDP for standard server configurations
  • 2.04 TB/s memory bandwidth
  • 65% of SXM5 performance at 50% power consumption

From my experience, the SXM5 delivers approximately 1.5-2x higher FLOPS across all precisions compared to the PCIe variant—the extra power budget and tensor cores make a tangible difference.
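If you are not sure which variant a cloud instance exposes, the device name and SM count usually tell you. A minimal PyTorch check; the 132/114 SM counts in the comments follow from the shading-unit counts above (128 CUDA cores per SM).

```python
import torch

props = torch.cuda.get_device_properties(0)
print(f"Name:   {props.name}")
print(f"SMs:    {props.multi_processor_count}")        # 132 on SXM5, 114 on PCIe
print(f"Memory: {props.total_memory / 1e9:.0f} GB")

# 16896 shading units / 128 per SM = 132 SMs (SXM5); 14592 / 128 = 114 SMs (PCIe)
if props.multi_processor_count >= 132:
    print("Likely H100 SXM5")
else:
    print("Likely H100 PCIe (or another part)")
```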

How H100 FLOPS Translate to Real Performance

Raw FLOPS numbers tell only part of the story. NVIDIA reports up to a 9x speedup in training and up to 30x higher inference throughput for the H100 compared to the A100.

The fourth-generation Tensor Cores enable these gains through:

  • FP8 precision support: Unprecedented throughput for transformer models (see the sketch after this list)
  • Improved sparsity handling: Better utilization of sparse tensors common in AI
  • Enhanced memory hierarchy: 50 MB L2 cache enables caching of even larger portions of models
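For the FP8 path referenced above, NVIDIA's Transformer Engine library provides drop-in layers that run their matmuls in FP8 with automatic scaling. A minimal sketch, assuming the transformer-engine package is installed alongside PyTorch; the layer and batch sizes are arbitrary placeholders.

```python
import torch
import transformer_engine.pytorch as te

# Replace nn.Linear with te.Linear so the matmuls can run in FP8 on H100.
model = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# fp8_autocast handles FP8 casting and scaling factors; outside this context,
# the same layers fall back to the usual BF16/FP16 paths.
with te.fp8_autocast(enabled=True):
    y = model(x)

print(y.shape, y.dtype)
```

In practice you would wrap larger blocks the same way; Transformer Engine also ships fused modules such as te.LayerNormLinear and te.TransformerLayer for this purpose.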

Architectural Improvements Behind the FLOPS

The H100 SM quadruples the A100's peak per-SM floating-point throughput thanks to the introduction of FP8, and doubles the A100's raw per-SM throughput on all previous Tensor Core, FP32, and FP64 data types.

Key architectural enhancements:

  • 4th Gen Tensor Cores: Native FP8 and improved sparsity support
  • Transformer Engine: Automatic mixed-precision optimization for LLMs
  • HBM3 Memory: Nearly 2x the memory bandwidth of the A100

Practical FLOPS Considerations

While the theoretical FLOPS are impressive, real-world performance depends on:

Memory Bandwidth Limitations: Even with exceptional FLOPS, memory-bound workloads won't fully utilize compute capacity. The H100's 3+ TB/s of memory bandwidth helps, but bandwidth remains a consideration for large models.
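A quick roofline-style calculation shows why: a kernel only becomes compute-bound once its arithmetic intensity (FLOPs per byte moved) exceeds peak FLOPS divided by memory bandwidth. The sketch below uses the headline figures from this article and an assumed ~3.35 TB/s of SXM5 HBM3 bandwidth.

```python
# Roofline ridge point: arithmetic intensity (FLOPs per byte) above which a kernel
# becomes compute-bound rather than memory-bound.
PEAK_TFLOPS = {"FP8": 989, "FP16": 495, "FP64": 60}    # headline figures from this article
BANDWIDTH_TBS = 3.35                                   # assumed H100 SXM5 HBM3 bandwidth

for precision, tflops in PEAK_TFLOPS.items():
    ridge = tflops / BANDWIDTH_TBS                     # FLOPs per byte needed to saturate compute
    print(f"{precision}: need > {ridge:.0f} FLOPs per byte moved to be compute-bound")
```

Large, dense matrix multiplies clear these bars easily; element-wise ops and small-batch inference usually do not, which is why they stay memory-bound no matter how many TFLOPS are on paper.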

Precision Trade-offs: FP8 offers maximum FLOPS but requires careful validation for accuracy-sensitive applications. FP16 provides a good balance for most AI workloads.

Tensor Core Utilization: Maximum FLOPS require tensor-optimized operations. Generic compute may not achieve peak performance figures.
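One easy-to-miss detail: matmuls map onto tensor-core tiles most efficiently when their dimensions line up with tile-friendly sizes, commonly cited as multiples of 8 for FP16/BF16 and 16 for FP8/INT8. A small helper to flag awkward shapes; treat the multiples as a rule of thumb rather than a hard guarantee.

```python
def tensor_core_friendly(m: int, n: int, k: int, multiple: int = 8) -> bool:
    """Rule of thumb: matmul dims that are multiples of 8 (FP16/BF16) or 16 (FP8/INT8)
    map cleanly onto tensor-core tiles; odd sizes can fall back to slower paths."""
    return all(dim % multiple == 0 for dim in (m, n, k))

print(tensor_core_friendly(4096, 4096, 4096))   # True  - clean tile mapping
print(tensor_core_friendly(4096, 4093, 4096))   # False - consider padding to 4096
```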

Cost vs FLOPS Analysis

At JarvisLabs, we price the H100 SXM at ₹242.19/hour (about $2.99/hour USD). Here's how the FLOPS/dollar economics work out:

  • FP16 performance: ~165 TFLOPS per dollar per hour
  • FP64 performance: ~20 TFLOPS per dollar per hour

Compared to the A100 at ₹104.49/hour, the H100 provides roughly 2x the FLOPS/dollar for FP16 workloads despite the higher cost.
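The arithmetic behind those per-dollar figures is simple enough to sanity-check; here is a quick sketch using the hourly price and headline TFLOPS quoted above.

```python
# TFLOPS per dollar-hour for the H100 SXM at the hourly price quoted above.
PRICE_USD_PER_HOUR = 2.99
TFLOPS = {"FP16 Tensor": 495, "FP64 Tensor Core": 60}   # headline figures used in this article

for precision, tflops in TFLOPS.items():
    print(f"{precision}: {tflops / PRICE_USD_PER_HOUR:.1f} TFLOPS per dollar-hour")
```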

When H100 FLOPS Matter Most

Choose H100 for maximum FLOPS when:

  • Training large language models: FP8 precision shines for transformer architectures
  • Real-time inference: High FLOPS enable low-latency responses
  • HPC simulations: FP64 Tensor Core acceleration benefits scientific computing
  • Multi-modal AI: High tensor throughput handles complex model architectures

Optimization Tips for Maximum FLOPS

From our experience deploying thousands of H100 instances:

  1. Use mixed precision: Let the Transformer Engine automatically optimize precision (see the training-loop sketch after this list)
  2. Batch size optimization: Larger batches better utilize tensor cores
  3. Memory layout: Ensure tensors are properly aligned for maximum throughput
  4. Framework selection: PyTorch 2.x and TensorFlow 2.13+ have the best H100 optimizations
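Putting tips 1 through 3 together, a minimal PyTorch mixed-precision training step might look like the sketch below; the model, data, and sizes are placeholders, and the Transformer Engine FP8 layers shown earlier can be slotted in the same way.

```python
import torch
from torch import nn

torch.set_float32_matmul_precision("high")        # allow TF32 for any remaining FP32 matmuls

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()              # needed for float16; bfloat16 can skip it

x = torch.randn(256, 4096, device="cuda")         # large batch keeps the tensor cores busy
target = torch.randn(256, 4096, device="cuda")

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()                 # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
```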

Key Takeaways

The H100's FLOPS performance represents a generational leap: 60 teraFLOPS of FP64 Tensor Core compute for scientific computing, nearly 1 petaFLOP of FP8 compute for AI workloads, and architectural improvements that translate theoretical performance into real-world gains.

For most AI practitioners, the H100's combination of FP8 tensor performance and smart precision management delivers the compute power needed for today's ambitious models—and tomorrow's even larger ones.

What specific workload are you optimizing? Understanding your precision requirements and memory patterns can help determine whether the H100's impressive FLOPS translate into meaningful performance improvements for your use case.

Build & Deploy Your AI in Minutes

Get started with JarvisLabs today and experience the power of cloud GPU infrastructure designed specifically for AI development.
