What is the FLOPS Performance of the NVIDIA H100 GPU?

Vishnu Subramanian
Founder @JarvisLabs.ai

The NVIDIA H100 delivers exceptional FLOPS performance across different precisions: up to 989 TFLOPS for FP8 Tensor operations, 495 TFLOPS for FP16, and 60 TFLOPS for FP64 computing. This represents roughly a 3x improvement in double-precision Tensor Core performance over the A100.

Understanding H100 FLOPS Performance

FLOPS (floating-point operations per second) is the key metric for measuring GPU compute performance. The H100 triples the double-precision Tensor Core FLOPS of the A100, delivering 60 teraflops of FP64 computing for HPC.

The H100's theoretical peak approaches 1 petaFLOP of AI compute using FP8 precision, designed for next-generation, large-scale AI applications. That is roughly 1,000 TFLOPS, a staggering number that showcases the H100's capability for the most demanding AI workloads.
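Where do headline figures like 989 TFLOPS come from? Peak tensor throughput is essentially tensor cores × clock speed × FLOPs per tensor core per clock. The sketch below is illustrative only: the boost clock and per-core rates are assumptions back-solved from the figures above, not official per-core specs.

```python
def peak_tflops(tensor_cores: int, boost_clock_ghz: float, flops_per_core_per_clock: int) -> float:
    """Theoretical peak throughput in TFLOPS for dense tensor-core math."""
    return tensor_cores * boost_clock_ghz * 1e9 * flops_per_core_per_clock / 1e12

# H100 SXM5: 528 tensor cores. The ~1.83 GHz boost clock and the per-core FLOP
# rates below are illustrative assumptions chosen to reproduce the headline figures.
print(f"FP16 tensor peak: ~{peak_tflops(528, 1.83, 512):.0f} TFLOPS")
print(f"FP8 tensor peak:  ~{peak_tflops(528, 1.83, 1024):.0f} TFLOPS")
```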

H100 FLOPS Breakdown by Precision

| Precision | H100 SXM5 Performance | H100 PCIe Performance | Use Case |
| --- | --- | --- | --- |
| FP8 Tensor | ~989 TFLOPS | ~850 TFLOPS | Large Language Models, Transformers |
| FP16 Tensor | ~495 TFLOPS | ~204.9 TFLOPS | AI Training, Mixed Precision |
| BF16 | ~495 TFLOPS | ~204.9 TFLOPS | AI Training, Stable Training |
| TF32 | ~1 petaflop for single-precision matrix-multiply operations | ~83 TFLOPS | Zero-code AI acceleration |
| FP32 | ~83 TFLOPS | ~51.22 TFLOPS | General compute, legacy workloads |
| FP64 | 60 teraflops | ~25.61 TFLOPS | HPC, Scientific computing |

The standout figure is the FP64 Tensor Core rate: up to approximately 67 TFLOPS on the SXM5 variant, which significantly outpaces the standard (non-Tensor Core) FP64 rate.
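How much of these peaks you actually see is easy to measure yourself: time a large matrix multiply and convert to TFLOPS. A minimal PyTorch sketch, assuming a CUDA build of PyTorch 2.x on an H100; achieved numbers vary with matrix size, clocks, and thermals.

```python
import time
import torch

def measured_tflops(dtype: torch.dtype, n: int = 8192, iters: int = 50) -> float:
    """Time an (n x n) @ (n x n) matmul and convert to achieved TFLOPS."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):                      # warm-up iterations
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    flops = 2 * n**3 * iters                # 2*n^3 FLOPs per dense matmul
    return flops / elapsed / 1e12

torch.backends.cuda.matmul.allow_tf32 = True   # route FP32 matmuls through TF32 tensor cores
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    print(f"{str(dtype):>16}: {measured_tflops(dtype):7.1f} TFLOPS")
```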

SXM5 vs PCIe Performance Comparison

The form factor dramatically affects FLOPS performance:

H100 SXM5 (High-End)

  • 16896 shading units, 528 tensor cores
  • 700W TDP enables maximum performance
  • Over 3 TB/s of memory bandwidth
  • Optimized for multi-GPU deployments

H100 PCIe (Mainstream)

  • 14592 shading units, 456 tensor cores
  • 350W TDP for standard server configurations
  • 2.04 TB/s memory bandwidth
  • 65% of SXM5 performance at 50% power consumption

From my experience, the SXM5 delivers approximately 1.5-2x higher FLOPS across all precisions compared to the PCIe variant—the extra power budget and tensor cores make a tangible difference.
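If you are not sure which variant a cloud instance exposes, the device name and SM count usually tell you. A minimal PyTorch check; the 132/114 SM counts in the comments follow from the shading-unit counts above (128 CUDA cores per SM).

```python
import torch

props = torch.cuda.get_device_properties(0)
print(f"Name:   {props.name}")
print(f"SMs:    {props.multi_processor_count}")        # 132 on SXM5, 114 on PCIe
print(f"Memory: {props.total_memory / 1e9:.0f} GB")

# 16896 shading units / 128 per SM = 132 SMs (SXM5); 14592 / 128 = 114 SMs (PCIe)
if props.multi_processor_count >= 132:
    print("Likely H100 SXM5")
else:
    print("Likely H100 PCIe (or another part)")
```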

How H100 FLOPS Translate to Real Performance

Raw FLOPS numbers tell only part of the story. NVIDIA reports up to a 9x speedup in training and up to 30x higher inference throughput for the H100 compared to the A100.

The fourth-generation Tensor Cores enable these gains through:

  • FP8 precision support: Unprecedented throughput for transformer models (see the sketch after this list)
  • Improved sparsity handling: Better utilization of sparse tensors common in AI
  • Enhanced memory hierarchy: 50 MB L2 cache enables caching of even larger portions of models
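For the FP8 path referenced above, NVIDIA's Transformer Engine library provides drop-in layers that run their matmuls in FP8 with automatic scaling. A minimal sketch, assuming the transformer-engine package is installed alongside PyTorch; the layer and batch sizes are arbitrary placeholders.

```python
import torch
import transformer_engine.pytorch as te

# Replace nn.Linear with te.Linear so the matmuls can run in FP8 on H100.
model = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# fp8_autocast handles FP8 casting and scaling factors; outside this context,
# the same layers fall back to the usual BF16/FP16 paths.
with te.fp8_autocast(enabled=True):
    y = model(x)

print(y.shape, y.dtype)
```

In practice you would wrap larger blocks the same way; Transformer Engine also ships fused modules such as te.LayerNormLinear and te.TransformerLayer for this purpose.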

Architectural Improvements Behind the FLOPS

The H100 SM quadruples the A100's peak per-SM floating-point throughput thanks to the introduction of FP8, and doubles the A100's raw per-SM throughput on all previous Tensor Core, FP32, and FP64 data types.

Key architectural enhancements:

  • 4th Gen Tensor Cores: Native FP8 and improved sparsity support
  • Transformer Engine: Automatic mixed-precision optimization for LLMs
  • HBM3 Memory: Nearly 2x the memory bandwidth of the A100

Practical FLOPS Considerations

While the theoretical FLOPS are impressive, real-world performance depends on:

Memory Bandwidth Limitations: Even with exceptional FLOPS, memory-bound workloads won't fully utilize compute capacity. The H100's 3+ TB/s of memory bandwidth helps, but bandwidth remains a consideration for large models.
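A quick roofline-style calculation shows why: a kernel only becomes compute-bound once its arithmetic intensity (FLOPs per byte moved) exceeds peak FLOPS divided by memory bandwidth. The sketch below uses the headline figures from this article and an assumed ~3.35 TB/s of SXM5 HBM3 bandwidth.

```python
# Roofline ridge point: arithmetic intensity (FLOPs per byte) above which a kernel
# becomes compute-bound rather than memory-bound.
PEAK_TFLOPS = {"FP8": 989, "FP16": 495, "FP64": 60}    # headline figures from this article
BANDWIDTH_TBS = 3.35                                   # assumed H100 SXM5 HBM3 bandwidth

for precision, tflops in PEAK_TFLOPS.items():
    ridge = tflops / BANDWIDTH_TBS                     # FLOPs per byte needed to saturate compute
    print(f"{precision}: need > {ridge:.0f} FLOPs per byte moved to be compute-bound")
```

Large, dense matrix multiplies clear these bars easily; element-wise ops and small-batch inference usually do not, which is why they stay memory-bound no matter how many TFLOPS are on paper.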

Precision Trade-offs: FP8 offers maximum FLOPS but requires careful validation for accuracy-sensitive applications. FP16 provides a good balance for most AI workloads.

Tensor Core Utilization: Maximum FLOPS require tensor-optimized operations. Generic compute may not achieve peak performance figures.
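One easy-to-miss detail: matmuls map onto tensor-core tiles most efficiently when their dimensions line up with tile-friendly sizes, commonly cited as multiples of 8 for FP16/BF16 and 16 for FP8/INT8. A small helper to flag awkward shapes; treat the multiples as a rule of thumb rather than a hard guarantee.

```python
def tensor_core_friendly(m: int, n: int, k: int, multiple: int = 8) -> bool:
    """Rule of thumb: matmul dims that are multiples of 8 (FP16/BF16) or 16 (FP8/INT8)
    map cleanly onto tensor-core tiles; odd sizes can fall back to slower paths."""
    return all(dim % multiple == 0 for dim in (m, n, k))

print(tensor_core_friendly(4096, 4096, 4096))   # True  - clean tile mapping
print(tensor_core_friendly(4096, 4093, 4096))   # False - consider padding to 4096
```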

Cost vs FLOPS Analysis

At JarvisLabs, we price the H100 SXM at ₹242.19/hour (about $2.99/hour USD). Here's how the FLOPS/dollar economics work out:

  • FP16 performance: ~165 TFLOPS per dollar per hour
  • FP64 performance: ~20 TFLOPS per dollar per hour

Compared to the A100 at ₹104.49/hour, the H100 provides roughly 2x the FLOPS/dollar for FP16 workloads despite the higher cost.
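The arithmetic behind those per-dollar figures is simple enough to sanity-check; here is a quick sketch using the hourly price and headline TFLOPS quoted above.

```python
# TFLOPS per dollar-hour for the H100 SXM at the hourly price quoted above.
PRICE_USD_PER_HOUR = 2.99
TFLOPS = {"FP16 Tensor": 495, "FP64 Tensor Core": 60}   # headline figures used in this article

for precision, tflops in TFLOPS.items():
    print(f"{precision}: {tflops / PRICE_USD_PER_HOUR:.1f} TFLOPS per dollar-hour")
```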

When H100 FLOPS Matter Most

Choose H100 for maximum FLOPS when:

  • Training large language models: FP8 precision shines for transformer architectures
  • Real-time inference: High FLOPS enable low-latency responses
  • HPC simulations: FP64 Tensor Core acceleration benefits scientific computing
  • Multi-modal AI: High tensor throughput handles complex model architectures

Optimization Tips for Maximum FLOPS

From our experience deploying thousands of H100 instances:

  1. Use mixed precision: Let the Transformer Engine automatically optimize precision (see the training-loop sketch after this list)
  2. Batch size optimization: Larger batches better utilize tensor cores
  3. Memory layout: Ensure tensors are properly aligned for maximum throughput
  4. Framework selection: PyTorch 2.x and TensorFlow 2.13+ have the best H100 optimizations
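Putting tips 1 through 3 together, a minimal PyTorch mixed-precision training step might look like the sketch below; the model, data, and sizes are placeholders, and the Transformer Engine FP8 layers shown earlier can be slotted in the same way.

```python
import torch
from torch import nn

torch.set_float32_matmul_precision("high")        # allow TF32 for any remaining FP32 matmuls

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()              # needed for float16; bfloat16 can skip it

x = torch.randn(256, 4096, device="cuda")         # large batch keeps the tensor cores busy
target = torch.randn(256, 4096, device="cuda")

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()                 # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
```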

Key Takeaways

The H100's FLOPS performance represents a generational leap: 60 teraFLOPS of FP64 Tensor Core compute for scientific computing, nearly 1 petaFLOP of FP8 compute for AI workloads, and architectural improvements that translate theoretical performance into real-world gains.

For most AI practitioners, the H100's combination of FP8 tensor performance and smart precision management delivers the compute power needed for today's ambitious models—and tomorrow's even larger ones.

What specific workload are you optimizing? Understanding your precision requirements and memory patterns can help determine whether the H100's impressive FLOPS translate into meaningful performance improvements for your use case.

Build & Deploy Your AI in Minutes

Get started with JarvisLabs today and experience the power of cloud GPU infrastructure designed specifically for AI development.
