What is the FLOPS Performance of the NVIDIA H100 GPU?
The NVIDIA H100 delivers exceptional FLOPS performance across precisions: up to 1,979 TFLOPS of dense FP8 Tensor Core throughput (3,958 TFLOPS with sparsity), 989 TFLOPS for FP16, and 67 TFLOPS of FP64 Tensor Core computing. That double-precision figure is roughly a 3x improvement over the A100.
Understanding H100 FLOPS Performance
FLOPS (floating-point operations per second) is the key metric for measuring raw GPU compute performance. Per NVIDIA, the H100 roughly triples the double-precision throughput of the A100's Tensor Cores, delivering about 67 TFLOPS of FP64 Tensor Core compute for HPC.
The H100's theoretical peak approaches 4 petaFLOPS of FP8 AI compute with sparsity (roughly 2 petaFLOPS dense), designed for next-gen large-scale AI applications. That's nearly 4,000 TFLOPS, a staggering number that showcases the H100's capability for the most demanding AI workloads.
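Peak numbers like these are theoretical; a quick way to see what your stack actually achieves is to time a large half-precision matmul. Here is a minimal PyTorch sketch (the matrix size, iteration count, and quoted peak are illustrative assumptions):

```python
import time
import torch

def measured_tflops(n: int = 8192, iters: int = 50) -> float:
    """Time n x n FP16 matmuls and convert to achieved TFLOPS."""
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    for _ in range(5):            # warm-up so cuBLAS settles on its kernels
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return 2 * n**3 * iters / elapsed / 1e12   # one matmul is ~2*n^3 FLOPs

print(f"Achieved FP16: {measured_tflops():.0f} TFLOPS (dense SXM5 peak ~989)")
```

Well-tuned GEMMs typically land within 70-90% of the dense peak; results far below that usually point to input shapes the Tensor Cores cannot tile efficiently.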
H100 FLOPS Breakdown by Precision
| Precision | H100 SXM5 (dense) | H100 PCIe (dense) | Use Case |
|---|---|---|---|
| FP8 Tensor Core | ~1,979 TFLOPS | ~1,513 TFLOPS | Large language models, transformers |
| FP16 Tensor Core | ~989 TFLOPS | ~756 TFLOPS | AI training, mixed precision |
| BF16 Tensor Core | ~989 TFLOPS | ~756 TFLOPS | AI training, numerically stable training |
| TF32 Tensor Core | ~495 TFLOPS | ~378 TFLOPS | Zero-code AI acceleration |
| FP32 | ~67 TFLOPS | ~51 TFLOPS | General compute, legacy workloads |
| FP64 Tensor Core | ~67 TFLOPS | ~51 TFLOPS | HPC, scientific computing |
| FP64 | ~34 TFLOPS | ~26 TFLOPS | Double precision without Tensor Cores |

Structured sparsity doubles the Tensor Core figures above (e.g., ~3,958 FP8 TFLOPS on the SXM5).
The standout feature is the FP64 Tensor Core path: up to approximately 67 TFLOPS on the SXM5 variant, roughly double the ~34 TFLOPS available through the standard FP64 pipeline and a significant win for HPC codes built on dense linear algebra.
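The "zero-code AI acceleration" label on the TF32 row is also worth unpacking: frameworks can route plain FP32 matmuls through the Tensor Cores at TF32 precision without any model changes. A minimal sketch using PyTorch's switches (matrix sizes are illustrative):

```python
import torch

# Opt ordinary FP32 matmuls into TF32 Tensor Core execution.
torch.set_float32_matmul_precision("high")
# Equivalent lower-level flag: torch.backends.cuda.matmul.allow_tf32 = True

x = torch.randn(4096, 4096, device="cuda")   # plain FP32 tensors
y = torch.randn(4096, 4096, device="cuda")
z = x @ y   # runs on Tensor Cores at TF32 precision, no model changes
```

TF32 keeps FP32's 8-bit exponent (and thus its dynamic range) while trading mantissa bits for throughput, which is why it is generally safe to enable for training.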
SXM5 vs PCIe Performance Comparison
The form factor dramatically affects FLOPS performance:
H100 SXM5 (High-End)
- 16,896 CUDA cores (shading units), 528 Tensor Cores
- 700W TDP enables maximum performance
- 3.35 TB/s of HBM3 memory bandwidth
- Optimized for multi-GPU deployments
H100 PCIe (Mainstream)
- 14,592 CUDA cores (shading units), 456 Tensor Cores
- 350W TDP for standard server configurations
- 2.0 TB/s of HBM2e memory bandwidth
- Roughly 65-75% of SXM5 performance at half the power draw
From my experience, the SXM5 sustains roughly 1.3-2x higher FLOPS than the PCIe variant depending on precision and workload; the extra power budget and Tensor Core count make a tangible difference.
How H100 FLOPS Translate to Real Performance
Raw FLOPS numbers tell only part of the story. NVIDIA reports up to a 9x speedup in training and up to 30x higher inference throughput on large transformer models compared to the A100.
The fourth-generation Tensor Cores enable these gains through:
- FP8 precision support: Unprecedented throughput for transformer models
- Improved sparsity handling: Better utilization of sparse tensors common in AI
- Enhanced memory hierarchy: a 50 MB L2 cache keeps larger portions of models and activations on-chip (an FP8 Transformer Engine sketch follows this list)
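Using FP8 in practice goes through NVIDIA's Transformer Engine library, which provides FP8-aware layers and manages scaling factors for you. A minimal sketch, assuming the `transformer_engine` package is installed (layer and batch sizes are illustrative):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling

fp8_recipe = DelayedScaling()          # default delayed-scaling FP8 recipe
layer = te.Linear(4096, 4096).cuda()   # drop-in replacement for nn.Linear

x = torch.randn(16, 4096, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(x)                     # GEMM executes in FP8 on Tensor Cores
out.sum().backward()                   # gradients flow as usual
```

The recipe tracks per-tensor amax history and chooses scale factors so FP8's narrow range doesn't clip activations or gradients.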
Architectural Improvements Behind the FLOPS
Per NVIDIA's Hopper whitepaper, the H100 SM quadruples the A100's peak per-SM floating-point throughput thanks to the introduction of FP8, and doubles the A100's raw per-SM rate on all previous Tensor Core, FP32, and FP64 data types.
Key architectural enhancements:
- 4th Gen Tensor Cores: Native FP8 and improved sparsity support
- Transformer Engine: Automatic mixed-precision optimization for LLMs
- HBM3 Memory: 3.35 TB/s on the SXM5, roughly 1.7x the 2 TB/s of the A100 80GB
Practical FLOPS Considerations
While the theoretical FLOPS are impressive, real-world performance depends on:
Memory Bandwidth Limitations: Even with exceptional FLOPS, memory-bound workloads won't fully utilize compute capacity. The H100's 3.35 TB/s of memory bandwidth helps, but data movement remains a consideration for large models; a quick roofline-style check is sketched below.
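The sketch below makes the compute-vs-bandwidth question concrete using the dense FP16 and HBM3 figures quoted in this article; the example kernels are illustrative:

```python
PEAK_TFLOPS = 989        # dense FP16 Tensor Core, H100 SXM5
PEAK_BW_TBS = 3.35       # HBM3 bandwidth, H100 SXM5

# FLOPs per byte needed to saturate compute (~295 for these peaks).
RIDGE = PEAK_TFLOPS / PEAK_BW_TBS

def bound(flops: float, bytes_moved: float) -> str:
    """Classify a kernel by its arithmetic intensity (FLOPs per byte)."""
    return "compute-bound" if flops / bytes_moved > RIDGE else "memory-bound"

n = 8192
print(bound(2 * n**3, 3 * n**2 * 2))   # large FP16 GEMM -> compute-bound
print(bound(1e9, 6e9))                 # elementwise add, 1e9 FP16 values -> memory-bound
```

Large GEMMs clear the ridge point easily; elementwise and attention-adjacent memory traffic usually does not, which is why fused kernels matter.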
Precision Trade-offs: FP8 offers maximum FLOPS but requires careful validation for accuracy-sensitive applications. FP16 with automatic mixed precision provides a good balance for most AI workloads, as the sketch after this paragraph shows.
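In PyTorch, that balance is a few lines of automatic mixed precision: matmuls run in FP16 on the Tensor Cores while a gradient scaler guards against underflow. A minimal sketch with illustrative sizes:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()     # rescales loss to protect FP16 grads

x = torch.randn(64, 1024, device="cuda")
with torch.autocast("cuda", dtype=torch.float16):
    loss = model(x).square().mean()      # forward pass runs in FP16 where safe
scaler.scale(loss).backward()
scaler.step(opt)                         # unscales grads, skips step on inf/nan
scaler.update()
```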
Tensor Core Utilization: Maximum FLOPS require tensor-optimized operations. Generic compute may not achieve peak performance figures.
Cost vs FLOPS Analysis
At JarvisLabs, we price the H100 SXM at ₹242.19/hour (about $2.99/hour USD). Here's how the FLOPS/dollar economics work out:
- FP16 performance: ~331 dense TFLOPS per dollar-hour
- FP64 Tensor performance: ~22 TFLOPS per dollar-hour
Compared to the A100 at ₹104.49/hour, the H100 provides roughly 1.4x the FP16 FLOPS per dollar despite the higher hourly price; the arithmetic is sketched below.
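Here is that arithmetic as a sketch (the ~81 INR-to-USD rate and the A100's 312 dense FP16 TFLOPS are assumptions for illustration):

```python
INR_PER_USD = 81.0   # assumed conversion rate

gpus = {
    # name:       (dense FP16 TFLOPS, INR per hour)
    "H100 SXM5": (989, 242.19),
    "A100":      (312, 104.49),
}
for name, (tflops, inr_per_hr) in gpus.items():
    usd_per_hr = inr_per_hr / INR_PER_USD
    print(f"{name}: {tflops / usd_per_hr:.0f} dense FP16 TFLOPS per dollar-hour")
# H100 SXM5: ~331, A100: ~242 -> roughly 1.4x in the H100's favor
```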
When H100 FLOPS Matter Most
Choose H100 for maximum FLOPS when:
- Training large language models: FP8 precision shines for transformer architectures
- Real-time inference: High FLOPS enable low-latency responses
- HPC simulations: FP64 Tensor Core acceleration benefits scientific computing
- Multi-modal AI: High tensor throughput handles complex model architectures
Optimization Tips for Maximum FLOPS
From our experience deploying thousands of H100 instances:
- Use mixed precision: Let the Transformer Engine automatically optimize precision
- Batch size optimization: Larger batches better utilize tensor cores
- Memory layout: Ensure tensors are properly aligned for maximum throughput
- Framework selection: PyTorch 2.x and TensorFlow 2.13+ have the best H100 optimizations (a short PyTorch sketch of the alignment and compilation tips follows)
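Two of those tips in code form: pad GEMM dimensions to Tensor Core friendly multiples, and let PyTorch 2.x's compiler fuse the surrounding ops. A minimal sketch with illustrative sizes (the vocab number mirrors GPT-2's 50,257):

```python
import torch

def pad_to_multiple(dim: int, multiple: int = 16) -> int:
    """Round up so GEMM dims stay Tensor Core eligible (8 for FP16, 16 for FP8)."""
    return ((dim + multiple - 1) // multiple) * multiple

vocab = pad_to_multiple(50257)           # 50257 -> 50272
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, vocab),
).cuda().half()

compiled = torch.compile(model)          # PyTorch 2.x: fuses ops around the GEMMs
out = compiled(torch.randn(32, 1024, device="cuda", dtype=torch.float16))
```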
Key Takeaways
The H100's FLOPS performance represents a generational leap: roughly 67 TFLOPS of double-precision (FP64) Tensor Core compute for scientific computing, nearly 4 petaFLOPS of sparse FP8 for AI workloads, and architectural improvements that translate theoretical performance into real-world gains.
For most AI practitioners, the H100's combination of FP8 tensor performance and smart precision management delivers the compute power needed for today's ambitious models—and tomorrow's even larger ones.
What specific workload are you optimizing? Understanding your precision requirements and memory patterns can help determine whether the H100's impressive FLOPS translate into meaningful performance improvements for your use case.
Build & Deploy Your AI in Minutes
Get started with JarvisLabs today and experience the power of cloud GPU infrastructure designed specifically for AI development.