GPU Cloud Engineering & Guides
Engineering deep dives, GPU optimization guides, and practical tutorials from the JarvisLabs team.

Expert Parallelism and Mixed Parallelism Strategies in vLLM
A deep dive into Expert Parallelism (EP) and mixed PP+DP and PP+TP strategies in vLLM, with H100 benchmarks on Qwen3.5-35B-A3B and dense models.

AI Videos, Music, and 3D Models from Your Terminal — ComfyUI on Cloud GPUs with Claude Code
A step-by-step guide to running ComfyUI workflows on cloud GPUs using the JarvisLabs CLI and Claude Code. Generate videos from photos, music from text, and 3D models from images — across multiple GPUs in parallel — all without leaving your terminal.

Introducing the JarvisLabs CLI: Let Your Agents Run the GPUs
Introducing jl, a command-line interface for the JarvisLabs GPU cloud built for both humans and coding agents. Provision GPUs, run training jobs, monitor experiments, and let your agents handle the infrastructure.

How We Made GPU Instance Launch 4x Faster
From 8 seconds to 1.8 — how we tore apart every layer of our instance creation pipeline in three days to make GPU launches feel instant.

Scaling LLM Inference: Data, Pipeline & Tensor Parallelism in vLLM
Learn how to scale LLM inference using data parallelism, pipeline parallelism, and tensor parallelism in vLLM. Practical guide with A100 GPU benchmarks comparing DP vs PP vs TP.

vLLM Optimization Techniques: 5 Practical Methods to Improve Performance
Learn 5 practical vLLM optimization methods: prefix caching, FP8 KV-cache, CPU offloading, disaggregated prefill/decode, and zero-reload sleep mode, with benchmark-backed guidance.

Disaggregated Prefill-Decode: The Architecture Behind Meta's LLM Serving
Part 1 of my LLM optimization research series. Exploring how Meta's disaggregated prefill-decode strategy separates prompt processing from token generation — and what it means for JarvisLabs.

The Complete Guide to LLM Quantization with vLLM: Benchmarks & Best Practices
Complete guide to LLM quantization with vLLM. Compare AWQ, GPTQ, Marlin, GGUF, and bitsandbytes with real benchmarks on Qwen2.5-32B using an H200 GPU — 4-bit quantization tested for perplexity, HumanEval accuracy, and inference speed.

Deploying MiniMax M2.1 with vLLM: Complete Guide for Agentic Workloads
Learn how to deploy MiniMax M2.1 with vLLM for agentic workloads and coding assistants. Covers hardware requirements, tensor/expert parallelism, benchmarking on InstructCoder, tool calling with interleaved thinking, and integration with Claude Code, Cline, and Cursor.

Speculative Decoding in vLLM: Complete Guide to Faster LLM Inference
Learn how to speed up LLM inference by 1.4-1.6x using speculative decoding in vLLM. This guide covers Draft Models, N-Gram Matching, Suffix Decoding, MLP Speculators, and EAGLE-3 with real benchmarks on Llama-3.1-8B and Llama-3.3-70B.

CUDA Cores Explained
A deep dive into CUDA cores, Tensor Cores, precision modes, and other specialized GPU features that impact performance.

How to Run FLUX AI Image Generator with ComfyUI: Complete Setup Guide
Step-by-step guide to set up and run FLUX.1 Schnell for AI image generation using ComfyUI on cloud GPUs. Includes workflows, LoRA integration, and practical examples.

Uncensored LLM Models: A Complete Guide to Unfiltered AI Language Models
Explore uncensored LLM models, their differences from ChatGPT, and how they're built. Learn about foundation models, fine-tuning, and running unfiltered AI models locally.

ML Experiment Tracking: Complete Guide to W&B and Hydra
Learn how to effectively track and manage ML experiments using Weights & Biases (W&B) and Hydra. A comprehensive guide for machine learning practitioners and researchers.

ComfyUI Prompt Enhancement Guide: Using Ollama and LLMs for Better AI Image Generation
Learn how to improve your Stable Diffusion prompts using Ollama and LLMs in ComfyUI. Step-by-step guide to setup, workflow, and best practices for enhanced AI image generation.

Flux AI Image Generator Tutorial: Setup Guide for Cloud GPU (2024)
Step-by-step guide to install and run Flux.1 AI image generator on cloud GPU. Learn how to generate high-quality AI images using Flux's open-source model with detailed setup instructions and examples.

Create AI Training Datasets with Fooocus: Face Swap and Pose Matching Guide
Step-by-step guide to creating custom AI training datasets using Fooocus's face swap and pose matching features for Stable Diffusion model finetuning.

How to Deploy and Connect with Ollama LLM Models: A Comprehensive Guide
Learn how to effectively deploy and interact with Ollama LLM models using terminal commands, local clients, and REST APIs. Discover tips for choosing the right GPU, managing storage, and troubleshooting common issues.

Boost PyTorch Performance with Hugging Face Accelerate: Multi-GPU & Mixed Precision Training
Discover how to enhance your PyTorch scripts using Hugging Face Accelerate for efficient multi-GPU and mixed precision training. Learn setup, configuration, and code adaptation for faster deep learning model training.

How to Train Billion-Parameter NLP Models on One GPU with DeepSpeed and HuggingFace
Learn how to train large language models efficiently using DeepSpeed and HuggingFace Trainer. This step-by-step guide shows you how to optimize GPU memory and train 10B+ parameter models on a single GPU using ZeRO-Offload.

Build a Toxic Comment Classifier with RoBERTa and PyTorch Lightning: Complete Tutorial
Learn how to build a toxic comment classifier using RoBERTa and PyTorch Lightning. This step-by-step tutorial covers mixed precision training, multi-GPU setup, and Weights & Biases integration for ML model tracking.

ResNet-50 Performance Optimization: Modern Training Techniques to Achieve 80.4% Accuracy
Learn how to boost ResNet-50 accuracy from 75.3% to 80.4% using advanced training techniques, including BCE loss, data augmentation, and optimization strategies. A comprehensive guide to modern CNN training best practices.