Blog

GPU Cloud Engineering & Guides

Engineering deep dives, GPU optimization guides, and practical tutorials from the JarvisLabs team.

Expert Parallelism and Mixed Parallelism Strategies in vLLM
Engineering · Latest

A deep dive into Expert Parallelism (EP) and mixed PP+DP, PP+TP strategies in vLLM, with H100 benchmarks on Qwen3.5-35B-A3B and dense models.

Jaydev Tonde·Apr 13, 2026·35 min read
AI Videos, Music, and 3D Models from Your Terminal — ComfyUI on Cloud GPUs with Claude Code
Engineering

A step-by-step guide to running ComfyUI workflows on cloud GPUs using the JarvisLabs CLI and Claude Code. Generate videos from photos, music from text, and 3D models from images — across multiple GPUs in parallel — all without leaving your terminal.

Mar 23·9 min read
Introducing the JarvisLabs CLI: Let Your Agents Run the GPUs
Engineering

Introducing jl, a command-line interface for the JarvisLabs GPU cloud built for both humans and coding agents. Provision GPUs, run training jobs, monitor experiments, and let your agents handle the infrastructure.

Mar 19·15 min read
How We Made GPU Instance Launch 4x Faster
Engineering

From 8 seconds to 1.8 — how we tore apart every layer of our instance creation pipeline in three days to make GPU launches feel instant.

Mar 10·14 min read
Scaling LLM Inference: Data, Pipeline & Tensor Parallelism in vLLM
Engineering

Learn how to scale LLM inference using data parallelism, pipeline parallelism, and tensor parallelism in vLLM. Practical guide with A100 GPU benchmarks comparing DP vs PP vs TP.

Mar 5·54 min read
vLLM Optimization Techniques: 5 Practical Methods to Improve Performance
Engineering

Learn 5 practical vLLM optimization methods: prefix caching, FP8 KV-cache, CPU offloading, disaggregated prefill/decode, and zero-reload sleep mode, with benchmark-backed guidance.

Feb 6·26 min read
Disaggregated Prefill-Decode: The Architecture Behind Meta's LLM Serving
Engineering

Part 1 of my LLM optimization research series. Exploring how Meta's disaggregated prefill-decode strategy separates prompt processing from token generation, and what it means for JarvisLabs.

Jan 29·11 min read
The Complete Guide to LLM Quantization with vLLM: Benchmarks & Best Practices
Engineering

Compare AWQ, GPTQ, Marlin, GGUF, and bitsandbytes with real benchmarks on Qwen2.5-32B on an H200 GPU: 4-bit quantization tested for perplexity, HumanEval accuracy, and inference speed.

Jan 7·47 min read
Deploying MiniMax M2.1 with vLLM: Complete Guide for Agentic Workloads
Engineering

Learn how to deploy MiniMax M2.1 with vLLM for agentic workloads and coding assistants. Covers hardware requirements, tensor/expert parallelism, benchmarking on InstructCoder, tool calling with interleaved thinking, and integration with Claude Code, Cline, and Cursor.

Dec 26·10 min read
Speculative Decoding in vLLM: Complete Guide to Faster LLM Inference
Engineering

Learn how to speed up LLM inference by 1.4-1.6x using speculative decoding in vLLM. This guide covers Draft Models, N-Gram Matching, Suffix Decoding, MLP Speculators, and EAGLE-3 with real benchmarks on Llama-3.1-8B and Llama-3.3-70B.

Dec 18·34 min read
CUDA Cores Explained
Engineering

A deep dive into CUDA cores, Tensor Cores, precision modes, and other specialized GPU features that impact performance.

Dec 8·7 min read
How to Run FLUX AI Image Generator with ComfyUI: Complete Setup Guide
Engineering

Step-by-step guide to set up and run FLUX.1 Schnell for AI image generation using ComfyUI on cloud GPUs. Includes workflows, LoRA integration, and practical examples.

Nov 20·4 min read
Uncensored LLM Models: A Complete Guide to Unfiltered AI Language Models
Engineering

Explore uncensored LLM models, their differences from ChatGPT, and how they're built. Learn about foundation models, fine-tuning, and running unfiltered AI models locally.

Nov 20·4 min read
ML Experiment Tracking: Complete Guide to W&B and Hydra
Engineering

Learn how to effectively track and manage ML experiments using Weights & Biases (W&B) and Hydra. A comprehensive guide for machine learning practitioners and researchers.

Nov 20·22 min read
ComfyUI Prompt Enhancement Guide: Using Ollama and LLMs for Better AI Image Generation
Engineering

Learn how to improve your Stable Diffusion prompts using Ollama and LLMs in ComfyUI. Step-by-step guide to setup, workflow, and best practices for enhanced AI image generation.

Nov 20·3 min read
Flux AI Image Generator Tutorial: Setup Guide for Cloud GPU (2024)
Engineering

Step-by-step guide to install and run Flux.1 AI image generator on cloud GPU. Learn how to generate high-quality AI images using Flux's open-source model with detailed setup instructions and examples.

Aug 14·4 min read
Create AI Training Datasets with Fooocus: Face Swap and Pose Matching Guide
Engineering

Step-by-step guide to creating custom AI training datasets using Fooocus's face swap and pose matching features for Stable Diffusion model fine-tuning.

Mar 23·3 min read
How to Deploy and Connect with Ollama LLM Models: A Comprehensive Guide
Engineering

Learn how to effectively deploy and interact with Ollama LLM models using terminal commands, local clients, and REST APIs. Discover tips for choosing the right GPU, managing storage, and troubleshooting common issues.

Mar 18·3 min read
Boost PyTorch Performance with Hugging Face Accelerate: Multi-GPU & Mixed Precision Training
Engineering

Discover how to enhance your PyTorch scripts using Hugging Face Accelerate for efficient multi-GPU and mixed precision training. Learn setup, configuration, and code adaptation for faster deep learning model training.

Feb 7·5 min read
How to Train Billion-Parameter NLP Models on One GPU with DeepSpeed and HuggingFace
Engineering

Learn how to train large language models efficiently using DeepSpeed and HuggingFace Trainer. This step-by-step guide shows you how to optimize GPU memory and train 10B+ parameter models on a single GPU using ZeRO-Offload.

Feb 1·5 min read
Build a Toxic Comment Classifier with RoBERTa and PyTorch Lightning | Complete Tutorial
Engineering

Learn how to build a toxic comment classifier using RoBERTa and PyTorch Lightning. This step-by-step tutorial covers mixed precision training, multi-GPU setup, and Weights & Biases integration for ML model tracking.

Dec 8·15 min read
ResNet-50 Performance Optimization: Modern Training Techniques to Achieve 80.4% Accuracy
Engineering

Learn how to boost ResNet-50 accuracy from 75.3% to 80.4% using advanced training techniques, including BCE loss, data augmentation, and optimization strategies. A comprehensive guide to modern CNN training best practices.

Oct 18·5 min read