Run Qwen3.6 MTP with llama.cpp on RTX PRO 6000

JarvisLabs.ai

The RTX PRO 6000 Blackwell Server Edition is now available on JarvisLabs, with 96 GB of VRAM and 1597 GB/s of memory bandwidth on a single card. That's enough to fit Qwen3.6 27B Dense or the 35B-A3B MoE at Q8_0 (8-bit quantization) without splitting weights across GPUs.
Good timing too, because llama.cpp just added support for Qwen3.6's multi-token prediction (MTP) heads. It's basically a free speedup: the MTP heads draft a few tokens ahead, the main model checks them in one pass, and you turn it on with two flags.
We ran the numbers on one card: 1.73x faster on Qwen3.6 27B Dense, 1.17x on the 35B-A3B MoE.
| Model | Baseline | MTP | Speedup |
|---|---|---|---|
| Qwen3.6 27B Dense Q8_0 | 45.76 tok/s | 79.37 tok/s | 1.73x |
| Qwen3.6 35B-A3B MoE Q8_0 | 193.36 tok/s | 225.48 tok/s | 1.17x |
This tutorial was prompted by Julien Chaumond's short LinkedIn walkthrough for running the new Qwen3.6 MTP GGUF models in llama.cpp. We ran it on a JarvisLabs RTX PRO 6000.
What you need
Use an RTX PRO 6000 with enough disk for the source build and GGUF downloads. For this tutorial, use 160 GB storage.
| Item | Value |
|---|---|
| GPU | RTX PRO 6000 Blackwell Server Edition |
| Region used | IN1 |
| Storage | 160 GB |
| Server | llama-server from latest llama.cpp source |
| Dense model | ggml-org/Qwen3.6-27B-MTP-GGUF |
| MoE model | ggml-org/Qwen3.6-35B-A3B-MTP-GGUF |
The files below use /home, so the source checkout, build, Hugging Face cache, and benchmark output persist when the instance is paused and resumed.
Excluding the llama.cpp build, the benchmark itself runs in about 8 minutes on a fresh instance. That covers downloading both GGUFs, starting each model, a warmup request per case, and four benchmark passes: dense baseline, dense MTP, MoE baseline, and MoE MTP.
Cached reruns are much faster because the build output and the downloaded GGUFs are already in place under /home.
Create the GPU instance
You can spin one up from the JarvisLabs dashboard, or use our CLI if you prefer the terminal. The rest of this tutorial uses the CLI.
Install and authenticate the JarvisLabs CLI:
uv tool install jarvislabs
jl setup
Create an RTX PRO 6000 instance:
jl create \
--gpu RTX-PRO6000 \
--region IN1 \
--storage 160 \
--name qwen36-mtp \
--http-ports 8080
SSH into the instance:
jl ssh <machine_id>
The remaining commands run inside the JarvisLabs instance once you've SSH'd into it.
Build llama.cpp
Install build dependencies:
export DEBIAN_FRONTEND=noninteractive
apt-get update
apt-get install -y --no-install-recommends \
build-essential \
ca-certificates \
cmake \
curl \
git \
jq \
libcurl4-openssl-dev \
libssl-dev \
python3
Clone and build llama-server:
if [ ! -d /home/llama.cpp/.git ]; then
git clone --depth 1 https://github.com/ggml-org/llama.cpp /home/llama.cpp
fi
cd /home/llama.cpp
git fetch --depth 1 origin master
git checkout FETCH_HEAD
git rev-parse HEAD
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA=ON \
-DLLAMA_OPENSSL=ON \
-DLLAMA_BUILD_TESTS=OFF
cmake --build build --config Release -j"$(nproc)" --target llama-server
We used llama.cpp commit d14ce3dab4de197adec5166faa54ac5db8262f26 in our run. You do not need that exact commit, but you do want a recent source build because Qwen3.6 MTP support is new.
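If you would rather reproduce our numbers against the same source tree, you can check out that commit instead of the tip of master. The following is a minimal sketch. It assumes the remote allows fetching a single commit by its full SHA, and the helper name pin_llama_cpp is our own for illustration, not part of llama.cpp or the JarvisLabs CLI.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check out one exact llama.cpp commit in an existing clone.
# Usage: pin_llama_cpp <40-char-sha> [repo_dir]
pin_llama_cpp() {
  local sha="$1"
  local repo_dir="${2:-/home/llama.cpp}"

  # Refuse anything that is not a full 40-character hex SHA, because a
  # shallow fetch by commit needs the complete object ID.
  if ! printf '%s' "$sha" | grep -Eq '^[0-9a-f]{40}$'; then
    echo "expected a full 40-character commit SHA, got: $sha" >&2
    return 2
  fi

  # Fetch only that commit, then detach HEAD onto it.
  git -C "$repo_dir" fetch --depth 1 origin "$sha" &&
    git -C "$repo_dir" checkout FETCH_HEAD
}
```

Call it as `pin_llama_cpp d14ce3dab4de197adec5166faa54ac5db8262f26`, then rerun the two cmake commands so the binary matches the checked-out source.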
Run Qwen3.6 27B Dense with MTP
Start the dense model with MTP:
/home/llama.cpp/build/bin/llama-server \
-hf ggml-org/Qwen3.6-27B-MTP-GGUF \
-ngl 999 \
-c 4096 \
--host 0.0.0.0 \
--port 8080 \
--jinja \
--spec-type draft-mtp \
--spec-draft-n-max 2
The first launch downloads the model into the Hugging Face cache. Wait until the logs say the server is listening on port 8080.
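Instead of watching the logs, you can poll the server's /health endpoint, which is what our benchmark script does. A minimal sketch of that loop as a reusable function follows. The name wait_for_server and its default arguments are our own choices for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll llama-server's /health endpoint until it answers.
# Usage: wait_for_server [url] [max_tries] [delay_seconds]
wait_for_server() {
  local url="${1:-http://127.0.0.1:8080/health}"
  local max_tries="${2:-900}"
  local delay="${3:-2}"
  local i

  for ((i = 1; i <= max_tries; i++)); do
    # -f makes curl exit non-zero on HTTP error responses, so a server that
    # is reachable but not ready yet still counts as "not ready".
    if curl -sf "$url" >/dev/null 2>&1; then
      echo "server ready after ${i} check(s)"
      return 0
    fi
    sleep "$delay"
  done

  echo "server did not become ready at ${url}" >&2
  return 1
}
```

Run `wait_for_server` with no arguments in the second SSH session before sending the first request. The defaults mirror the benchmark script: up to 900 checks, two seconds apart.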
In another SSH session, send a test completion:
curl -s http://127.0.0.1:8080/completion \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Explain multi-token prediction in one paragraph.",
"n_predict": 256,
"temperature": 0,
"stream": false
}' | jq '.timings'
Look for predicted_per_second in the output. In our benchmark, which generates 512 tokens per request, the dense MTP run averaged 79.37 tok/s after warmup. Your number can differ with prompt length, output length, quantization, server settings, and llama.cpp build.
Compare against baseline
Stop the MTP server:
pkill -f llama-server
Start the same dense model without MTP by removing the speculative flags:
/home/llama.cpp/build/bin/llama-server \
-hf ggml-org/Qwen3.6-27B-MTP-GGUF \
-ngl 999 \
-c 4096 \
--host 0.0.0.0 \
--port 8080 \
--jinja
Run the same curl request again. In our run, the dense baseline averaged 45.76 tok/s, while MTP averaged 79.37 tok/s.
| Run | Tokens | Baseline tok/s | MTP tok/s |
|---|---|---|---|
| 1 | 512 | 45.62 | 79.41 |
| 2 | 512 | 45.79 | 79.41 |
| 3 | 512 | 45.86 | 79.29 |
| Average | 512 | 45.76 | 79.37 |
That is a 1.73x generation speedup on this run.
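The averages and the speedup come straight from the per-run numbers in the table. Here is a small sketch of that arithmetic using awk. The function names mean and speedup are ours.

```shell
#!/usr/bin/env bash
# Average a list of tok/s readings (one per argument), two decimals.
mean() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.2f\n", sum / NR }'
}

# Ratio of MTP throughput to baseline throughput, two decimals.
speedup() {
  awk -v base="$1" -v mtp="$2" 'BEGIN { printf "%.2f\n", mtp / base }'
}

# Per-run values copied from the table above.
baseline_avg="$(mean 45.62 45.79 45.86)"
mtp_avg="$(mean 79.41 79.41 79.29)"

echo "baseline avg: ${baseline_avg} tok/s"
echo "mtp avg:      ${mtp_avg} tok/s"
echo "speedup:      $(speedup "$baseline_avg" "$mtp_avg")x"
```

With the table values this prints 45.76, 79.37, and 1.73x. Swap in your own predicted_per_second readings to compare against your run.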
Try the MoE model
The 35B-A3B MoE model is already much faster than the dense model. For MTP, use --spec-draft-n-max 3:
pkill -f llama-server
/home/llama.cpp/build/bin/llama-server \
-hf ggml-org/Qwen3.6-35B-A3B-MTP-GGUF \
-ngl 999 \
-c 4096 \
--host 0.0.0.0 \
--port 8080 \
--jinja \
--spec-type draft-mtp \
--spec-draft-n-max 3
In our run, MTP still helped the MoE model, but the relative gain was smaller:
| Model | Baseline avg | MTP avg | Speedup |
|---|---|---|---|
| Qwen3.6 35B-A3B MoE Q8_0 | 193.36 tok/s | 225.48 tok/s | 1.17x |
Why does the MoE gain so much less? The 35B-A3B model has 35B total parameters but only 3B active per token, so its baseline decoding is already cheap. Speculative decoding works by saving target-model cost, and when that cost is already small, there is less left for MTP to save. We saw the same pattern when we benchmarked MTP on Gemma 4: the dense 31B got a big MTP boost, while the 26B-A4B MoE gained much less for the same reason.
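One rough way to put numbers on "already cheap" is a memory-bandwidth ceiling: during single-stream decoding, each generated token reads roughly all active weights from VRAM once. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes Q8_0 stores about 8.5 bits per weight, uses the 27B and 3B active parameter counts quoted above, and ignores the KV cache, attention, and any tensors kept at other precisions.

```shell
#!/usr/bin/env bash
# Rough decode ceiling: tok/s <= bandwidth / bytes of weights read per token.
# Arguments: active parameters in billions, memory bandwidth in GB/s.
# Assumes Q8_0 at ~8.5 bits per weight; everything else is ignored.
bandwidth_ceiling() {
  awk -v params_b="$1" -v bw_gbs="$2" 'BEGIN {
    gb_per_token = params_b * 8.5 / 8       # GB of weights touched per token
    printf "%.0f\n", bw_gbs / gb_per_token  # upper bound in tok/s
  }'
}

# Share of that ceiling a measured throughput reaches, as a whole percent.
ceiling_share() {
  awk -v measured="$1" -v ceiling="$2" 'BEGIN { printf "%.0f\n", 100 * measured / ceiling }'
}

dense_ceiling="$(bandwidth_ceiling 27 1597)"
moe_ceiling="$(bandwidth_ceiling 3 1597)"

echo "dense: ceiling ~${dense_ceiling} tok/s, baseline reaches $(ceiling_share 45.76 "$dense_ceiling")%"
echo "moe:   ceiling ~${moe_ceiling} tok/s, baseline reaches $(ceiling_share 193.36 "$moe_ceiling")%"
```

Under these assumptions the dense baseline lands at roughly 82% of a ~56 tok/s ceiling, while the MoE baseline reaches about 39% of a ~501 tok/s ceiling. That fits the explanation above: the dense model spends most of each step reading weights, which is the cost that verifying several drafted tokens per pass amortizes, whereas the MoE's per-token weight traffic is already small.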
llama.cpp reports draft_n and draft_n_accepted in the /completion response timings, and our benchmark script logs them to the CSV. From our 512-token requests:
| Model | n_max | draft_n | accepted | Acceptance |
|---|---|---|---|---|
| Qwen3.6 27B Dense | 2 | 442 | 289 | 65% |
| Qwen3.6 35B-A3B MoE | 3 | 578 | 318 | 55% |
Acceptance is workload-dependent. Code and math prompts can push it above 80%, and open prose pulls it lower. The Hugging Face model cards recommend --spec-draft-n-max 2 or 3. Higher values give diminishing returns as more drafts get rejected.
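The acceptance column is accepted drafts divided by drafted tokens. The sketch below recomputes it and adds a rough estimate of tokens emitted per target-model pass. That second figure rests on our own assumption, not on anything llama.cpp documents: that every pass drafts exactly n_max tokens and emits its accepted drafts plus one token from the target model.

```shell
#!/usr/bin/env bash
# Acceptance rate as a whole percent: accepted draft tokens / drafted tokens.
acceptance_pct() {
  awk -v drafted="$1" -v accepted="$2" 'BEGIN { printf "%.0f\n", 100 * accepted / drafted }'
}

# Estimated tokens emitted per target-model pass.
# Assumption (ours): every pass drafts exactly n_max tokens, so
# passes = drafted / n_max, and each pass emits its accepted drafts
# plus one token produced by the target model itself.
tokens_per_pass() {
  awk -v drafted="$1" -v accepted="$2" -v n_max="$3" 'BEGIN {
    passes = drafted / n_max
    printf "%.2f\n", 1 + accepted / passes
  }'
}

# draft_n, draft_n_accepted, and n_max copied from the table above.
echo "dense: $(acceptance_pct 442 289)% accepted, ~$(tokens_per_pass 442 289 2) tokens per pass"
echo "moe:   $(acceptance_pct 578 318)% accepted, ~$(tokens_per_pass 578 318 3) tokens per pass"
```

This gives 65% and about 2.31 tokens per pass for the dense model, and 55% and about 2.65 for the MoE. As a sanity check on the assumption, the dense numbers imply 221 passes, and 289 + 221 = 510 is close to the 512 tokens generated. The measured speedups (1.73x and 1.17x) are lower than the per-pass figures, which is consistent with drafting and verification carrying their own cost.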
So treat MTP as workload-dependent. The multiplier moves with model size, active params, quantization, draft-n setting, prompt shape, and hardware. On this RTX PRO 6000 run, dense Qwen3.6 gained much more than the MoE in relative terms.
Run the full benchmark script
If you want to reproduce our four-way comparison, save this locally as bench_qwen36_mtp.sh and run it with jl run. It builds llama.cpp, runs dense baseline, dense MTP, MoE baseline, and MoE MTP, and writes a CSV for each case.
#!/usr/bin/env bash
set -euo pipefail
export DEBIAN_FRONTEND=noninteractive
apt-get update
apt-get install -y --no-install-recommends \
build-essential ca-certificates cmake curl git jq \
libcurl4-openssl-dev libssl-dev python3
if [ ! -d /home/llama.cpp/.git ]; then
git clone --depth 1 https://github.com/ggml-org/llama.cpp /home/llama.cpp
fi
cd /home/llama.cpp
git fetch --depth 1 origin master
git checkout FETCH_HEAD
git rev-parse HEAD
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA=ON \
-DLLAMA_OPENSSL=ON \
-DLLAMA_BUILD_TESTS=OFF
cmake --build build --config Release -j"$(nproc)" --target llama-server
SERVER=/home/llama.cpp/build/bin/llama-server
RESULT_DIR=/home/qwen36-mtp-results
mkdir -p "$RESULT_DIR"
PROMPT='Write a practical, technical explanation of multi-token prediction for local LLM inference. Include what changes in the decoding loop, why throughput improves, when quality can regress, and how an engineer should benchmark it on a single GPU.'
# Start llama-server for one case and block until /health responds.
# spec_n = 0 means baseline (no speculative flags).
start_server() {
  local name="$1"
  local repo="$2"
  local spec_n="$3"
  # Make sure no previous server is still holding the port or the GPU.
  pkill -f llama-server || true
  sleep 3
  local args=(
    "$SERVER"
    -hf "$repo"
    -ngl 999
    -c 4096
    --host 127.0.0.1
    --port 8080
    --jinja
  )
  if [ "$spec_n" != "0" ]; then
    args+=(--spec-type draft-mtp --spec-draft-n-max "$spec_n")
  fi
  echo "=== ${name} ==="
  "${args[@]}" >"$RESULT_DIR/${name}_server.log" 2>&1 &
  echo $! > "$RESULT_DIR/${name}.pid"
  # Poll for up to 30 minutes; the first launch includes the GGUF download.
  for _ in $(seq 1 900); do
    if curl -sf http://127.0.0.1:8080/health >/dev/null; then
      echo "server ready: ${name}"
      return 0
    fi
    sleep 2
  done
  echo "server did not become ready: ${name}"
  tail -200 "$RESULT_DIR/${name}_server.log"
  exit 1
}
# Send one 512-token completion request and print its timings as a CSV row.
run_completion() {
  local name="$1"
  local run_idx="$2"
  local out_file="$RESULT_DIR/${name}_run${run_idx}.json"
  jq -n --arg prompt "$PROMPT" '{
    prompt: $prompt,
    n_predict: 512,
    temperature: 0,
    cache_prompt: false,
    stream: false
  }' | curl -sS http://127.0.0.1:8080/completion \
    -H 'Content-Type: application/json' \
    --data-binary @- > "$out_file"
  jq -r --arg name "$name" --arg run "$run_idx" '
    .timings as $t |
    [$name, $run,
    ($t.predicted_n // 0),
    ($t.predicted_ms // 0),
    ($t.predicted_per_second // 0),
    ($t.draft_n // 0),
    ($t.draft_n_accepted // 0)] | @csv
  ' "$out_file"
}
# Run one case end to end: start the server, warm up, then three timed runs.
benchmark_case() {
  local name="$1"
  local repo="$2"
  local spec_n="$3"
  start_server "$name" "$repo" "$spec_n"
  echo "model,run,predicted_tokens,predicted_ms,predicted_tok_s,draft_n,draft_n_accepted" | tee "$RESULT_DIR/${name}.csv"
  # The warmup row is written to the CSV too; exclude it when averaging.
  run_completion "$name" "warmup" | tee -a "$RESULT_DIR/${name}.csv"
  for run in 1 2 3; do
    run_completion "$name" "$run" | tee -a "$RESULT_DIR/${name}.csv"
  done
  pkill -F "$RESULT_DIR/${name}.pid" || true
  sleep 3
}
benchmark_case dense_baseline ggml-org/Qwen3.6-27B-MTP-GGUF 0
benchmark_case dense_mtp ggml-org/Qwen3.6-27B-MTP-GGUF 2
benchmark_case moe_baseline ggml-org/Qwen3.6-35B-A3B-MTP-GGUF 0
benchmark_case moe_mtp ggml-org/Qwen3.6-35B-A3B-MTP-GGUF 3
Run it on one RTX PRO 6000. jl run takes a local script, spins up a fresh instance, uploads the file, runs it on the GPU, and pauses the instance when it's done, so you don't have to manage the lifecycle yourself.
chmod +x bench_qwen36_mtp.sh
jl run bench_qwen36_mtp.sh \
--gpu RTX-PRO6000 \
--region IN1 \
--storage 160 \
--name qwen36-mtp-rtxpro6000
The CSV files are written under /home/qwen36-mtp-results on the instance. Pull them down to your machine with:
jl download <machine_id> /home/qwen36-mtp-results ./qwen36-mtp-results -r
Wrapping up
MTP in llama.cpp is basically a free speedup if your model ships the heads. On a single RTX PRO 6000 we measured 1.73x on Qwen3.6 27B Dense and 1.17x on the 35B-A3B MoE.
Both models fit on a single 96 GB card with room to spare, which is why the benchmark stays in one file. No tensor parallelism, no weight offload, just one flag to swap models. If you work with dense or MoE models in this size class, a single big card removes a lot of orchestration work. The RTX PRO 6000 is available on JarvisLabs.

