Run Qwen3.6 MTP with llama.cpp on RTX PRO 6000

Team JarvisLabs

JarvisLabs.ai

May 19, 2026·9 min read

The RTX PRO 6000 Blackwell Server Edition is now available on JarvisLabs, with 96 GB of VRAM and 1597 GB/s of memory bandwidth on a single card. That's enough to fit Qwen3.6 27B Dense or the 35B-A3B MoE at Q8_0 (8-bit quantization) without splitting weights across GPUs.
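As a rough sanity check on that claim, the sketch below estimates weights-only size at Q8_0. It assumes the nominal parameter counts from the model names (27B and 35B) and the Q8_0 block layout of 32 int8 weights plus one fp16 scale, which comes to 34 bytes per 32 weights. The function name `q8_0_weights_gb` is our own, and the estimate ignores the KV cache, the MTP heads, and any tensors stored at other precisions.

```shell
# Weights-only size estimate for a Q8_0 GGUF (an approximation, not a measurement).
# Assumption: every weight is stored as Q8_0, i.e. 34 bytes per block of 32 weights.
q8_0_weights_gb() {
  # $1: nominal parameter count in billions, so the result is in GB (10^9 bytes)
  awk -v params_b="$1" 'BEGIN { printf "%.1f\n", params_b * 34 / 32 }'
}

echo "27B dense: $(q8_0_weights_gb 27) GB"
echo "35B MoE:   $(q8_0_weights_gb 35) GB"
```

Both estimates land well under 96 GB, which leaves headroom for the KV cache and activations at the 4096-token context used in this tutorial.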

The timing is good, because llama.cpp recently added support for Qwen3.6's multi-token prediction (MTP) heads. MTP is speculative decoding without a separate draft model: the MTP heads draft a few tokens ahead, the main model checks them in one pass, and you turn it on with two flags.

We ran the numbers on one card: 1.73x faster on Qwen3.6 27B Dense, 1.17x on the 35B-A3B MoE.

Model	Baseline	MTP	Speedup
Qwen3.6 27B Dense Q8_0	45.76 tok/s	79.37 tok/s	1.73x
Qwen3.6 35B-A3B MoE Q8_0	193.36 tok/s	225.48 tok/s	1.17x

This tutorial was prompted by Julien Chaumond's short LinkedIn walkthrough for running the new Qwen3.6 MTP GGUF models in llama.cpp. We ran it on a JarvisLabs RTX PRO 6000.

What you need

Use an RTX PRO 6000 with enough disk for the source build and GGUF downloads. For this tutorial, use 160 GB storage.

Item	Value
GPU	RTX PRO 6000 Blackwell Server Edition
Region used	IN1
Storage	160 GB
Server	`llama-server` from latest `llama.cpp` source
Dense model	`ggml-org/Qwen3.6-27B-MTP-GGUF`
MoE model	`ggml-org/Qwen3.6-35B-A3B-MTP-GGUF`

The commands below keep everything under /home, so the source checkout, build, Hugging Face cache, and benchmark output persist when the instance is paused and resumed.

Excluding the llama.cpp build, the benchmark itself runs in about 8 minutes on a fresh instance. That covers downloading both GGUFs, starting each model, a warmup request per case, and four benchmark passes: dense baseline, dense MTP, MoE baseline, and MoE MTP.

Cached reruns are much faster, because the build output and the downloaded GGUFs are already on disk under /home.

Create the GPU instance

You can spin one up from the JarvisLabs dashboard, or use our CLI if you prefer the terminal. The rest of this tutorial uses the CLI.

Install and authenticate the JarvisLabs CLI:

bash

uv tool install jarvislabs
jl setup

Create an RTX PRO 6000 instance:

bash

jl create \
  --gpu RTX-PRO6000 \
  --region IN1 \
  --storage 160 \
  --name qwen36-mtp \
  --http-ports 8080

SSH into the instance:

bash

jl ssh <machine_id>

Run the commands in the following sections inside the JarvisLabs instance, over that SSH session. The jl run and jl download commands near the end run on your local machine.

Build llama.cpp

Install build dependencies:

bash

export DEBIAN_FRONTEND=noninteractive

apt-get update
apt-get install -y --no-install-recommends \
  build-essential \
  ca-certificates \
  cmake \
  curl \
  git \
  jq \
  libcurl4-openssl-dev \
  libssl-dev \
  python3

Clone and build llama-server:

bash

if [ ! -d /home/llama.cpp/.git ]; then
  git clone --depth 1 https://github.com/ggml-org/llama.cpp /home/llama.cpp
fi

cd /home/llama.cpp
git fetch --depth 1 origin master
git checkout FETCH_HEAD
git rev-parse HEAD

cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DLLAMA_OPENSSL=ON \
  -DLLAMA_BUILD_TESTS=OFF

cmake --build build --config Release -j"$(nproc)" --target llama-server

We used llama.cpp commit d14ce3dab4de197adec5166faa54ac5db8262f26 in our run. You do not need that exact commit, but you do want a recent source build because Qwen3.6 MTP support is new.

Run Qwen3.6 27B Dense with MTP

Start the dense model with MTP:

bash

/home/llama.cpp/build/bin/llama-server \
  -hf ggml-org/Qwen3.6-27B-MTP-GGUF \
  -ngl 999 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080 \
  --jinja \
  --spec-type draft-mtp \
  --spec-draft-n-max 2

The first launch downloads the model into the Hugging Face cache. Wait until the logs say the server is listening on port 8080.
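If you would rather not watch the logs, a small polling loop can wait for readiness instead. This is a minimal sketch that polls the /health endpoint, the same endpoint the benchmark script later in this tutorial checks. The function name `wait_for_health` and its default retry settings are our own choices, not part of llama.cpp.

```shell
# Poll a URL until it returns a successful HTTP status, or give up.
# Hypothetical helper: the name, argument order, and defaults are illustrative.
wait_for_health() {
  local url="$1"
  local attempts="${2:-900}"   # maximum number of polls
  local interval="${3:-2}"     # seconds to sleep between polls
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -sf "$url" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}

# Example call from a second SSH session:
#   wait_for_health http://127.0.0.1:8080/health && echo "server ready"
```

The function returns non-zero if the server never answers, so it can gate the test request with `&&`.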

In another SSH session, send a test completion:

bash

curl -s http://127.0.0.1:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Explain multi-token prediction in one paragraph.",
    "n_predict": 256,
    "temperature": 0,
    "stream": false
  }' | jq '.timings'

Look for predicted_per_second in the output. In our benchmark, which generates 512 tokens per request, the dense MTP run averaged 79.37 tok/s after warmup. Your number can differ with prompt length, output length, quantization, server settings, and the llama.cpp build you are on.

Compare against baseline

Stop the MTP server:

bash

pkill -f llama-server

Start the same dense model without MTP by removing the speculative flags:

bash

/home/llama.cpp/build/bin/llama-server \
  -hf ggml-org/Qwen3.6-27B-MTP-GGUF \
  -ngl 999 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080 \
  --jinja

Run the same curl request again. In our run, the dense baseline averaged 45.76 tok/s, while MTP averaged 79.37 tok/s.

Run	Tokens	Baseline tok/s	MTP tok/s
1	512	45.62	79.41
2	512	45.79	79.41
3	512	45.86	79.29
Average	512	45.76	79.37

That is a 1.73x generation speedup on this run.
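The speedup figure is the ratio of the two averages. A small helper makes the arithmetic repeatable for your own numbers; the function name `speedup` is ours, and the inputs below are the averages reported in this post.

```shell
# Ratio of MTP throughput to baseline throughput, rounded to two decimals.
speedup() {
  # $1: baseline tok/s, $2: MTP tok/s
  awk -v base="$1" -v mtp="$2" 'BEGIN { printf "%.2fx\n", mtp / base }'
}

echo "dense: $(speedup 45.76 79.37)"
echo "moe:   $(speedup 193.36 225.48)"
```

Substitute the predicted_per_second values from your own baseline and MTP runs to get a comparable multiplier.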

Try the MoE model

The 35B-A3B MoE model is already much faster than the dense model: 193.36 tok/s versus 45.76 tok/s at baseline in our run. For MTP, use --spec-draft-n-max 3:

bash

pkill -f llama-server

/home/llama.cpp/build/bin/llama-server \
  -hf ggml-org/Qwen3.6-35B-A3B-MTP-GGUF \
  -ngl 999 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080 \
  --jinja \
  --spec-type draft-mtp \
  --spec-draft-n-max 3

In our run, MTP still helped the MoE model, but the relative gain was smaller:

Model	Baseline avg	MTP avg	Speedup
Qwen3.6 35B-A3B MoE Q8_0	193.36 tok/s	225.48 tok/s	1.17x

Why does the MoE gain so much less? The 35B-A3B model has 35B total parameters but only 3B active per token, so each baseline decoding step is already cheap. Speculative decoding saves time by letting one target-model pass confirm several tokens, and when that pass is already cheap, there is less left for MTP to save. We saw the same pattern when we benchmarked MTP on Gemma 4: the dense 31B got a big MTP boost, while the 26B-A4B MoE gained much less for the same reason.
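One way to put a number on "already cheap" is a crude bandwidth ceiling: if decoding were limited only by reading the active weights once per token, throughput could not exceed memory bandwidth divided by the bytes of active weights. The sketch below uses that simplification with the 1597 GB/s figure and Q8_0's 34 bytes per 32 weights. It is our own back-of-the-envelope model, not something llama.cpp reports, and it ignores the KV cache, attention, routing, and kernel launch overhead.

```shell
# Upper bound on decode tok/s under a weights-read-only model (a simplification).
decode_ceiling_tok_s() {
  # $1: active parameters per token in billions, $2: memory bandwidth in GB/s
  awk -v active_b="$1" -v bw="$2" \
    'BEGIN { printf "%.1f\n", bw / (active_b * 34 / 32) }'
}

echo "dense, 27B active: $(decode_ceiling_tok_s 27 1597) tok/s ceiling"
echo "MoE, 3B active:    $(decode_ceiling_tok_s 3 1597) tok/s ceiling"
```

Against these ceilings, the measured baselines of 45.76 tok/s and 193.36 tok/s come to roughly 82% and 39%. On this rough model the dense run sits close to its weight-read limit, while the MoE run spends much of each step on costs other than reading weights, which fits the smaller MTP gain.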

llama.cpp reports draft_n and draft_n_accepted in the /completion response timings, and our benchmark script logs them to the CSV. From our 512-token requests:

Model	n_max	draft_n	accepted	Acceptance
Qwen3.6 27B Dense	2	442	289	65%
Qwen3.6 35B-A3B MoE	3	578	318	55%
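The Acceptance column is draft_n_accepted divided by draft_n. A short helper reproduces it from the two counters; the function name `acceptance_pct` is ours.

```shell
# Share of drafted tokens that the main model accepted, as a whole percentage.
acceptance_pct() {
  # $1: draft_n, $2: draft_n_accepted
  awk -v drafted="$1" -v accepted="$2" \
    'BEGIN { printf "%.0f%%\n", 100 * accepted / drafted }'
}

echo "dense: $(acceptance_pct 442 289)"
echo "moe:   $(acceptance_pct 578 318)"
```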

Acceptance is workload-dependent. Code and math prompts can push it above 80%, and open prose pulls it lower. The Hugging Face model cards recommend --spec-draft-n-max 2 or 3. Higher values give diminishing returns as more drafts get rejected.
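To see why higher draft lengths pay off less, consider a toy model in which each drafted token is accepted independently with probability p and drafting stops counting at the first rejection. The expected number of accepted tokens per step is then p + p^2 + ... + p^n. This is an illustrative assumption, not how llama.cpp measures acceptance, and p here is a hypothetical per-token probability rather than the aggregate figure in the table above.

```shell
# Expected accepted draft tokens per step under an independent-acceptance toy model.
expected_accepted() {
  # $1: hypothetical per-token acceptance probability p, $2: draft length n
  awk -v p="$1" -v n="$2" 'BEGIN {
    term = 1; total = 0
    for (i = 1; i <= n; i++) { term *= p; total += term }  # adds p^i each round
    printf "%.2f\n", total
  }'
}

for n in 1 2 3 4; do
  echo "draft length $n: $(expected_accepted 0.65 "$n") expected accepted tokens"
done
```

Each extra draft slot adds only p^n expected tokens, so the benefit per slot shrinks geometrically while the cost of drafting that slot does not.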

The speedup is workload-dependent too. The multiplier moves with model size, active params, quantization, draft-n setting, prompt shape, and hardware. On this RTX PRO 6000 run, dense Qwen3.6 gained much more than the MoE in relative terms.

Run the full benchmark script

If you want to reproduce our four-way comparison, save this locally as bench_qwen36_mtp.sh and run it with jl run. It builds llama.cpp, runs dense baseline, dense MTP, MoE baseline, and MoE MTP, and writes a CSV for each case.

bash

#!/usr/bin/env bash
set -euo pipefail

export DEBIAN_FRONTEND=noninteractive

apt-get update
apt-get install -y --no-install-recommends \
  build-essential ca-certificates cmake curl git jq \
  libcurl4-openssl-dev libssl-dev python3

if [ ! -d /home/llama.cpp/.git ]; then
  git clone --depth 1 https://github.com/ggml-org/llama.cpp /home/llama.cpp
fi

cd /home/llama.cpp
git fetch --depth 1 origin master
git checkout FETCH_HEAD
git rev-parse HEAD

cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DLLAMA_OPENSSL=ON \
  -DLLAMA_BUILD_TESTS=OFF
cmake --build build --config Release -j"$(nproc)" --target llama-server

SERVER=/home/llama.cpp/build/bin/llama-server
RESULT_DIR=/home/qwen36-mtp-results
mkdir -p "$RESULT_DIR"

PROMPT='Write a practical, technical explanation of multi-token prediction for local LLM inference. Include what changes in the decoding loop, why throughput improves, when quality can regress, and how an engineer should benchmark it on a single GPU.'

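# Start llama-server for one benchmark case and block until /health responds.
# spec_n=0 means baseline (no MTP); any other value is passed as --spec-draft-n-max.
start_server() {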
  local name="$1"
  local repo="$2"
  local spec_n="$3"

  pkill -f llama-server || true
  sleep 3

  local args=(
    "$SERVER"
    -hf "$repo"
    -ngl 999
    -c 4096
    --host 127.0.0.1
    --port 8080
    --jinja
  )

  if [ "$spec_n" != "0" ]; then
    args+=(--spec-type draft-mtp --spec-draft-n-max "$spec_n")
  fi

  echo "=== ${name} ==="
  "${args[@]}" >"$RESULT_DIR/${name}_server.log" 2>&1 &
  echo $! > "$RESULT_DIR/${name}.pid"

  for _ in $(seq 1 900); do
    if curl -sf http://127.0.0.1:8080/health >/dev/null; then
      echo "server ready: ${name}"
      return 0
    fi
    sleep 2
  done

  echo "server did not become ready: ${name}"
  tail -200 "$RESULT_DIR/${name}_server.log"
  exit 1
}

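# Send one completion request for 512 tokens, save the raw JSON response,
# and print one CSV row built from its timings.
run_completion() {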
  local name="$1"
  local run_idx="$2"
  local out_file="$RESULT_DIR/${name}_run${run_idx}.json"

  jq -n --arg prompt "$PROMPT" '{
    prompt: $prompt,
    n_predict: 512,
    temperature: 0,
    cache_prompt: false,
    stream: false
  }' | curl -sS http://127.0.0.1:8080/completion \
      -H 'Content-Type: application/json' \
      --data-binary @- > "$out_file"

  jq -r --arg name "$name" --arg run "$run_idx" '
    .timings as $t |
    [$name, $run,
     ($t.predicted_n // 0),
     ($t.predicted_ms // 0),
     ($t.predicted_per_second // 0),
     ($t.draft_n // 0),
     ($t.draft_n_accepted // 0)] | @csv
  ' "$out_file"
}

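# Run one case end to end: start the server, do a warmup request plus
# three timed runs, then stop the server.
benchmark_case() {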
  local name="$1"
  local repo="$2"
  local spec_n="$3"

  start_server "$name" "$repo" "$spec_n"

  echo "model,run,predicted_tokens,predicted_ms,predicted_tok_s,draft_n,draft_n_accepted" | tee "$RESULT_DIR/${name}.csv"
  run_completion "$name" "warmup" | tee -a "$RESULT_DIR/${name}.csv"
  for run in 1 2 3; do
    run_completion "$name" "$run" | tee -a "$RESULT_DIR/${name}.csv"
  done

  pkill -F "$RESULT_DIR/${name}.pid" || true
  sleep 3
}

benchmark_case dense_baseline ggml-org/Qwen3.6-27B-MTP-GGUF 0
benchmark_case dense_mtp ggml-org/Qwen3.6-27B-MTP-GGUF 2
benchmark_case moe_baseline ggml-org/Qwen3.6-35B-A3B-MTP-GGUF 0
benchmark_case moe_mtp ggml-org/Qwen3.6-35B-A3B-MTP-GGUF 3

Run it on one RTX PRO 6000. jl run takes a local script, spins up a fresh instance, uploads the file, runs it on the GPU, and pauses the instance when it's done, so you don't have to manage the lifecycle yourself.

bash

chmod +x bench_qwen36_mtp.sh

jl run bench_qwen36_mtp.sh \
  --gpu RTX-PRO6000 \
  --region IN1 \
  --storage 160 \
  --name qwen36-mtp-rtxpro6000

The CSV files are written under /home/qwen36-mtp-results on the instance. Pull them down to your machine with:

bash

jl download <machine_id> /home/qwen36-mtp-results ./qwen36-mtp-results -r
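Once the CSVs are on your machine, you can average the timed runs per case. The sketch below assumes the column layout written by the benchmark script (run label in column 2, predicted_tok_s in column 5) and skips the header and the warmup row. The function name `summarize_csv` and the `RESULTS` variable are ours, and the directory layout after download may differ from the default used here.

```shell
# Average predicted_tok_s over the timed runs in one benchmark CSV.
summarize_csv() {
  awk -F, '
    NR == 1 { next }            # skip the header row
    { gsub(/"/, "") }           # jq @csv quotes string fields; strip the quotes
    $2 == "warmup" { next }     # leave the warmup request out of the average
    { sum += $5; n++ }
    END { if (n > 0) printf "%.2f\n", sum / n }
  ' "$1"
}

# Assumed location of the downloaded results; override RESULTS if yours differs.
RESULTS="${RESULTS:-./qwen36-mtp-results}"
for f in "$RESULTS"/*.csv; do
  [ -f "$f" ] || continue
  echo "$(basename "$f" .csv): $(summarize_csv "$f") tok/s"
done
```

Dividing the MTP average by the baseline average for the same model gives the speedup figures quoted in this post.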

Wrapping up

MTP in llama.cpp is a cheap speedup to try if your model ships the heads: two flags and no separate draft model. On a single RTX PRO 6000 we measured 1.73x on Qwen3.6 27B Dense and 1.17x on the 35B-A3B MoE.

Both models fit on a single 96 GB card with room to spare, which is why the whole benchmark fits in one script. No tensor parallelism, no weight offload, just a different -hf value to swap models. If you work with dense or MoE models in this size class, a single big card removes a lot of orchestration work. The RTX PRO 6000 is available on JarvisLabs.