How to Run PrismAudio on JarvisLabs

Vishnu Subramanian

Founder @ JarvisLabs.ai

March 29, 2026 · 10 min read

PrismAudio is a 518M parameter Video-to-Audio model accepted at ICLR 2026 that generates synchronized audio from silent video. Give it a clip of someone drumming on water bottles, and it produces the sound of tapping and splashing. The paper reports 0.63 seconds of generation time for a 9-second sample (excluding feature extraction), versus 1.30s for MMAudio and 1.07s for ThinkSound on the same measurement.

We ran it on a JarvisLabs A100. This post walks through the setup end to end, including two non-obvious blockers we had to resolve before it would run.

What is PrismAudio?

The authors describe PrismAudio as the first framework to integrate Reinforcement Learning into Video-to-Audio generation. It decomposes reasoning into four specialized Chain-of-Thought modules:

  • Semantic - what sounds should exist
  • Temporal - timing and rhythm
  • Aesthetic - audio quality and clarity
  • Spatial - where sounds come from in the stereo field

Each module has its own reward function, trained with a technique called Fast-GRPO that uses hybrid ODE-SDE sampling to keep RL training overhead low.
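As a toy illustration of that decomposition, the four per-module rewards could be combined into a single scalar training signal like this. The dimension names come from the paper; the equal weights and the weighted-sum aggregation are illustrative assumptions on my part, not PrismAudio's actual Fast-GRPO objective:

```python
# Toy sketch: combine per-CoT-module rewards into one scalar for an RL update.
# Equal weights are an assumption; the paper's real weighting may differ.
DIMS = ("semantic", "temporal", "aesthetic", "spatial")

def total_reward(rewards, weights=None):
    """Weighted sum of per-module rewards (illustrative only)."""
    if weights is None:
        weights = {d: 0.25 for d in DIMS}  # assumed equal weighting
    return sum(weights[d] * rewards[d] for d in DIMS)

scores = {"semantic": 0.8, "temporal": 0.6, "aesthetic": 0.7, "spatial": 0.5}
print(total_reward(scores))
```

The point of the decomposition is that a weak score on one dimension (say, timing) can pull the total down even when the semantic content is right.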

It tops all baselines on VGGSound (CLAP, DeSync, PQ, and subjective MOS scores) and their new AudioCanvas benchmark. PrismAudio builds on the ThinkSound framework (NeurIPS 2025) but is smaller (518M vs 1.3B params) and, on the paper's reported generation-time measurement for a 9-second sample excluding feature extraction, faster too (0.63s vs 1.07s).

Running PrismAudio on an A100

PrismAudio's feature extraction pipeline loads three large models simultaneously (T5-Gemma, VideoPrism, and Synchformer), so it needs a GPU with enough VRAM and system RAM. We went with an A100 in the IN2 region: 40GB VRAM, 112GB system RAM, $1.29/hr.
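Before kicking off feature extraction, a stdlib-only pre-flight check of system RAM can save a failed run. This is a sketch; the 64GB threshold is a rough assumption based on our A100 experience, not an official minimum:

```python
import os

def mem_total_gb(meminfo_path="/proc/meminfo"):
    """Total system RAM in GB, parsed from /proc/meminfo (Linux only)."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) / 1024 / 1024  # value is in kB
    raise RuntimeError("MemTotal not found in " + meminfo_path)

if os.path.exists("/proc/meminfo"):  # skip gracefully off-Linux
    ram = mem_total_gb()
    print("System RAM: %.0f GB" % ram)
    if ram < 64:  # assumed threshold, not an official requirement
        print("Warning: T5-Gemma + VideoPrism + Synchformer load together; "
              "feature extraction may swap or OOM at this RAM size.")
```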

bash
jl create --gpu A100 --region IN2 --name prismaudio

Instance was up in seconds. The upstream repo ships a conda-based bootstrap (scripts/PrismAudio/setup/build_env.sh), but we wanted something lighter on the JarvisLabs image, so we used uv and ran the steps by hand. Everything below is that JarvisLabs-tested uv path, not the upstream quick start:

bash
cd /home
git clone -b prismaudio https://github.com/liuhuadai/ThinkSound.git
cd ThinkSound

# Create a virtual environment with uv
uv venv .venv --python 3.10
source .venv/bin/activate

# Install VideoPrism (Google's video encoder)
git clone https://github.com/google-deepmind/videoprism.git
cd videoprism && uv pip install . && cd ..

# Install all dependencies
uv pip install -r scripts/PrismAudio/setup/requirements.txt
uv pip install tensorflow-cpu==2.15.0
uv pip install facenet_pytorch==2.6.0 --no-deps

# Install FFmpeg system libraries (needed by torio)
apt-get update && apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev

# Download model weights (5.8GB)
git lfs install
git clone https://huggingface.co/FunAudioLLM/PrismAudio ckpts

Gotcha 1: HuggingFace Gated Model

Feature extraction died immediately:

GatedRepoError: 401 Client Error.
Cannot access gated repo for url https://huggingface.co/google/t5gemma-l-l-ul2-it
Access to model google/t5gemma-l-l-ul2-it is restricted.

PrismAudio uses Google's T5-Gemma as its text encoder for CoT descriptions. It's a gated model. You need to:

  1. Visit huggingface.co/google/t5gemma-l-l-ul2-it and accept the license
  2. Run huggingface-cli login with your token
bash
huggingface-cli login --token <your-hf-token>
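To catch this before a long run rather than mid-pipeline, a small stdlib-only check can confirm a token is actually in place. It looks at the HF_TOKEN environment variable and the default cache file that huggingface-cli login writes (~/.cache/huggingface/token, unless HF_HOME points elsewhere); the function name is mine, not part of the repo:

```python
import os
from pathlib import Path

def hf_token_available(token_path=None):
    """True if a HuggingFace token is present via HF_TOKEN or the cached login file."""
    if os.environ.get("HF_TOKEN"):
        return True
    path = Path(token_path) if token_path else Path.home() / ".cache" / "huggingface" / "token"
    return path.is_file() and path.read_text().strip() != ""

if not hf_token_available():
    print("No HuggingFace token found; the gated T5-Gemma download will 401.")
```

Note this only checks that *a* token exists; you still need to have accepted the T5-Gemma license on the model page for the download to succeed.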

Gotcha 2: FFmpeg Library Path

After fixing auth, we hit another error:

ERROR - Error loading demo: Failed to initialize FFmpeg extension.
Tried versions: ['6', '5', '4', ''].

PrismAudio's requirements install av==15.0.0, and PyAV wheels bundle their own FFmpeg. But torio (torchaudio's streaming decoder, used here for video loading) resolves FFmpeg separately at runtime, dlopening the system's shared libav* libraries (FFmpeg 4 through 6). Two different FFmpeg resolution mechanisms live in the same project, and the container's bare image doesn't ship the system libraries torio needs.

The fix is straightforward:

bash
apt-get update && apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev
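If you want to verify the fix without importing torch or torchaudio, the Python stdlib can ask the dynamic linker whether the libav* libraries torio needs are now resolvable. A quick diagnostic sketch:

```python
# Stdlib-only check: can the dynamic linker find the system libav* libraries?
# find_library returns the resolved soname (e.g. "libavcodec.so.60") or None.
from ctypes.util import find_library

def libav_status():
    """Map each libav component to the library name the linker resolves, or None."""
    return {name: find_library(name) for name in ("avcodec", "avformat", "avutil")}

for lib, resolved in libav_status().items():
    print("lib%s: %s" % (lib, resolved or "NOT FOUND - run the apt-get step above"))
```

If any entry prints NOT FOUND after the apt-get install, torio's FFmpeg extension will still fail to initialize.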

The Result

With both gotchas resolved, everything ran cleanly. Numbers below are from a fresh rerun on an IN2 A100 and will vary on your setup:

  • Feature extraction: the processing loop reported ~85 s/batch, with the full torchrun ... prismaudio_data_process.py command taking about 148 s wall-clock (model loads plus the actual encode).
  • Inference: predict.py reported about 1.28 seconds of sampling time in its final line (the script prints this in Chinese as 执行时间: 1.28 秒), while the full python predict.py ... command took about 22 s wall-clock once you include model loading and cleanup.
  • Output: a ~1.4MB WAV file at results/MMDD_batch_size1/demo.wav for our demo clip.
Predicting 1 samples with length 179 for ids: ['demo']
24it [00:01, 19.32it/s]
执行时间: 1.28 秒

Total compute cost for the entire experiment was still well under a dollar of A100 time.
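The arithmetic behind that claim is simple; the 30-minute session length below is a generous assumption covering instance creation, downloads, and both runs, using the hourly rate quoted above:

```python
# Back-of-envelope cost check using the IN2 A100 rate quoted in this post.
A100_RATE_USD_PER_HR = 1.29
session_hours = 0.5  # assumed total instance time, padded well beyond our runs
cost = A100_RATE_USD_PER_HR * session_hours
print("Estimated cost: $%.2f" % cost)
```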

What We Learned

  1. Multi-model pipelines need system RAM. PrismAudio loads T5-Gemma, VideoPrism, and Synchformer simultaneously during feature extraction. The A100's 112GB of system RAM handles this comfortably. When choosing a GPU for multi-model workloads, check both VRAM and system RAM specs.

  2. FFmpeg version compatibility is real. PyAV and torio resolve FFmpeg libraries through different paths. Install the system FFmpeg libraries via apt-get and you're good.

  3. Gated models are a silent dependency. PrismAudio's README mentions downloading its own weights, but the T5-Gemma encoder is fetched at runtime from HuggingFace. You need to accept the license AND authenticate before your first run.

  4. GPU switching is easy, but region-locked and not fully stateful. Need a different GPU? Pause the instance and resume it with a different GPU from the dashboard or CLI, and the work that lives in your home directory (including the uv venv under /home/ThinkSound) comes back with it. The things that do not persist are system packages you installed with apt-get, so after a resume you will need to rerun the apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev step before torio can load video. jl resume also keeps you in the original region, so the set of GPUs you can switch to is whatever that region offers. Check jl gpus before you pause so you know what's available.


Quick Start Recipe

If you just want PrismAudio running, here's the clean path. Tested on JarvisLabs A100 (IN2 region).

Prerequisites

To install the CLI:

bash
uv tool install jarvislabs
jl setup

Step 1: Create an A100 Instance

bash
jl create --gpu A100 --region IN2 --name prismaudio

Tip: An A100 (40GB VRAM, 112GB RAM, $1.29/hr) is the sweet spot for PrismAudio. If you want faster inference, you can pause and resume on an IN2 H100 and your home directory (including the uv venv) comes back with it. Apt-installed system packages like ffmpeg and libav* do not persist across resume, so you will need to rerun the FFmpeg apt-get install step before torio can load video again. H200 is currently only offered in EU1, so that's not a same-region swap from this tutorial.

Step 2: Install Everything

SSH into the instance with the CLI:

bash
jl ssh <instance-id>

If you want to run the install as a one-shot without attaching, you can also use jl exec <instance-id> -- bash -lc '...'. All commands below run inside the instance:

bash
# Clone PrismAudio
cd /home
git clone -b prismaudio https://github.com/liuhuadai/ThinkSound.git
cd ThinkSound

# Create isolated environment with uv
uv venv .venv --python 3.10
source .venv/bin/activate

# Install VideoPrism
git clone https://github.com/google-deepmind/videoprism.git
cd videoprism && uv pip install . && cd ..

# Install dependencies (this pulls in huggingface-hub, which provides the huggingface-cli used below)
uv pip install -r scripts/PrismAudio/setup/requirements.txt
uv pip install tensorflow-cpu==2.15.0
uv pip install facenet_pytorch==2.6.0 --no-deps

# Install FFmpeg system libraries (needed by torio)
apt-get update && apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev

# Login to HuggingFace (required for the gated T5-Gemma encoder)
huggingface-cli login --token <your-hf-token>

# Download model weights (5.8GB)
git lfs install
git clone https://huggingface.co/FunAudioLLM/PrismAudio ckpts

Step 3: Run Inference

bash
source .venv/bin/activate
export TF_CPP_MIN_LOG_LEVEL=2

# Prepare your video
mkdir -p videos cot_coarse results
cp /path/to/your/video.mp4 videos/demo.mp4

# Create CoT description
echo "id,caption_cot" > cot_coarse/cot.csv
echo 'demo,"Semantic: describe the sounds. Temporal: describe the rhythm. Aesthetic: describe the audio quality. Spatial: describe where sounds come from."' >> cot_coarse/cot.csv

# Extract features (~2-3 minutes of wall-clock on A100, including model loads)
torchrun --nproc_per_node=1 data_utils/prismaudio_data_process.py --inference_mode True

# Get video duration
DURATION=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 videos/demo.mp4)

# Run inference (sampling ~1.3 s; full command around 20 s once model loads count)
python predict.py \
    --model-config "PrismAudio/configs/model_configs/prismaudio.json" \
    --duration-sec "$DURATION" \
    --ckpt-dir "ckpts/prismaudio.ckpt" \
    --results-dir "results"

Your generated audio is at results/MMDD_batch_size1/demo.wav.
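The two echo lines in the step above write cot.csv by hand; for a batch of clips, a short script avoids CSV-quoting mistakes. The column names (id, caption_cot) match the CSV this tutorial creates; the description text and function name here are placeholders of mine:

```python
import csv

def write_cot_csv(descriptions, out_path):
    """Write one (id, caption_cot) row per video, with proper CSV quoting."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "caption_cot"])
        for video_id, cot in sorted(descriptions.items()):
            writer.writerow([video_id, cot])

write_cot_csv(
    {"demo": "Semantic: bottle taps and water splashes. Temporal: steady rhythm. "
             "Aesthetic: clean, close-miked. Spatial: centered in the stereo field."},
    "cot.csv",
)
```

The csv module handles embedded commas and quotes in the descriptions, which the raw echo approach does not.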

Step 4: Launch the Gradio Web UI

For interactive use with a browser-based interface:

bash
source .venv/bin/activate
export TF_CPP_MIN_LOG_LEVEL=2
export GRADIO_TEMP_DIR=/tmp/gradio_temp
mkdir -p /tmp/gradio_temp

python app.py --server_name 0.0.0.0 --server_port 6006

app.py boots without GRADIO_TEMP_DIR being set, but the generation path in the app reads it when it writes temp files, so set it up front to avoid a KeyError on your first generation request.
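One way to make that foolproof is a few lines of Python run before launching the app: setdefault keeps any value you already exported and only fills in the fallback otherwise. This wrapper is a sketch of mine, not part of the repo:

```python
import os
from pathlib import Path

def ensure_gradio_tmp(default="/tmp/gradio_temp"):
    """Set GRADIO_TEMP_DIR if unset and make sure the directory exists."""
    path = os.environ.setdefault("GRADIO_TEMP_DIR", default)
    Path(path).mkdir(parents=True, exist_ok=True)
    return path

print("Gradio temp dir:", ensure_gradio_tmp())
```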

The Gradio app loads all models at startup (~2 minutes), then serves on port 6006. On JarvisLabs, port 6006 is automatically exposed as an API endpoint through Cloudflare. Click the API button on your instance in the dashboard, then click API 1 to open the Gradio interface in your browser.

Upload any video, write a CoT description covering the four dimensions (Semantic, Temporal, Aesthetic, Spatial), and hit generate. In our runs, feature extraction took roughly 85 seconds inside the processing loop (full command wall-clock closer to 2-3 minutes on a cold load), and sampling itself was ~1.3 seconds with the full predict command around 20 seconds wall-clock.

Step 5: Clean Up

When you're done, pause the instance to stop billing:

bash
jl pause <instance-id>

Or destroy it entirely:

bash
jl destroy <instance-id>

Common Issues

| Issue | Cause | Fix |
|---|---|---|
| GatedRepoError: 401 | T5-Gemma is a gated model | Accept the license at HuggingFace, then huggingface-cli login |
| Failed to initialize FFmpeg extension | torio resolves FFmpeg separately from PyAV; needs system FFmpeg 4-6 | apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev |
| Gradio KeyError: 'GRADIO_TEMP_DIR' | Missing env var | export GRADIO_TEMP_DIR=/tmp/gradio_temp && mkdir -p /tmp/gradio_temp |

Cost Summary

| GPU | Region | VRAM | RAM | $/hr | Notes |
|---|---|---|---|---|---|
| A100 | IN2 | 40GB | 112GB | $1.29 | Recommended. Tested; ~1.3s sampling, ~20s full predict command. |
| H100 | IN2 or EU1 | 80GB | 200GB | $2.69 (IN2) / $2.99 (EU1) | Faster inference if you need it; IN2 is a same-region swap from this tutorial. |
| H200 | EU1 only | 141GB | 200GB | $3.80 | Maximum headroom. Currently EU1-only, so not a pause-and-resume upgrade from an IN2 A100. |

Total compute cost for this experiment on A100: under $1.


PrismAudio is a compact V2A model at 518M parameters with fast reported generation times and four-dimensional CoT reasoning that ties audio to the video's content. Licensing is split between the code repo and the HuggingFace model card, so check both before planning any production use.

If you want to try it on JarvisLabs, spin up an A100 in IN2 and follow the recipe above. Get started at jarvislabs.ai.
