How to Run PrismAudio on JarvisLabs

PrismAudio is a 518M parameter Video-to-Audio model accepted at ICLR 2026 that generates synchronized audio from silent video. Give it a clip of someone drumming on water bottles, and it produces the sound of tapping and splashing. The paper reports 0.63 seconds of generation time for a 9-second sample (excluding feature extraction), versus 1.30s for MMAudio and 1.07s for ThinkSound on the same measurement.
We ran it on a JarvisLabs A100. This post walks through the setup end to end, including two non-obvious blockers we had to resolve before it would run.
What is PrismAudio?
The authors describe PrismAudio as the first framework to integrate Reinforcement Learning into Video-to-Audio generation. It decomposes reasoning into four specialized Chain-of-Thought modules:
- Semantic - what sounds should exist
- Temporal - timing and rhythm
- Aesthetic - audio quality and clarity
- Spatial - where sounds come from in the stereo field
Each module has its own reward function, trained with a technique called Fast-GRPO that uses hybrid ODE-SDE sampling to keep RL training overhead low.
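As a mental model only (the names, weights, and aggregation below are our illustration, not the paper's actual formulation), the four per-module rewards collapse into one scalar objective per sample before a GRPO-style update:

```python
# Illustrative only: combine four CoT-module rewards into one scalar,
# as a GRPO-style trainer might before computing group-relative advantages.
# The equal weights and the simple weighted sum are assumptions.
def total_reward(semantic: float, temporal: float,
                 aesthetic: float, spatial: float,
                 weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    return sum(w * r for w, r in zip(weights, (semantic, temporal, aesthetic, spatial)))

print(total_reward(0.9, 0.7, 0.8, 0.6))
```

The point of the decomposition is that a clip can score high on one axis (the right sounds exist) while failing another (wrong timing), and each failure mode gets its own gradient signal.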
It tops all baselines on VGGSound (CLAP, DeSync, PQ, and subjective MOS scores) and their new AudioCanvas benchmark. PrismAudio builds on the ThinkSound framework (NeurIPS 2025) but is smaller (518M vs 1.3B params) and, on the paper's reported generation-time measurement for a 9-second sample excluding feature extraction, faster too (0.63s vs 1.07s).
- GitHub: FunAudioLLM/ThinkSound (prismaudio branch)
- Paper: arxiv.org/abs/2511.18833
- Weights: HuggingFace | ModelScope
Running PrismAudio on an A100
PrismAudio's feature extraction pipeline loads three large models simultaneously (T5-Gemma, VideoPrism, and Synchformer), so it needs a GPU with enough VRAM and system RAM. We went with an A100 in the IN2 region: 40GB VRAM, 112GB system RAM, $1.29/hr.
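You can verify that headroom from inside the instance with a stdlib-only probe (our own sketch, not part of the repo; Linux-specific since it reads /proc/meminfo):

```python
def total_ram_gb() -> float:
    """Return total system RAM in GiB, read from /proc/meminfo (Linux only)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                kib = int(line.split()[1])  # /proc/meminfo reports KiB
                return kib / (1024 ** 2)
    raise RuntimeError("MemTotal not found in /proc/meminfo")

print(f"{total_ram_gb():.1f} GiB")
```

On the A100 flavor above this should report roughly 112 GiB; if you see far less, the three-model feature extraction step is a likely OOM candidate.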
```bash
jl create --gpu A100 --region IN2 --name prismaudio
```

The instance was up in seconds. The upstream repo ships a conda-based bootstrap (scripts/PrismAudio/setup/build_env.sh), but we wanted something lighter on the JarvisLabs image, so we used uv and ran the steps by hand. Everything below is that JarvisLabs-tested uv path, not the upstream quick start:
```bash
cd /home
git clone -b prismaudio https://github.com/liuhuadai/ThinkSound.git
cd ThinkSound

# Create a virtual environment with uv
uv venv .venv --python 3.10
source .venv/bin/activate

# Install VideoPrism (Google's video encoder)
git clone https://github.com/google-deepmind/videoprism.git
cd videoprism && uv pip install . && cd ..

# Install all dependencies
uv pip install -r scripts/PrismAudio/setup/requirements.txt
uv pip install tensorflow-cpu==2.15.0
uv pip install facenet_pytorch==2.6.0 --no-deps

# Install FFmpeg system libraries (needed by torio)
apt-get update && apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev

# Download model weights (5.8GB)
git lfs install
git clone https://huggingface.co/FunAudioLLM/PrismAudio ckpts
```

Gotcha 1: HuggingFace Gated Model
Feature extraction died immediately:
```
GatedRepoError: 401 Client Error.
Cannot access gated repo for url https://huggingface.co/google/t5gemma-l-l-ul2-it
Access to model google/t5gemma-l-l-ul2-it is restricted.
```

PrismAudio uses Google's T5-Gemma as its text encoder for CoT descriptions. It's a gated model. You need to:
- Visit huggingface.co/google/t5gemma-l-l-ul2-it and accept the license
- Run `huggingface-cli login` with your token:
```bash
huggingface-cli login --token <your-hf-token>
```

Gotcha 2: FFmpeg Library Path
After fixing auth, we hit another error:
```
ERROR - Error loading demo: Failed to initialize FFmpeg extension.
Tried versions: ['6', '5', '4', ''].
```

PrismAudio's requirements install av==15.0.0, but torio (torchaudio's streaming decoder, used for video loading) resolves FFmpeg libraries separately and needs compatible system FFmpeg shared libraries visible to the dynamic loader. Two different FFmpeg resolution mechanisms in the same project, so the container's bare image is not enough.
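A quick way to see whether any FFmpeg shared libraries are visible to the dynamic loader at all (a stdlib sketch of our own, independent of torio):

```python
import ctypes.util

# find_library searches the same linker paths the FFmpeg extension will probe.
for lib in ("avcodec", "avformat", "avutil"):
    found = ctypes.util.find_library(lib)
    print(lib, "->", found or "NOT FOUND")
```

If these print NOT FOUND, torio has nothing to load, no matter which PyAV wheel is installed in the venv.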
The fix is straightforward:

```bash
apt-get update && apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev
```

The Result
With both gotchas resolved, everything ran clean. Numbers below are from a fresh rerun on an IN2 A100 and will vary on your setup:
- Feature extraction: the processing loop reported ~85 s/batch, with the full `torchrun ... prismaudio_data_process.py` command taking about 148 s wall-clock (model loads plus the actual encode).
- Inference: `predict.py` reported about 1.28 seconds of sampling time in its final line (printed in Chinese as 执行时间: 1.28 秒, "execution time: 1.28 seconds"), while the full `python predict.py ...` command took about 22 s wall-clock once you include model loading and cleanup.
- Output: a ~1.4MB WAV file at `results/MMDD_batch_size1/demo.wav` for our demo clip.

```
Predicting 1 samples with length 179 for ids: ['demo']
24it [00:01, 19.32it/s]
执行时间: 1.28 秒
```

Total compute cost for the entire experiment was still well under a dollar of A100 time.
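The cost claim is easy to reproduce; a back-of-the-envelope sketch (the session length is our assumption, the hourly rate is from this post):

```python
# Rough cost check: billed time x hourly rate. The ~30 min session is an
# assumption covering setup, feature extraction, and a handful of runs.
rate_per_hr = 1.29   # A100 in IN2, $/hr
session_hr = 0.5
cost = rate_per_hr * session_hr
print(f"${cost:.2f}")  # well under a dollar
```

Even doubling the session to a full hour keeps the experiment at the $1.29 hourly rate.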
What We Learned
- Multi-model pipelines need system RAM. PrismAudio loads T5-Gemma, VideoPrism, and Synchformer simultaneously during feature extraction. The A100's 112GB of system RAM handles this comfortably. When choosing a GPU for multi-model workloads, check both VRAM and system RAM specs.
- FFmpeg version compatibility is real. PyAV and torio resolve FFmpeg libraries through different paths. Install the system FFmpeg libraries via `apt-get` and you're good.
- Gated models are a silent dependency. PrismAudio's README mentions downloading its own weights, but the T5-Gemma encoder is fetched at runtime from HuggingFace. You need to accept the license AND authenticate before your first run.
- GPU switching is easy, but region-locked and not fully stateful. Need a different GPU? Pause the instance and resume it with a different GPU from the dashboard or CLI, and the work that lives in your home directory (including the `uv` venv under `/home/ThinkSound`) comes back with it. System packages installed with `apt-get` do not persist, so after a resume you will need to rerun the `apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev` step before torio can load video. `jl resume` also keeps you in the original region, so the set of GPUs you can switch to is whatever that region offers. Check `jl gpus` before you pause so you know what's available.
Quick Start Recipe
If you just want PrismAudio running, here's the clean path. Tested on JarvisLabs A100 (IN2 region).
Prerequisites
- A JarvisLabs account (jarvislabs.ai)
- The `jl` CLI installed (installation guide)
- A HuggingFace token with access to google/t5gemma-l-l-ul2-it (accept the license first)

To install the CLI:

```bash
uv tool install jarvislabs
jl setup
```

Step 1: Create an A100 Instance

```bash
jl create --gpu A100 --region IN2 --name prismaudio
```

Tip: An A100 (40GB VRAM, 112GB RAM, $1.29/hr) is the sweet spot for PrismAudio. If you want faster inference, you can pause and resume on an IN2 H100 and your home directory (including the `uv` venv) comes back with it. Apt-installed system packages like `ffmpeg` and `libav*` do not persist across a resume, so you will need to rerun the FFmpeg `apt-get install` step before torio can load video again. H200 is currently only offered in EU1, so that's not a same-region swap from this tutorial.
Step 2: Install Everything
SSH into the instance with the CLI:
```bash
jl ssh <instance-id>
```

If you want to run the install as a one-shot without attaching, you can also use `jl exec <instance-id> -- bash -lc '...'`. All commands below run inside the instance:

```bash
# Clone PrismAudio
cd /home
git clone -b prismaudio https://github.com/liuhuadai/ThinkSound.git
cd ThinkSound

# Create isolated environment with uv
uv venv .venv --python 3.10
source .venv/bin/activate

# Install VideoPrism
git clone https://github.com/google-deepmind/videoprism.git
cd videoprism && uv pip install . && cd ..

# Install dependencies (already pulls huggingface-hub, which provides the `hf` CLI)
uv pip install -r scripts/PrismAudio/setup/requirements.txt
uv pip install tensorflow-cpu==2.15.0
uv pip install facenet_pytorch==2.6.0 --no-deps

# Install FFmpeg system libraries (needed by torio)
apt-get update && apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev

# Login to HuggingFace (required for the gated T5-Gemma encoder)
huggingface-cli login --token <your-hf-token>

# Download model weights (5.8GB)
git lfs install
git clone https://huggingface.co/FunAudioLLM/PrismAudio ckpts
```

Step 3: Run Inference
```bash
source .venv/bin/activate
export TF_CPP_MIN_LOG_LEVEL=2

# Prepare your video
mkdir -p videos cot_coarse results
cp /path/to/your/video.mp4 videos/demo.mp4

# Create CoT description
echo "id,caption_cot" > cot_coarse/cot.csv
echo 'demo,"Semantic: describe the sounds. Temporal: describe the rhythm. Aesthetic: describe the audio quality. Spatial: describe where sounds come from."' >> cot_coarse/cot.csv

# Extract features (~2-3 minutes of wall-clock on A100, including model loads)
torchrun --nproc_per_node=1 data_utils/prismaudio_data_process.py --inference_mode True

# Get video duration
DURATION=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 videos/demo.mp4)

# Run inference (sampling ~1.3 s; full command around 20 s once model loads count)
python predict.py \
  --model-config "PrismAudio/configs/model_configs/prismaudio.json" \
  --duration-sec "$DURATION" \
  --ckpt-dir "ckpts/prismaudio.ckpt" \
  --results-dir "results"
```

Your generated audio is at `results/MMDD_batch_size1/demo.wav`.
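The two echo lines above work for this simple caption, but hand-quoting CSV breaks as soon as the caption itself contains double quotes. A small stdlib sketch that writes the same cot_coarse/cot.csv with proper quoting (the caption text is just the placeholder from the shell version):

```python
import csv
import os

caption = ("Semantic: describe the sounds. Temporal: describe the rhythm. "
           "Aesthetic: describe the audio quality. Spatial: describe where sounds come from.")

os.makedirs("cot_coarse", exist_ok=True)
with open("cot_coarse/cot.csv", "w", newline="") as f:
    writer = csv.writer(f)  # handles commas and quotes inside the caption
    writer.writerow(["id", "caption_cot"])
    writer.writerow(["demo", caption])
```

The `id` column must match the video filename stem (`demo` for `videos/demo.mp4`), same as in the shell version.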
Step 4: Launch the Gradio Web UI
For interactive use with a browser-based interface:
```bash
source .venv/bin/activate
export TF_CPP_MIN_LOG_LEVEL=2
export GRADIO_TEMP_DIR=/tmp/gradio_temp
mkdir -p /tmp/gradio_temp
python app.py --server_name 0.0.0.0 --server_port 6006
```

app.py boots without GRADIO_TEMP_DIR being set, but the generation path in the app reads it when it writes temp files, so set it up front to avoid a KeyError on your first generation request.
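If you launch the app from a script rather than a shell, a defensive pattern is to resolve the variable with a fallback before starting it; a general sketch (our own, not a patch to app.py):

```python
import os
import tempfile

# Use GRADIO_TEMP_DIR if set, otherwise fall back to a tmp dir,
# and make sure the directory exists before the app ever writes to it.
gradio_tmp = os.environ.get(
    "GRADIO_TEMP_DIR",
    os.path.join(tempfile.gettempdir(), "gradio_temp"),
)
os.makedirs(gradio_tmp, exist_ok=True)
os.environ["GRADIO_TEMP_DIR"] = gradio_tmp  # inherited by the app process
print(gradio_tmp)
```

Because environment variables are inherited by child processes, setting it here covers a subsequent `subprocess` launch of app.py as well.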
The Gradio app loads all models at startup (~2 minutes), then serves on port 6006. On JarvisLabs, port 6006 is automatically exposed as an API endpoint through Cloudflare. Click the API button on your instance in the dashboard, then click API 1 to open the Gradio interface in your browser.
Upload any video, write a CoT description covering the four dimensions (Semantic, Temporal, Aesthetic, Spatial), and hit generate. In our runs, feature extraction took roughly 85 seconds inside the processing loop (full command wall-clock closer to 2-3 minutes on a cold load), and sampling itself was ~1.3 seconds with the full predict command around 20 seconds wall-clock.
Step 5: Clean Up
When you're done, pause the instance to stop billing:
```bash
jl pause <instance-id>
```

Or destroy it entirely:

```bash
jl destroy <instance-id>
```

Common Issues
| Issue | Cause | Fix |
|---|---|---|
| `GatedRepoError: 401` | T5-Gemma is a gated model | Accept the license on HuggingFace, then `huggingface-cli login` |
| `Failed to initialize FFmpeg extension` | torio resolves FFmpeg separately from PyAV; needs system FFmpeg 4-6 | `apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev` |
| Gradio `KeyError: 'GRADIO_TEMP_DIR'` | Missing env var | `export GRADIO_TEMP_DIR=/tmp/gradio_temp && mkdir -p /tmp/gradio_temp` |
Cost Summary
| GPU | Region | VRAM | RAM | $/hr | Notes |
|---|---|---|---|---|---|
| A100 | IN2 | 40GB | 112GB | $1.29 | Recommended. Tested; ~1.3s sampling, ~20s full predict command. |
| H100 | IN2 or EU1 | 80GB | 200GB | $2.69 (IN2) / $2.99 (EU1) | Faster inference if you need it; IN2 is a same-region swap from this tutorial. |
| H200 | EU1 only | 141GB | 200GB | $3.80 | Maximum headroom. Currently EU1-only, so not a pause-and-resume upgrade from an IN2 A100. |
Total compute cost for this experiment on A100: under $1.
PrismAudio is a compact V2A model at 518M parameters with fast reported generation times and four-dimensional CoT reasoning that ties audio to the video's content. Licensing is split between the code repo and the HuggingFace model card, so check both before planning any production use.
If you want to try it on JarvisLabs, spin up an A100 in IN2 and follow the recipe above. Get started at jarvislabs.ai.