vLLM is an advanced, open-source Python library engineered for efficient deployment and high-throughput serving of large language models (LLMs). It boasts state-of-the-art serving throughput and introduces innovative features like PagedAttention for optimal attention memory management and continuous batching for request handling. The library supports a wide array of Hugging Face models, such as LLaMA-2 and GPT-J, along with quantization methods like GPTQ and SqueezeLLM.
If you want to know more about vLLM in detail, jump here.
How to launch vLLM in jarvisLabs
Navigate to the vLLM instance on the jarvisLabs frameworks page.
After you click the framework, you will be redirected to the launch page. Check out the details about the launch page here.
Generative Transformers Models
On the launch page, the Extras option lets you choose which model you want to serve with vLLM.
Architectures & Models
We have 35+ LLMs from 20 different architectures, such as:
- Mixtral
You can also enter a custom model for your use case.
Then hit the launch button to launch your vLLM instance.
After the instance is successfully launched, you can navigate to your instances page and click the API button of the vLLM instance.
Please wait a few minutes for the model to download.
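Once the model is ready, you can query the instance over HTTP. The sketch below builds a request body for vLLM's OpenAI-compatible `/v1/completions` route; the endpoint URL and model name are placeholders, so substitute the API URL shown on your instances page and the model you launched.

```python
import json

# Hypothetical endpoint -- replace with the API URL from your instances page.
API_URL = "https://your-instance.jarvislabs.ai/v1/completions"

def build_completion_payload(prompt, model, max_tokens=64, temperature=0.7):
    """Build a request body for vLLM's OpenAI-compatible /v1/completions route."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_completion_payload("San Francisco is a", "meta-llama/Llama-2-7b-hf")
print(json.dumps(payload, indent=2))

# To actually send the request (needs the `requests` package and a running instance):
# import requests
# response = requests.post(API_URL, json=payload, timeout=60)
# print(response.json()["choices"][0]["text"])
```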
To learn how to use vLLM and how it works under the hood, check out this video on our YouTube channel.
Serving LLMs can be surprisingly slow even on expensive hardware, and this is why vLLM was built. vLLM is an open-source library for fast LLM inference and serving. It uses PagedAttention and continuous batching to serve LLMs faster, achieving up to 24x higher throughput than HuggingFace Transformers without requiring any model architecture changes.
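To see why continuous batching helps, here is a toy scheduler in plain Python (an illustrative sketch, not vLLM's actual scheduler). With static batching, a batch runs until its *longest* sequence finishes; with continuous batching, a finished sequence frees its slot immediately and a waiting request joins on the very next step.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy scheduler: each decoding step, finished sequences leave the
    batch and waiting requests join at once, instead of the whole batch
    waiting for its longest member (static batching)."""
    waiting = deque(requests)          # (request_id, tokens_to_generate)
    running = {}                       # request_id -> tokens remaining
    steps = 0
    while waiting or running:
        # Admit new requests into any free batch slots.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # One decoding step: every running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]       # slot is freed immediately
        steps += 1
    return steps

# Short sequences no longer wait behind one long sequence:
print(continuous_batching([("a", 2), ("b", 8), ("c", 2), ("d", 2), ("e", 2)]))  # 8 steps
```

With static batching the same workload would take 10 steps: the first batch is held for all 8 steps of the longest sequence, and only then can the fifth request run for its 2 steps.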
The performance of LLM serving is bottlenecked by memory. This is due to the caching of attention key and value tensors in GPU memory, called the KV cache. These KV caches typically take up a large amount of space, which slows down serving and inference. Fragmentation and over-reservation also waste 60% to 80% of this memory.
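A quick back-of-the-envelope calculation shows how large the KV cache gets. Per token, each layer stores one key vector and one value vector of size `hidden_size`; the model name and sizes below (LLaMA-2-7B in fp16) are illustrative numbers, not measured values.

```python
def kv_cache_bytes_per_token(num_layers, hidden_size, dtype_bytes=2):
    """Bytes of KV cache per token: a key AND a value vector of
    `hidden_size` elements, stored in every layer."""
    return 2 * num_layers * hidden_size * dtype_bytes

# LLaMA-2-7B: 32 layers, hidden size 4096, fp16 (2 bytes per element)
per_token = kv_cache_bytes_per_token(32, 4096)
print(per_token / 1024)                 # 512.0 -> 512 KB per token
# A 2048-token sequence therefore needs about 1 GiB of KV cache:
print(per_token * 2048 / 1024**3)       # 1.0
```

At half a megabyte per token, a handful of long sequences can exhaust a GPU's memory, which is why wasting 60-80% of it to fragmentation is so costly.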
PagedAttention is an attention algorithm inspired by the ideas of virtual memory and paging in operating systems. Virtual memory gives each process the illusion of a large, contiguous address space, while the underlying physical memory (and, when needed, secondary storage such as SSDs) is allocated non-contiguously. Paging is the memory management scheme that eliminates the need for contiguous allocation of physical memory.
The PagedAttention algorithm partitions the KV cache of each sequence into blocks, each of which contains the keys and values for a fixed number of tokens. This allows the algorithm to identify and fetch these blocks efficiently.
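The block-table idea can be sketched in a few lines of Python. This is a toy model, not vLLM's implementation: each sequence keeps a table mapping its logical blocks to arbitrary, non-contiguous physical blocks, just as a page table maps virtual pages to physical frames. The block size of 16 tokens mirrors vLLM's default.

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class PagedKVCache:
    """Toy block table: the logical blocks of a sequence map to
    arbitrary (non-contiguous) physical blocks, like pages in
    virtual memory."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.block_table = {}          # seq_id -> [physical block ids]

    def append_token(self, seq_id, num_tokens_so_far):
        table = self.block_table.setdefault(seq_id, [])
        # Allocate a new physical block only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            table.append(self.free.pop(0))
        return table

cache = PagedKVCache(num_physical_blocks=8)
for t in range(40):                    # generate a 40-token sequence
    cache.append_token("seq-0", t)
print(cache.block_table["seq-0"])      # 40 tokens -> ceil(40/16) = 3 blocks
```

Because blocks are allocated on demand and need not be contiguous, memory waste is bounded by less than one block per sequence, instead of the large over-reserved contiguous buffers described above.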
In the above example, we can see that the keys and the values (together, the KV cache) are partitioned into blocks.