vllm¶
The vllm module is a generator that uses vllm to produce text.
Why use vllm module?¶
vllm generates text very quickly. It can be more than 10x faster than the Hugging Face transformers library.
You can use a vllm model through the llama_index_llm module, but that is slow because LlamaIndex is not optimized for processing many prompts at once.
So we built a standalone module for vllm to achieve faster generation.
Module Parameters¶
llm: The name of the model to use. For example, facebook/opt-125m or mistralai/Mistral-7B-Instruct-v0.2.
max_tokens: The maximum number of tokens to generate.
temperature: The temperature of the sampling. Higher temperature means more randomness.
top_p: Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
You can also pass any parameter accepted by vllm's LLM initialization and SamplingParams.
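To make the effect of temperature and top_p more concrete, here is a minimal pure-Python sketch of the two ideas. It illustrates the concepts only and is not how vllm implements sampling internally; the function names `apply_temperature` and `top_p_filter` are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        # This sketch only covers the sampling case, so it requires a positive value.
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    # Token indices sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.0]
sharp = apply_temperature(logits, 0.1)  # low temperature: mass concentrates on the top token
flat = apply_temperature(logits, 1.0)   # higher temperature: mass is spread more evenly
candidates = top_p_filter([0.5, 0.3, 0.2], 0.7)  # keeps only the two most likely tokens
```

A lower temperature concentrates probability on the most likely token, while a smaller top_p shrinks the set of tokens that can be sampled at all.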
Example config.yaml¶
modules:
  - module_type: vllm
    llm: mistralai/Mistral-7B-Instruct-v0.2
    temperature: [ 0.1, 1.0 ]
    max_tokens: 512
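The list value for temperature in the config above can be read as two candidate settings. The sketch below assumes that a list-valued parameter means "try each value as a separate candidate"; the helper `expand_module_config` is hypothetical and only illustrates that reading, not AutoRAG's actual internals.

```python
from itertools import product


def expand_module_config(module):
    """Return one plain dict per combination of list-valued parameters."""
    keys = list(module)
    # Wrap scalars in a one-element list so product() treats every key uniformly.
    value_lists = [v if isinstance(v, list) else [v] for v in module.values()]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


module = {
    "module_type": "vllm",
    "llm": "mistralai/Mistral-7B-Instruct-v0.2",
    "temperature": [0.1, 1.0],
    "max_tokens": 512,
}

candidates = expand_module_config(module)
for candidate in candidates:
    print(candidate)
```

Under that assumption, this config describes two candidates that differ only in temperature.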
Use in Multi-GPU¶
For details, first check the vllm documentation on parallel processing.
When you use multiple GPUs, set the tensor_parallel_size parameter in the YAML file.
modules:
  - module_type: vllm
    llm: mistralai/Mistral-7B-Instruct-v0.2
    tensor_parallel_size: 2 # Number of GPUs; 2 here for a two-GPU machine.
    temperature: [ 0.1, 1.0 ]
    max_tokens: 512
You can also use any parameter from vllm.LLM, SamplingParams, and EngineArgs.
This is available from v0.2.16 onward, so upgrade to the latest version before using it.
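As a rough illustration of how YAML parameters such as tensor_parallel_size and temperature could be routed to vllm, the sketch below splits a flat parameter dict into engine arguments and sampling arguments, then calls vllm directly. The helpers `split_vllm_kwargs` and `generate`, and the `SAMPLING_KEYS` subset, are hypothetical and are not AutoRAG's implementation. The vllm calls used (`LLM`, `SamplingParams`, `LLM.generate`) are the library's public API, and running `generate` requires vllm and suitable GPUs.

```python
# An illustrative, non-exhaustive subset of SamplingParams field names.
SAMPLING_KEYS = {
    "temperature", "top_p", "top_k", "max_tokens",
    "presence_penalty", "frequency_penalty", "stop",
}


def split_vllm_kwargs(params):
    """Split a flat parameter dict into (engine kwargs, sampling kwargs)."""
    sampling = {k: v for k, v in params.items() if k in SAMPLING_KEYS}
    engine = {k: v for k, v in params.items() if k not in SAMPLING_KEYS}
    return engine, sampling


def generate(prompts, model, **params):
    """Generate one completion per prompt in a single batched call."""
    # Imported lazily so the splitting helper works without vllm installed.
    from vllm import LLM, SamplingParams

    engine_kwargs, sampling_kwargs = split_vllm_kwargs(params)
    llm = LLM(model=model, **engine_kwargs)  # e.g. tensor_parallel_size=2
    outputs = llm.generate(prompts, SamplingParams(**sampling_kwargs))
    return [out.outputs[0].text for out in outputs]


engine, sampling = split_vllm_kwargs(
    {"tensor_parallel_size": 2, "temperature": 0.1, "max_tokens": 512}
)
```

Here `engine` would hold tensor_parallel_size and `sampling` would hold temperature and max_tokens, so a call such as `generate(["Hello"], "mistralai/Mistral-7B-Instruct-v0.2", tensor_parallel_size=2, temperature=0.1, max_tokens=512)` would shard the model across two GPUs.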
Warning
Multi-GPU compatibility for AutoRAG is still under development, so full support for multi-GPU environments is not available yet.
Warning
When using the vllm module, errors may occur depending on the configuration of PyTorch. In such cases, please follow the instructions below:
Define the vllm module to operate in a single-case mode.
Set the skip_validation parameter to True when using the start_trial function in the evaluator.
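The second step above might look like the following sketch. It assumes the `Evaluator` class from `autorag.evaluator` with `qa_data_path` and `corpus_data_path` constructor arguments; the wrapper name `run_trial_without_validation` and any file paths you pass in are hypothetical.

```python
def run_trial_without_validation(qa_data_path, corpus_data_path, config_path):
    """Start an AutoRAG trial while skipping the validation step."""
    # Imported lazily so this file can be loaded without AutoRAG installed.
    from autorag.evaluator import Evaluator

    evaluator = Evaluator(
        qa_data_path=qa_data_path,
        corpus_data_path=corpus_data_path,
    )
    # skip_validation=True is the setting recommended in the warning above.
    evaluator.start_trial(config_path, skip_validation=True)
```

You would call it with your own dataset and config paths, for example `run_trial_without_validation("qa.parquet", "corpus.parquet", "config.yaml")`.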