---
myst:
  html_meta:
    title: AutoRAG - vLLM
    description: Use vLLM in AutoRAG. Highly optimized for AutoRAG when you use a local model on GPU.
    keywords: AutoRAG,RAG,LLM,generator,vLLM,LLM inference, AutoRAG multi gpu
---
# vllm

The `vllm` module is a generator that uses [vllm](https://blog.vllm.ai/2023/06/20/vllm.html).

## Why use the vllm module?

`vllm` can generate new text really fast. Its speed is more than 10x that of the Hugging Face Transformers library.
You can use a `vllm` model with the [llama_index_llm module](./llama_index_llm.md), but it is really slow because LlamaIndex is not optimized for processing many prompts at once.
So we decided to make a standalone module for vllm for faster generation speed.

## **Module Parameters**

- **llm**: Type your model name here. For example, `facebook/opt-125m` or `mistralai/Mistral-7B-Instruct-v0.2`.
- **max_tokens**: The maximum number of tokens to generate.
- **temperature**: The sampling temperature. Higher temperature means more randomness.
- **top_p**: Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
- Plus all parameters from [LLM initialization](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py#L14) and [SamplingParams](https://github.com/vllm-project/vllm/blob/main/vllm/sampling_params.py#L25).

## **Example config.yaml**

```yaml
modules:
  - module_type: vllm
    llm: mistralai/Mistral-7B-Instruct-v0.2
    temperature: [ 0.1, 1.0 ]
    max_tokens: 512
```

## Use in Multi-GPU

First, check out the [vllm docs](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) for details about parallel processing.

When you use multiple GPUs, set the `tensor_parallel_size` parameter in the YAML file.

```yaml
modules:
  - module_type: vllm
    llm: mistralai/Mistral-7B-Instruct-v0.2
    tensor_parallel_size: 2 # If you have two GPUs.
    temperature: [ 0.1, 1.0 ]
    max_tokens: 512
```

Also, you can use any parameter from `vllm.LLM`, `SamplingParams`, and `EngineArgs`.

This is supported from v0.2.16 onward, so you must upgrade to the latest version.

```{warning}
We are currently developing multi-GPU compatibility for AutoRAG.
Please wait for full compatibility with multi-GPU environments.
```
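
For reference, the module parameters above map directly onto vLLM's own Python API. Below is a minimal sketch of the equivalent direct vLLM calls, as a simplification for illustration only; the actual AutoRAG module additionally handles prompt batching, parameter sweeps, and output parsing internally, and the prompts shown are made up.

```python
# Minimal sketch: how the YAML parameters map onto vLLM's API.
# (Hypothetical simplification; not the actual AutoRAG implementation.)
from vllm import LLM, SamplingParams

# `llm` in the YAML maps to the model name passed to vllm.LLM.
# `tensor_parallel_size` is also an LLM initialization parameter.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    # tensor_parallel_size=2,  # uncomment when you have two GPUs
)

# `max_tokens`, `temperature`, and `top_p` map to vllm.SamplingParams.
sampling_params = SamplingParams(
    temperature=0.1,
    top_p=1.0,
    max_tokens=512,
)

# All prompts go into a single generate() call, so vLLM can batch them.
prompts = [
    "What is RAG?",
    "Explain tensor parallelism in one sentence.",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```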
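Passing all prompts to one `generate()` call lets vLLM apply continuous batching and PagedAttention across the whole batch, which is the main reason the standalone module is so much faster than sending prompts one at a time through LlamaIndex.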