---
myst:
  html_meta:
    title: AutoRAG - vLLM
    description: Use vLLM in AutoRAG. Highly optimized for AutoRAG when you use a local model on GPU.
    keywords: AutoRAG,RAG,LLM,generator,vLLM,LLM inference, AutoRAG multi gpu
---
# vllm

The `vllm` module is a generator that uses [vllm](https://blog.vllm.ai/2023/06/20/vllm.html).

## Why use the vllm module?

`vllm` can generate new text really fast. Its speed is more than 10x that of the Hugging Face Transformers library.
You can use a `vllm` model with the [llama_index_llm module](./llama_index_llm.md), but it is really slow because LlamaIndex is not optimized for processing many prompts at once.
So we decided to make a standalone module for vllm for faster generation speed.

## **Module Parameters**

- **llm**: Type your model name here. For example, `facebook/opt-125m` or `mistralai/Mistral-7B-Instruct-v0.2`.
- **max_tokens**: The maximum number of tokens to generate.
- **temperature**: The sampling temperature. Higher temperature means more randomness.
- **top_p**: Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
- Plus all parameters from [LLM initialization](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py#L14) and [SamplingParams](https://github.com/vllm-project/vllm/blob/main/vllm/sampling_params.py#L25).

## **Example config.yaml**

```yaml
modules:
  - module_type: vllm
    llm: mistralai/Mistral-7B-Instruct-v0.2
    temperature: [ 0.1, 1.0 ]
    max_tokens: 512
```

## Use in Multi-GPU

First, check out the [vllm docs](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) for details about parallel processing.

When you use multiple GPUs, set the `tensor_parallel_size` parameter in the YAML file.

```yaml
modules:
  - module_type: vllm
    llm: mistralai/Mistral-7B-Instruct-v0.2
    tensor_parallel_size: 2 # If you have two GPUs.
    temperature: [ 0.1, 1.0 ]
    max_tokens: 512
```

Also, you can use any parameter from `vllm.LLM`, `SamplingParams`, and `EngineArgs`.

This is supported from v0.2.16 onward, so you must upgrade to the latest version.

```{warning}
We are currently developing multi-GPU compatibility for AutoRAG.
Please wait for full compatibility with multi-GPU environments.
```
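
For reference, the module parameters above map directly onto vLLM's own Python API. Below is a minimal sketch of the equivalent direct vLLM calls, as a simplification for illustration only; the actual AutoRAG module additionally handles prompt batching, parameter sweeps, and output parsing internally, and the prompts shown are made up.

```python
# Minimal sketch: how the YAML parameters map onto vLLM's API.
# (Hypothetical simplification; not the actual AutoRAG implementation.)
from vllm import LLM, SamplingParams

# `llm` in the YAML maps to the model name passed to vllm.LLM.
# `tensor_parallel_size` is also an LLM initialization parameter.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    # tensor_parallel_size=2,  # uncomment when you have two GPUs
)

# `max_tokens`, `temperature`, and `top_p` map to vllm.SamplingParams.
sampling_params = SamplingParams(
    temperature=0.1,
    top_p=1.0,
    max_tokens=512,
)

# All prompts go into a single generate() call, so vLLM can batch them.
prompts = [
    "What is RAG?",
    "Explain tensor parallelism in one sentence.",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```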
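Passing all prompts to one `generate()` call lets vLLM apply continuous batching and PagedAttention across the whole batch, which is the main reason the standalone module is so much faster than sending prompts one at a time through LlamaIndex.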