Configure Vector DB

Overview

AutoRAG supports connections to multiple vector databases. This document outlines how to configure them.

Supported Vector Databases

  • Chroma

  • Milvus

Usage

Configure your Vector DBs in the YAML file as shown below. For more detailed information on YAML file syntax, please refer to the corresponding documentation.

vectordb:
  - name: openai_embed_3_small
    db_type: chroma
    client_type: persistent
    embedding_model: openai_embed_3_small
    collection_name: openai_embed_3_small
    path: ${PROJECT_DIR}/resources/chroma
  - name: openai_embed_3_large
    db_type: milvus
    embedding_model: openai_embed_3_large
    collection_name: openai_embed_3_large
    uri: ${MILVUS_URI}
    token: ${MILVUS_TOKEN}
    embedding_batch: 50

In this example, we’ve configured two vector databases:

  1. The first uses the openai_embed_3_small embedding model with Chroma. Chroma with a persistent client provides a simple, local vector database.

  2. The second uses Milvus with the openai_embed_3_large embedding model. By providing the appropriate URI and token, you can access a remote Milvus instance.

For specific parameters and detailed configuration options, please refer to the respective documentation for each vector database.

Full Ingest Option

The full_ingest option is crucial when using vector databases. By default, it’s set to True. However, if your corpus is particularly large, it’s recommended to set it to False.

When full_ingest is True, the system verifies that all IDs in the corpus.parquet file are properly stored in the vector database. This can be time-consuming and expensive for large datasets.

If set to False, the system only checks that the IDs referenced in retrieval_gt in the QA data exist in each vector DB. This check cannot be disabled: the QA data is typically much smaller than the corpus, and its IDs are essential for a successful trial in AutoRAG.

To set full_ingest to False, use one of the following methods:

Command Line

autorag evaluate --config your/path/to/default_config.yaml --qa_data_path your/path/to/qa.parquet --corpus_data_path your/path/to/corpus.parquet --project_dir ./your/project/directory --full_ingest False

Python Code

from autorag.evaluator import Evaluator

evaluator = Evaluator(qa_data_path='your/path/to/qa.parquet', corpus_data_path='your/path/to/corpus.parquet',
                      project_dir='your/path/to/project_directory')
evaluator.start_trial('your/path/to/config.yaml', full_ingest=False)

Default Options

AutoRAG provides a default Chroma vector database configuration if none is specified in the YAML file. In this case:

  • Chroma will store data in the project_dir/resources/chroma folder

  • The OpenAI text-embedding-ada-002 model will be used for embeddings
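In effect, the default behaves roughly like the explicit entry below (a sketch based on the description above; the embedding_model alias and collection_name shown here are assumptions, not guaranteed values):

vectordb:
  - name: default
    db_type: chroma
    client_type: persistent
    embedding_model: openai  # assumed alias for text-embedding-ada-002
    collection_name: openai  # assumed default collection name
    path: ${PROJECT_DIR}/resources/chroma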

You can use this default configuration by specifying it in the vectordb module of your YAML file:

modules:
  - module_type: vectordb
    vectordb: default

This default option allows for quick setup and experimentation without the need for extensive configuration. However, for production use or specific requirements, it’s recommended to explicitly configure your vector database settings.

Additional Considerations

  • If you run into problems during the embedding phase, consider lowering the embedding_batch parameter.

  • Always ensure that your environment variables (like ${MILVUS_URI} and ${MILVUS_TOKEN}) are properly set when using remote databases.
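For example, when connecting to a remote Milvus instance, the credentials referenced in the YAML must be present in the environment before starting a trial. A minimal sketch (the endpoint and token values below are placeholders):

export MILVUS_URI="https://your-milvus-endpoint:19530"
export MILVUS_TOKEN="your-milvus-token"
autorag evaluate --config your/path/to/default_config.yaml --qa_data_path your/path/to/qa.parquet --corpus_data_path your/path/to/corpus.parquet --project_dir ./your/project/directory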