Evaluate your RAG

Did you optimize your RAG with AutoRAG? You might want to compare your original RAG with the optimized one to see how much it improved. You can easily evaluate your own RAG functions using decorators from AutoRAG. In other words, you can measure retrieval or generation performance on a RAG system you have already built.

Preparation

Before starting, make sure you have prepared a qa.parquet file for evaluation. See here to learn how to make a QA dataset.
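
If you want to sanity-check the file before running an evaluation, a quick look at its columns is enough. The examples in this guide assume it contains at least query, retrieval_gt, and generation_gt columns:

import pandas as pd

# Load the evaluation dataset and confirm the columns the examples below rely on.
qa_df = pd.read_parquet("qa.parquet", engine="pyarrow")
print(qa_df.columns.tolist())  # expect at least: query, retrieval_gt, generation_gt
print(qa_df.head())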

Retrieval Evaluation

To compare the retrieval performance of your RAG with AutoRAG’s optimized version, follow these steps:

MetricInput Dataclass

Start by building MetricInput instances. The dataclass includes several fields, but for retrieval evaluation only query and retrieval_gt are mandatory; a minimal construction example follows the field list below.

Fields in MetricInput:

1.	query: The original query.
2.	queries: Expanded queries (optional).
3.	retrieval_gt_contents: Ground truth passages (optional).
4.	retrieved_contents: Retrieved passages (optional).
5.	retrieval_gt: Ground truth passage IDs.
6.	retrieved_ids: Retrieved passage IDs (optional).
7.	prompt: The prompt used for RAG generation (optional).
8.	generated_texts: Generated answers by the RAG system (optional).
9.	generation_gt: Ground truth answers (optional).
10.	generated_log_probs: Log probabilities of generated answers (optional).
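
For reference, a single retrieval-evaluation input can be built like the minimal sketch below. The query and passage ID are placeholders, and the nested list for retrieval_gt follows the ground-truth format stored in the QA dataset:

from autorag.schema.metricinput import MetricInput

# Only query and retrieval_gt are required for retrieval evaluation.
metric_input = MetricInput(
    query="What is AutoRAG?",                   # placeholder query
    retrieval_gt=[["passage-id-placeholder"]],  # placeholder ground-truth passage IDs
)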

Using evaluate_retrieval

You can use the evaluate_retrieval decorator to measure performance. The decorator requires:

1.	A list of metric_inputs.
2.	The names of the metrics to evaluate.

Your custom retrieval function should return the following, each aligned with the input queries (see the sketch after this list):

1.	retrieved_contents: A list of retrieved passage contents per query.
2.	retrieved_ids: A list of retrieved passage IDs per query.
3.	retrieve_scores: A list of similarity scores per query.
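
For example, a retrieval function that wraps an existing vector store could follow the sketch below. my_vector_store and its search method are hypothetical stand-ins for whatever retriever you already run; the point is the shape of the return values, with one inner list per query:

def custom_retrieval(queries, top_k=5):
    retrieved_contents, retrieved_ids, retrieve_scores = [], [], []
    for query in queries:
        # Hypothetical call into your own retriever; assumed to return
        # (contents, ids, scores) for a single query.
        contents, ids, scores = my_vector_store.search(query, top_k=top_k)
        retrieved_contents.append(contents)
        retrieved_ids.append(ids)
        retrieve_scores.append(scores)
    return retrieved_contents, retrieved_ids, retrieve_scores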

Important: Score Alignment

To ensure accurate performance comparisons, you need to adjust the similarity scores as follows:

Distance Metric      Adjusted Score
Cosine Similarity    Use the Cosine Similarity value
L2 Distance          1 - L2 Distance
Inner Product        Use the Inner Product value

Avoid using rank-aware metrics (e.g., MRR, NDCG, MAP) if you are uncertain about the correctness of your similarity scores.
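
For instance, if your vector store reports raw L2 distances, the adjustment from the table is a one-line transformation (a minimal sketch; l2_distances is an illustrative variable holding one list of distances per query):

# Cosine similarity and inner product scores can be passed through unchanged.
# L2 distances must be flipped so that a higher score means a closer match.
retrieve_scores = [[1 - distance for distance in distances] for distances in l2_distances]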

Example Code

import pandas as pd
from autorag.schema.metricinput import MetricInput
from autorag.evaluation import evaluate_retrieval

qa_df = pd.read_parquet("qa.parquet", engine="pyarrow")
# Prepare MetricInput list (query and retrieval_gt are enough for retrieval metrics)
metric_inputs = [
    MetricInput(query=row["query"], retrieval_gt=row["retrieval_gt"])
    for _, row in qa_df.iterrows()
]

@evaluate_retrieval(
    metric_inputs=metric_inputs,
    metrics=["retrieval_f1", "retrieval_recall", "retrieval_precision",
             "retrieval_ndcg", "retrieval_map", "retrieval_mrr"]
)
def custom_retrieval(queries):
    # Implement your retrieval logic here.
    # Return retrieved_contents, retrieved_ids, and retrieve_scores, each as a list.
    return retrieved_contents, retrieved_ids, retrieve_scores

retrieval_result_df = custom_retrieval(qa_df["query"].tolist())

You can now inspect the results in the pandas DataFrame retrieval_result_df.
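
For example, assuming the decorator appends one column per metric, named after the metric strings passed above, you can summarize the run like this:

# Average each retrieval metric over all queries
# (column names are assumed to match the metric names passed to the decorator).
metric_columns = ["retrieval_f1", "retrieval_recall", "retrieval_precision",
                  "retrieval_ndcg", "retrieval_map", "retrieval_mrr"]
print(retrieval_result_df[metric_columns].mean())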

Generation Evaluation

Evaluating the performance of RAG-generated answers follows a process similar to retrieval evaluation.

MetricInput for Generation

For generation evaluation, the required fields are:

•	query: The original query.
•	generation_gt: Ground truth answers.

Using evaluate_generation

The custom generation function must return:

1.	generated_texts: A list of generated answers.
2.	generated_tokens: A dummy list of tokens, matching the length of generated_texts.
3.	generated_log_probs: A dummy list of log probabilities, matching the length of generated_texts.

Example Code

import pandas as pd
from autorag.schema.metricinput import MetricInput
from autorag.evaluation import evaluate_generation

# Load QA dataset
qa_df = pd.read_parquet("qa.parquet", engine="pyarrow")

# Prepare MetricInput list
metric_inputs = [
    MetricInput(query=row["query"], generation_gt=row["generation_gt"])
    for _, row in qa_df.iterrows()
]

# Define custom generation function with decorator
@evaluate_generation(
    metric_inputs=metric_inputs,
    metrics=["bleu", "meteor", "rouge"]
)
def custom_generation(queries):
    # Implement your generation logic here.
    # The token and log-probability lists below are dummies;
    # only their length must match generated_texts.
    return generated_texts, [[1, 30]] * len(generated_texts), [[-1, -1.3]] * len(generated_texts)

# Evaluate generation performance
generation_result_df = custom_generation(qa_df["query"].tolist())
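
To make the placeholder concrete, the body of custom_generation might call your own RAG pipeline or an LLM directly. The sketch below assumes the openai package and a gpt-4o-mini model purely for illustration; the dummy token and log-probability lists stay the same as in the example above:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def generate_answers(queries):
    generated_texts = []
    for query in queries:
        # Replace this call with your actual RAG pipeline (retrieval + prompting + LLM).
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": query}],
        )
        generated_texts.append(response.choices[0].message.content)
    return generated_texts

Inside custom_generation, you would call generate_answers(queries) and return its output together with the dummy token and log-probability lists.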

Advanced Configuration

You can configure metrics using a dictionary. For example, if using semantic similarity (sem_score), specify additional parameters like the embedding model:

@evaluate_generation(
    metric_inputs=metric_inputs,
    metrics=[
        {"metric_name": "sem_score", "embedding_model": "openai_embed_3_small"},
        {"metric_name": "bleu"}
    ]
)
def custom_generation(queries):
    # Same function body as before; only the metric configuration changes.
    ...

By following these steps, you can effectively compare and evaluate your RAG system against the optimized AutoRAG pipeline.