Evaluate your RAG¶
Did you optimize your RAG with AutoRAG? You might want to compare your original RAG with the optimized version to see how much it improved. You can easily evaluate your own RAG function using a decorator from AutoRAG. In other words, you can measure the retrieval or generation performance of a RAG system you have already built.
Preparation¶
Before starting, ensure you have prepared a qa.parquet file for evaluation.
See here to learn how to make a QA dataset.
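Before running an evaluation, it can help to confirm the dataset contains the columns used in the examples below. A minimal sketch:

```python
import pandas as pd

# Load the QA dataset and check that the columns used in this guide are present.
qa_df = pd.read_parquet("qa.parquet", engine="pyarrow")
print(qa_df.columns.tolist())  # expect at least "query", "retrieval_gt", and "generation_gt"
```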
Retrieval Evaluation¶
To compare the retrieval performance of your RAG with AutoRAG’s optimized version, follow these steps:
MetricInput Dataclass¶
Start by building a MetricInput dataclass.
This structure includes several fields, but for retrieval evaluation, only query and retrieval_gt are mandatory.
Fields in MetricInput (a minimal construction example follows the list):
1. query: The original query.
2. queries: Expanded queries (optional).
3. retrieval_gt_contents: Ground truth passages (optional).
4. retrieved_contents: Retrieved passages (optional).
5. retrieval_gt: Ground truth passage IDs.
6. retrieved_ids: Retrieved passage IDs (optional).
7. prompt: The prompt used for RAG generation (optional).
8. generated_texts: Generated answers by the RAG system (optional).
9. generation_gt: Ground truth answers (optional).
10. generated_log_probs: Log probabilities of generated answers (optional).
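For retrieval evaluation, a minimal construction looks like this. The query text and passage IDs are placeholders, and retrieval_gt is assumed to use the nested-list format from the QA dataset:

```python
from autorag.schema.metricinput import MetricInput

# Minimal MetricInput for retrieval evaluation: only query and retrieval_gt are set.
# The IDs are placeholders; retrieval_gt groups ground-truth passage IDs.
metric_input = MetricInput(
    query="What does AutoRAG optimize?",
    retrieval_gt=[["passage-id-1", "passage-id-7"]],
)
```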
Using evaluate_retrieval¶
You can use the evaluate_retrieval decorator to measure performance. The decorator requires:
1. A list of metric_inputs.
2. The names of the metrics to evaluate.
Your custom retrieval function should return the following (a sketch follows the list):
1. retrieved_contents: A list of retrieved passage contents.
2. retrieved_ids: A list of retrieved passage IDs.
3. retrieve_scores: A list of similarity scores.
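For example, a retrieval function wrapping your own index might collect one list per query. This is only a sketch; search_index is a hypothetical helper for whatever vector store you use:

```python
def custom_retrieval(queries):
    retrieved_contents, retrieved_ids, retrieve_scores = [], [], []
    for query in queries:
        # search_index is a hypothetical helper around your own vector store;
        # assume it returns (contents, ids, scores) for the top-k passages.
        contents, ids, scores = search_index(query, top_k=5)
        retrieved_contents.append(contents)
        retrieved_ids.append(ids)
        retrieve_scores.append(scores)
    return retrieved_contents, retrieved_ids, retrieve_scores
```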
Important: Score Alignment¶
To ensure accurate performance comparisons, you need to adjust the similarity scores as follows:
| Distance Metric   | Adjusted Score                  |
|-------------------|---------------------------------|
| Cosine Similarity | Use the cosine similarity value |
| L2 Distance       | 1 - L2 distance                 |
| Inner Product     | Use the inner product value     |
Avoid using rank-aware metrics (e.g., MRR, NDCG, MAP) if you're uncertain about the correctness of your similarity scores.
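If your index returns L2 distances, for instance, you can convert them before returning retrieve_scores. This is a hypothetical helper, not part of AutoRAG:

```python
# Hypothetical helper: align raw scores with the table above.
def to_retrieve_score(raw_value: float, distance_metric: str) -> float:
    if distance_metric == "l2":
        return 1 - raw_value  # smaller L2 distance -> higher score
    # Cosine similarity and inner product values are used as-is.
    return raw_value
```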
Example Code¶
```python
import pandas as pd

from autorag.schema.metricinput import MetricInput
from autorag.evaluation import evaluate_retrieval

# Load QA dataset
qa_df = pd.read_parquet("qa.parquet", engine="pyarrow")

# Prepare MetricInput list
metric_inputs = [
    MetricInput(query=row["query"], retrieval_gt=row["retrieval_gt"])
    for _, row in qa_df.iterrows()
]

# Define custom retrieval function with decorator
@evaluate_retrieval(
    metric_inputs=metric_inputs,
    metrics=["retrieval_f1", "retrieval_recall", "retrieval_precision",
             "retrieval_ndcg", "retrieval_map", "retrieval_mrr"]
)
def custom_retrieval(queries):
    # Implement your retrieval logic here and return
    # retrieved_contents, retrieved_ids, and retrieve_scores as lists.
    return retrieved_contents, retrieved_ids, retrieve_scores

# Evaluate retrieval performance
retrieval_result_df = custom_retrieval(qa_df["query"].tolist())
```
Now you can see the results in the pandas DataFrame retrieval_result_df.
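For example, you can average the metric columns for a quick summary, assuming the decorator appends one result column per metric name:

```python
# Summarize retrieval performance across the QA dataset.
metric_names = ["retrieval_f1", "retrieval_recall", "retrieval_precision",
                "retrieval_ndcg", "retrieval_map", "retrieval_mrr"]
print(retrieval_result_df[metric_names].mean())
```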
Generation Evaluation¶
The process for evaluating the performance of RAG-generated answers is similar to retrieval evaluation.
MetricInput for Generation¶
For generation evaluation, the required fields are:
• query: The original query.
• generation_gt: Ground truth answers.
Using evaluate_generation¶
As with retrieval, pass the list of metric_inputs and the metric names to the evaluate_generation decorator. Your custom generation function must return:
1. generated_texts: A list of generated answers.
2. generated_tokens: A dummy list of tokens, matching the length of generated_texts.
3. generated_log_probs: A dummy list of log probabilities, matching the length of generated_texts.
Example Code¶
```python
import pandas as pd

from autorag.schema.metricinput import MetricInput
from autorag.evaluation import evaluate_generation

# Load QA dataset
qa_df = pd.read_parquet("qa.parquet", engine="pyarrow")

# Prepare MetricInput list
metric_inputs = [
    MetricInput(query=row["query"], generation_gt=row["generation_gt"])
    for _, row in qa_df.iterrows()
]

# Define custom generation function with decorator
@evaluate_generation(
    metric_inputs=metric_inputs,
    metrics=["bleu", "meteor", "rouge"]
)
def custom_generation(queries):
    # Implement your generation logic here.
    # The token and log-prob lists are dummies that match the length of generated_texts.
    return generated_texts, [[1, 30]] * len(generated_texts), [[-1, -1.3]] * len(generated_texts)

# Evaluate generation performance
generation_result_df = custom_generation(qa_df["query"].tolist())
```
Advanced Configuration¶
You can configure metrics using a dictionary. For example, if using semantic similarity (sem_score), specify additional parameters like the embedding model:
```python
@evaluate_generation(
    metric_inputs=metric_inputs,
    metrics=[
        {"metric_name": "sem_score", "embedding_model": "openai_embed_3_small"},
        {"metric_name": "bleu"}
    ]
)
```
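The decorator configured this way wraps the same kind of generation function as before. A sketch, where my_rag_answer stands in for your own RAG call:

```python
def custom_generation(queries):
    # Same contract as before: generated texts plus dummy token and log-prob lists.
    generated_texts = [my_rag_answer(query) for query in queries]  # my_rag_answer is hypothetical
    return generated_texts, [[1, 30]] * len(generated_texts), [[-1, -1.3]] * len(generated_texts)

generation_result_df = custom_generation(qa_df["query"].tolist())
```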
By following these steps, you can effectively compare and evaluate your RAG system against the optimized AutoRAG pipeline.