Start creating your own evaluation data¶
Overview¶
For the evaluation of RAG, we need data, but in most cases we have little or no satisfactory data.
However, since the advent of LLMs, creating synthetic data has become one of the good solutions to this problem.
The following guide covers how to use an LLM to create data in a form that AutoRAG can use.
AutoRAG aims to work with Python's 'primitive data types' for scalability and convenience.
Therefore, to use AutoRAG, you need to convert your raw data into corpus data and qa data in our data format.
Make corpus data from raw documents¶
Load your raw data into texts with loaders such as LlamaIndex, LangChain, etc.
Chunk the texts into passages using LangChain, LlamaIndex, etc.
Convert the passages into corpus data using the converter functions. There are converter functions for the LlamaIndex Document and TextNode objects and the LangChain Document object, which are llama_document_to_parquet, llama_text_node_to_parquet, and langchain_documents_to_parquet.
Use LlamaIndex
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import TokenTextSplitter
from autorag.data.legacy.corpus import llama_text_node_to_parquet
documents = SimpleDirectoryReader('your_dir_path').load_data()
nodes = TokenTextSplitter(chunk_size=512, chunk_overlap=128).get_nodes_from_documents(documents=documents)
corpus_df = llama_text_node_to_parquet(nodes, 'path/to/corpus.parquet')
Use LangChain
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from autorag.data.legacy.corpus import langchain_documents_to_parquet
documents = DirectoryLoader('your_dir_path', glob='**/*.md').load()
documents = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=128).split_documents(documents)
corpus_df = langchain_documents_to_parquet(documents, 'path/to/corpus.parquet')
Tip
The format for corpus data can be found in the corpus data format documentation.
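If you want to sanity-check the converted file, you can load it back with pandas. This is a minimal sketch; the exact column set is described on the corpus data format page, and the doc_id, contents, and metadata columns are assumed here.
import pandas as pd
corpus_df = pd.read_parquet('path/to/corpus.parquet')
print(corpus_df.columns.tolist())  # expected to include 'doc_id', 'contents', and 'metadata'
print(corpus_df.head())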
Make qa data from corpus data¶
Tip
The format for qa data can be found in the qa data format documentation.
import pandas as pd
from llama_index.llms.openai import OpenAI
from autorag.data.legacy.qacreation import generate_qa_llama_index, make_single_content_qa
corpus_df = pd.read_parquet('path/to/corpus.parquet')
llm = OpenAI(model='gpt-3.5-turbo', temperature=1.0)
qa_df = make_single_content_qa(corpus_df, 50, generate_qa_llama_index, llm=llm, question_num_per_content=1,
output_filepath='path/to/qa.parquet', cache_batch=64)
generate_qa_llama_index is a function designed to generate questions and their generation_gt for each content.
You can set the number of questions per content by changing the question_num_per_content parameter.
The make_single_content_qa function is designed to generate the qa.parquet file using the input function.
It generates 'single content' qa data, also known as 'single-hop' or 'single-document' QA data,
which means only one passage is used to answer each question.
What is a passage?
A passage is a chunked unit of the raw data.
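After generation, you can inspect the resulting qa data before running an evaluation. This is a minimal sketch; the exact column set is described on the qa data format page, and the qid, query, retrieval_gt, and generation_gt columns are assumed here.
import pandas as pd
qa_df = pd.read_parquet('path/to/qa.parquet')
print(qa_df.columns.tolist())  # expected to include 'qid', 'query', 'retrieval_gt', and 'generation_gt'
print(qa_df[['query', 'generation_gt']].head())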
Auto-save feature
Since AutoRAG v0.2.9, an auto-save feature has been added!
Now you don't have to worry about something going wrong during data generation.
The data is saved automatically to the given output_filepath.
You can set how often the result is saved to the file by adjusting the cache_batch parameter. The default is 32.
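If generation is interrupted, you can check how far it got by loading the partially saved file. This is a plain pandas sketch, not an AutoRAG API.
import pandas as pd
partial_qa_df = pd.read_parquet('path/to/qa.parquet')  # the file written by the auto-save feature
print(f'{len(partial_qa_df)} QA rows were saved before the interruption')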
Use custom prompt¶
You can use a custom prompt to generate qa data. The prompt must contain two placeholders:
{{text}}: The content string
{{num_questions}}: The number of questions to generate
import pandas as pd
from llama_index.llms.openai import OpenAI
from autorag.data.legacy.qacreation import generate_qa_llama_index, make_single_content_qa
prompt = """
Generate question and answer pairs for the given passage.
Passage:
{{text}}
Number of questions to generate: {{num_questions}}
Example:
[Q]: What is this?
[A]: This is a sample question.
Result:
"""
corpus_df = pd.read_parquet('path/to/corpus.parquet')
llm = OpenAI(model='gpt-3.5-turbo', temperature=1.0)
qa_df = make_single_content_qa(corpus_df, content_size=50, qa_creation_func=generate_qa_llama_index,
llm=llm, prompt=prompt, question_num_per_content=1)
Use multiple prompts¶
If you want to generate different types of question and answer pairs, you can use multiple prompts. Currently, we support distributing multiple prompts randomly based on the ratio of each prompt. This means a prompt is selected for each passage according to its ratio.
For this, you must provide a dictionary whose keys are the prompt text file paths and whose values are the ratios of the prompts.
import pandas as pd
from llama_index.llms.openai import OpenAI
from autorag.data.legacy.qacreation import generate_qa_llama_index_by_ratio, make_single_content_qa
ratio_dict = {
'prompt1.txt': 1,
'prompt2.txt': 2,
'prompt3.txt': 3
}
corpus_df = pd.read_parquet('path/to/corpus.parquet')
llm = OpenAI(model='gpt-3.5-turbo', temperature=1.0)
qa_df = make_single_content_qa(corpus_df, content_size=50, qa_creation_func=generate_qa_llama_index_by_ratio,
llm=llm, prompts_ratio=ratio_dict, question_num_per_content=1, batch=6)
Warning
Remember, all prompts must contain the placeholders {{text}} and {{num_questions}}.
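Before launching a long generation run, it can help to check that every prompt file actually contains both placeholders. This is a plain-Python sketch, not an AutoRAG API; the prompt file paths are the ones from the example above.
for prompt_path in ['prompt1.txt', 'prompt2.txt', 'prompt3.txt']:
    with open(prompt_path, 'r') as f:
        prompt_text = f.read()
    for placeholder in ('{{text}}', '{{num_questions}}'):
        if placeholder not in prompt_text:
            raise ValueError(f'{prompt_path} is missing the {placeholder} placeholder')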
When you have existing qa data¶
When you have existing qa data, you can use it with AutoRAG. Real users' qa data is valuable, so it is always better to use it before generating synthetic data.
However, you have to make retrieval_gt for the existing queries from your corpus data. Finding the retrieval_gt in the corpus is hard, but it must be accurate. To make it easier, we use an embedding model and a vector DB to find relevant passages. After that, you have to verify that each retrieval_gt is actually relevant; if it is not, remove it from the dataset.
This feature is available both when you only have queries ready and when you have both queries and generation_gt ready.
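When you reach that review step (after one of the examples below has produced qa.parquet), removing the rows you rejected is ordinary DataFrame filtering. This is a minimal sketch; the 'qid' column and the list of rejected ids are assumptions for illustration.
import pandas as pd
qa_df = pd.read_parquet('path/to/qa.parquet')
rejected_qids = []  # fill with the qids you marked as irrelevant during manual review
qa_df = qa_df[~qa_df['qid'].isin(rejected_qids)]
qa_df.to_parquet('path/to/qa.parquet')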
If you only have query data:¶
First, get retrieval_gt for each existing query, then put the query and retrieval_gt into the LLM to generate generation_gt.
The answer_creation_func and llm parameters are necessary, and existing_qa_df must have a 'query' column.
import pandas as pd
from llama_index.llms.openai import OpenAI
from autorag.data.legacy.qacreation import make_qa_with_existing_qa, generate_answers
corpus_df = pd.read_parquet('path/to/corpus.parquet')
existing_qa_df = pd.read_parquet('path/to/existing_qa.parquet') # It has to contain 'query' column
llm = OpenAI(model='gpt-3.5-turbo', temperature=1.0)
qa_df = make_qa_with_existing_qa(corpus_df, existing_qa_df, content_size=50,
answer_creation_func=generate_answers,
llm=llm, output_filepath='path/to/qa.parquet', cache_batch=64,
embedding_model='openai_embed_3_large', top_k=5)
You can also use a PersistentClient to save the corpus embeddings locally.
import pandas as pd
import chromadb
from llama_index.llms.openai import OpenAI
from autorag.data.legacy.qacreation import make_qa_with_existing_qa, generate_answers
client = chromadb.PersistentClient('path/to/chromadb')
collection = client.get_or_create_collection('auto-rag')
corpus_df = pd.read_parquet('path/to/corpus.parquet')
existing_qa_df = pd.read_parquet('path/to/existing_qa.parquet') # It has to contain 'query' column
llm = OpenAI(model='gpt-3.5-turbo', temperature=1.0)
qa_df = make_qa_with_existing_qa(corpus_df, existing_qa_df, content_size=50,
answer_creation_func=generate_answers, collection=collection,
llm=llm, output_filepath='path/to/qa.parquet', cache_batch=64,
embedding_model='openai_embed_3_large', top_k=5)
If you have both query and generation_gt:¶
Use the query and generation_gt as they are, and just find and add retrieval_gt.
The answer_creation_func and llm parameters are not necessary, but the exist_gen_gt=True parameter is required. existing_qa_df must have 'query' and 'generation_gt' columns, and the generation_gt (per query) must be in the form of List[str].
import pandas as pd
from llama_index.llms.openai import OpenAI
from autorag.data.legacy.qacreation import make_qa_with_existing_qa
corpus_df = pd.read_parquet('path/to/corpus.parquet')
existing_qa_df = pd.read_parquet(
'path/to/existing_qa.parquet') # It has to contain 'query' and 'generation_gt' columns.
llm = OpenAI(model='gpt-3.5-turbo', temperature=1.0)
qa_df = make_qa_with_existing_qa(corpus_df, existing_qa_df, content_size=50, exist_gen_gt=True,
output_filepath='path/to/qa.parquet', cache_batch=64,
embedding_model='openai_embed_3_large', top_k=5)
You can also use a PersistentClient to save the corpus embeddings locally.
import pandas as pd
import chromadb
from llama_index.llms.openai import OpenAI
from autorag.data.legacy.qacreation import make_qa_with_existing_qa
client = chromadb.PersistentClient('path/to/chromadb')
collection = client.get_or_create_collection('auto-rag')
corpus_df = pd.read_parquet('path/to/corpus.parquet')
existing_qa_df = pd.read_parquet(
'path/to/existing_qa.parquet') # It has to contain 'query' and 'generation_gt' columns.
llm = OpenAI(model='gpt-3.5-turbo', temperature=1.0)
qa_df = make_qa_with_existing_qa(corpus_df, existing_qa_df, content_size=50,
exist_gen_gt=True, collection=collection,
output_filepath='path/to/qa.parquet', cache_batch=64,
embedding_model='openai_embed_3_large', top_k=5)