# Evaluation data creation tutorial

## Overview

To evaluate RAG pipelines we need data, but in most cases we have little or no satisfactory data. Since the advent of LLMs, however, creating synthetic data has become one of the good solutions to this problem. The following guide covers how to use an LLM to create data in a form that AutoRAG can use.

---

![Data Creation](../../_static/data_creation.png)

## 1. Parse

You can make different parsing results from the raw data using a parsing YAML file.
A sample parsing YAML file looks like this.

```yaml
modules:
  - module_type: langchain_parse
    parse_method: [pdfminer, pdfplumber]
```

With this YAML file, you can get the parsed data from both pdfminer and pdfplumber.

You can execute this parsing YAML file by using the following code.

```python
from autorag.parser import Parser

filepaths = "./data/*.pdf"
parser = Parser(filepaths, "./parse_project_dir")
parser.start_parsing("./parsing.yaml")
```

Then you can check out the parsing result in the `./parse_project_dir` directory.

For more details about parsing, please refer to [this page](./parse/parse.md).

## 2. QA Creation

From the parsed results, you can select the best parsed data for AutoRAG.
After you have selected it, you can create QA data for AutoRAG.

The example is shown below; `initial_raw_df` is the selected raw data.

```python
from llama_index.llms.openai import OpenAI

from autorag.data.beta.filter.dontknow import dontknow_filter_rule_based
from autorag.data.beta.generation_gt.llama_index_gen_gt import (
    make_basic_gen_gt,
    make_concise_gen_gt,
)
from autorag.data.beta.query.llama_gen_query import factoid_query_gen
from autorag.data.beta.sample import random_single_hop
from autorag.data.beta.schema import Raw

initial_raw = Raw(initial_raw_df)
initial_corpus = initial_raw.chunk(
    "llama_index_chunk", chunk_method="token", chunk_size=128, chunk_overlap=5
)
llm = OpenAI()
initial_qa = (
    initial_corpus.sample(random_single_hop, n=3)
    .map(
        lambda df: df.reset_index(drop=True),
    )
    .make_retrieval_gt_contents()
    .batch_apply(
        factoid_query_gen,
        llm=llm,
    )
    .batch_apply(
        make_basic_gen_gt,
        llm=llm,
    )
    .batch_apply(
        make_concise_gen_gt,
        llm=llm,
    )
    .filter(
        dontknow_filter_rule_based,
        lang="en",
    )
)
initial_qa.to_parquet("./initial_qa.parquet", "./initial_corpus.parquet")
```

We recommend finding the optimal pipeline first from this initial data.
Check out the optimization tutorial [here](../../tutorial.md).

## 3. Chunking Optimization

After finding the initial optimal pipeline, it is time to optimize the chunking method.

First, you can create various chunking results from the parsed data.
The chunking YAML file looks like this.

```yaml
modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24
    add_file_name: english
  - module_type: llama_index_chunk
    chunk_method: [ SentenceWindow ]
    sentence_splitter: kiwi
    window_size: 3
    add_file_name: english
```

With this YAML file, you can get chunked data from the Token, Sentence, and SentenceWindow methods with different chunk sizes.

You can execute this chunking YAML file by using the following code.

```python
from autorag.chunker import Chunker

chunker = Chunker.from_parquet("./initial_raw.parquet", "./chunk_project_dir")
chunker.start_chunking("./chunking.yaml")
```

Then you can check out the chunking result in the `./chunk_project_dir` directory.

For more details about chunking, please refer to [this page](./chunk/chunk.md).
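If you want to inspect each chunked corpus before moving on, you can load the result parquet files with pandas. This is a minimal sketch, assuming the chunking run writes one parquet file per module combination under `./chunk_project_dir`; the exact file names and directory layout depend on your run.

```python
import glob

import pandas as pd

# Assumption: each chunking result is stored as a parquet file somewhere
# inside the chunk project directory. Adjust the glob pattern to your layout.
for path in sorted(glob.glob("./chunk_project_dir/**/*.parquet", recursive=True)):
    corpus_df = pd.read_parquet(path)
    print(f"{path}: {len(corpus_df)} chunks")
    print(corpus_df.head(2))  # peek at the chunked corpus rows
```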
## 4. QA - Corpus mapping

For chunking optimization, you can evaluate RAG performance with different corpus data.

You already have the optimal pipeline from the initial QA data, so you can use this pipeline to evaluate RAG performance on each new corpus.

Before that, you must update all QA data with the new corpus data, using the `update_corpus` method.

It is highly recommended to keep the initial `QA` instance. If you did not, you need to build the `QA` instance again from the initial raw (parsed) data and corpus data.

```python
from autorag.data.beta.schema import Raw, Corpus, QA

raw = Raw(initial_raw_df)
corpus = Corpus(initial_corpus_df, raw)
qa = QA(initial_qa_df, corpus)

new_qa = qa.update_corpus(Corpus(new_corpus_df, raw))
```

Now `new_qa` has new `retrieval_gt` data for the new corpus.

With the new corpus data and the new QA data, you can evaluate the RAG performance with different corpus data.
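For example, you can persist the remapped QA data together with the new corpus and feed it to the same pipeline evaluation you ran on the initial data. The sketch below assumes the `Evaluator` interface covered in the optimization tutorial; the file names, project directory, and YAML path are placeholders for your own setup.

```python
from autorag.evaluator import Evaluator

# Persist the remapped QA data together with the corpus it now points to.
new_qa.to_parquet("./new_qa.parquet", "./new_corpus.parquet")

# Re-run the previously found optimal pipeline against the new corpus.
# The YAML path and project directory below are placeholders.
evaluator = Evaluator(
    qa_data_path="./new_qa.parquet",
    corpus_data_path="./new_corpus.parquet",
    project_dir="./chunk_eval_project_dir",
)
evaluator.start_trial("./optimal_pipeline.yaml")
```

Repeating this for each chunked corpus lets you compare chunking methods on the same QA data and pick the best one.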