autorag.data.qacreation package

Submodules

autorag.data.qacreation.base module

autorag.data.qacreation.base.make_qa_with_existing_qa(corpus_df: DataFrame, existing_query_df: DataFrame, content_size: int, answer_creation_func: Callable | None = None, exist_gen_gt: bool | None = False, output_filepath: str | None = None, embedding_model: str = 'openai_embed_3_large', collection: Collection | None = None, upsert: bool = False, random_state: int = 42, cache_batch: int = 32, top_k: int = 3, **kwargs) -> DataFrame

Make a single-hop QA dataset from existing queries. Answers come either from the existing generation_gt column or from the given answer_creation_func.

Parameters:
  • corpus_df – The corpus dataframe to make QA dataset from.

  • existing_query_df – Dataframe containing existing queries to use for QA pair creation.

  • content_size – The number of contents to generate the QA dataset for.

  • answer_creation_func – Optional function that creates an answer from the input query. Required if exist_gen_gt is False.

  • exist_gen_gt – Optional boolean that selects whether to use the existing generation_gt. If True, existing_query_df must have a ‘generation_gt’ column. If False, answer_creation_func must be given.

  • output_filepath – Optional filepath to save the parquet file.

  • embedding_model – The embedding model to use for vectorization. The default is ‘openai_embed_3_large’. You can add your own embedding model to autorag.embedding_models; for instructions, see https://docs.auto-rag.com/local_model.html

  • collection – The chromadb collection to use as the vector DB. You can create any chromadb collection and use it here. If you have already ingested corpus_df into the collection, the embedding process is not repeated. The default is None, in which case a temporary collection is created.

  • upsert – If True, the function overwrites the output file if it already exists.

  • random_state – The random state for sampling corpus from the given corpus_df.

  • cache_batch – The number of batches to use for caching the generated QA dataset.

  • top_k – The number of sources the model refers to. Default is 3.

  • kwargs – The keyword arguments for answer_creation_func.

Returns:

QA dataset dataframe.
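The relationship between exist_gen_gt and answer_creation_func described above can be sketched as a small argument check. This is only an illustration under stated assumptions, not AutoRAG's implementation: `resolve_answer_source` is a hypothetical helper, and a plain list of column names stands in for existing_query_df.

```python
from typing import Callable, List, Optional


def resolve_answer_source(
    query_columns: List[str],
    exist_gen_gt: bool = False,
    answer_creation_func: Optional[Callable] = None,
) -> str:
    """Hypothetical helper: decide where the answers would come from."""
    if exist_gen_gt:
        # Existing answers are reused, so the column has to be present.
        if "generation_gt" not in query_columns:
            raise ValueError(
                "existing_query_df must have a 'generation_gt' column "
                "when exist_gen_gt is True."
            )
        return "existing"
    # No existing answers: a function that creates them is required.
    if answer_creation_func is None:
        raise ValueError(
            "answer_creation_func must be given when exist_gen_gt is False."
        )
    return "generated"
```

Running a check like this before any embedding or LLM call would surface an invalid argument combination early, before any cost is incurred.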

autorag.data.qacreation.base.make_single_content_qa(corpus_df: DataFrame, content_size: int, qa_creation_func: Callable, output_filepath: str | None = None, upsert: bool = False, random_state: int = 42, cache_batch: int = 32, **kwargs) -> DataFrame

Make a single-content (single-hop, single-document) QA dataset using the given qa_creation_func. In a single-content QA dataset, each query has exactly one retrieval ground truth. It is the most basic form of QA dataset.

Parameters:
  • corpus_df – The corpus dataframe to make QA dataset from.

  • content_size – This function will generate QA dataset for the given number of contents.

  • qa_creation_func – The function that creates QA pairs, for example generate_qa_llama_index or generate_qa_llama_index_by_ratio. The function must accept a contents parameter that takes the list of content strings.

  • output_filepath – Optional filepath to save the parquet file. If None, the function returns the processed data as a pd.DataFrame without saving it as parquet. The file directory must exist, and the file extension must be .parquet.

  • upsert – If True, the function overwrites the output file if it already exists. Default is False.

  • random_state – The random state for sampling corpus from the given corpus_df.

  • cache_batch – The number of batches to use for caching the generated QA dataset. Each time cache_batch items have been generated, the dataset is saved to the designated output_filepath. If cache_batch is too small, processing takes longer.

  • kwargs – The keyword arguments for qa_creation_func.

Returns:

QA dataset dataframe. You can save this as a parquet file to use in AutoRAG.
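The qa_creation_func contract and the cache_batch behaviour described above can be sketched without any LLM. In this sketch, `stub_qa_creation_func` and `generate_in_cache_batches` are hypothetical names, and the chunking loop is an assumption about how caching could work rather than AutoRAG's actual code.

```python
from typing import Callable, Dict, List, Tuple


def stub_qa_creation_func(contents: List[str], **kwargs) -> List[List[Dict]]:
    """Stand-in for an LLM-backed function: one QA pair per content."""
    return [
        [{"query": f"What is stated in: {text}?", "generation_gt": text}]
        for text in contents
    ]


def generate_in_cache_batches(
    contents: List[str],
    qa_creation_func: Callable[..., List[List[Dict]]],
    cache_batch: int = 32,
    **kwargs,
) -> Tuple[List[List[Dict]], List[int]]:
    """Call qa_creation_func on chunks of cache_batch contents.

    `checkpoints` records how many results exist after each chunk, which is
    the point where a pipeline could write the partial dataset to disk.
    """
    results: List[List[Dict]] = []
    checkpoints: List[int] = []
    for start in range(0, len(contents), cache_batch):
        chunk = contents[start:start + cache_batch]
        results.extend(qa_creation_func(contents=chunk, **kwargs))
        checkpoints.append(len(results))
    return results, checkpoints


contents = [f"passage text {i}" for i in range(5)]
qa, checkpoints = generate_in_cache_batches(
    contents, stub_qa_creation_func, cache_batch=2
)
```

A function with the same shape as the stub (a contents keyword parameter, a 2-d list of query and generation_gt dictionaries as output) is what the qa_creation_func parameter above expects.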

autorag.data.qacreation.llama_index module

async autorag.data.qacreation.llama_index.async_qa_gen_llama_index(content: str, llm: LLM, prompt: str, question_num: int = 1, max_retries: int = 3)

Generate a QA set from the given content using the LlamaIndex model. You must select the question type.

Parameters:
  • content – Content string

  • llm – Llama index model

  • prompt – The prompt to use for the qa generation. The prompt must include the following placeholders:
      - {{text}}: The content string
      - {{num_questions}}: The number of questions to generate

  • question_num – The number of questions to generate

  • max_retries – Maximum number of retries when the number of generated questions does not equal the target number.

Returns:

List of dictionaries containing the query and generation_gt
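The placeholder substitution and the max_retries behaviour described above can be sketched as follows. This is a hedged illustration, not the library's code: plain string replacement is an assumption about how the placeholders are filled, `FlakyGenerator` is a hypothetical stand-in for the LLM call, and returning the last result once the retries run out is also an assumption.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


def fill_prompt(prompt: str, content: str, question_num: int) -> str:
    """Substitute the two required placeholders (assumed plain replacement)."""
    return (
        prompt.replace("{{text}}", content)
        .replace("{{num_questions}}", str(question_num))
    )


async def generate_with_retries(
    generate: Callable[[str], Awaitable[List[Dict]]],
    prompt: str,
    content: str,
    question_num: int = 1,
    max_retries: int = 3,
) -> List[Dict]:
    """Call `generate` again while the number of QA pairs is off target."""
    filled = fill_prompt(prompt, content, question_num)
    result = await generate(filled)
    retries = 0
    while len(result) != question_num and retries < max_retries:
        result = await generate(filled)
        retries += 1
    # Assumption: after the last retry the most recent result is returned.
    return result


class FlakyGenerator:
    """Hypothetical LLM stand-in: returns too few pairs on the first call."""

    def __init__(self) -> None:
        self.calls = 0

    async def __call__(self, filled_prompt: str) -> List[Dict]:
        self.calls += 1
        count = 1 if self.calls == 1 else 2
        return [
            {"query": f"Q{i}", "generation_gt": f"A{i}"} for i in range(count)
        ]
```

With `question_num=2`, the stand-in generator above is called twice: once for the initial attempt and once for the single retry that produces the requested number of pairs.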

autorag.data.qacreation.llama_index.distribute_list_by_ratio(input_list, ratio) -> List[List[Any]]

autorag.data.qacreation.llama_index.generate_answers(llm: LLM, contents: List[str], queries: List[str], batch: int = 4) -> List[List[Dict]]

Generate qa sets from the list of contents using existing queries.

Parameters:
  • llm – Llama index model

  • contents – List of content strings.

  • queries – List of existing queries.

  • batch – The batch size to process asynchronously.

Returns:

2-d list of dictionaries containing the query and generation_gt.
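The batch parameter above controls how many items are processed asynchronously at a time. A minimal sketch of that pattern, assuming one passage per query and using a hypothetical `fake_answer` coroutine in place of a real LLM call, could look like this:

```python
import asyncio
from typing import Dict, List


async def fake_answer(passage: str, query: str) -> str:
    """Hypothetical stand-in for an LLM call."""
    await asyncio.sleep(0)
    return f"Answer to '{query}' based on: {passage}"


async def answer_in_batches(
    contents: List[str], queries: List[str], batch: int = 4
) -> List[List[Dict]]:
    """Answer queries concurrently, `batch` items at a time."""
    results: List[List[Dict]] = []
    for start in range(0, len(queries), batch):
        batch_contents = contents[start:start + batch]
        batch_queries = queries[start:start + batch]
        tasks = [
            fake_answer(passage, query)
            for passage, query in zip(batch_contents, batch_queries)
        ]
        # asyncio.gather returns results in the same order as the tasks.
        answers = await asyncio.gather(*tasks)
        for query, answer in zip(batch_queries, answers):
            results.append([{"query": query, "generation_gt": answer}])
    return results
```

The output keeps the 2-d shape described above: one inner list per input, each holding dictionaries with query and generation_gt. A smaller batch limits how many requests are in flight at once, at the cost of more sequential rounds.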

async autorag.data.qacreation.llama_index.generate_basic_answer(llm: LLM, passage_str: str, query: str) -> str

autorag.data.qacreation.llama_index.generate_qa_llama_index(llm: LLM, contents: List[str], prompt: str | None = None, question_num_per_content: int = 1, max_retries: int = 3, batch: int = 4) -> List[List[Dict]]

Generate a qa set from the list of contents. It uses a single prompt for all contents. If you want to use more than one prompt for generating qa, you can consider using generate_qa_llama_index_by_ratio.

Parameters:
  • llm – Llama index model

  • contents – List of content strings.

  • prompt – The prompt to use for the qa generation. The prompt must include the following placeholders:
      - {{text}}: The content string
      - {{num_questions}}: The number of questions to generate
    By default, the prompt is set to the default prompt for the question type.

  • question_num_per_content – Number of questions to generate for each content. Default is 1.

  • max_retries – The maximum number of retries when the number of generated questions does not equal the target number. Default is 3.

  • batch – The batch size to process asynchronously. Default is 4.

Returns:

2-d list of dictionaries containing the query and generation_gt.
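The 2-d return value can be awkward to inspect directly. The sketch below flattens it into one row per question, assuming that the i-th inner list belongs to the i-th content; `flatten_qa` and the `content_index` key are hypothetical names introduced only for this illustration.

```python
from typing import Dict, List


def flatten_qa(contents: List[str], qa_2d: List[List[Dict]]) -> List[Dict]:
    """Turn a 2-d QA list into flat rows, one row per generated question."""
    rows: List[Dict] = []
    for content_index, (content, qa_list) in enumerate(zip(contents, qa_2d)):
        for qa in qa_list:
            rows.append(
                {
                    "content_index": content_index,
                    "contents": content,
                    "query": qa["query"],
                    "generation_gt": qa["generation_gt"],
                }
            )
    return rows
```

With question_num_per_content set to 2 and three contents, a well-formed result would flatten into six rows.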

autorag.data.qacreation.llama_index.generate_qa_llama_index_by_ratio(llm: LLM, contents: List[str], prompts_ratio: Dict, question_num_per_content: int = 1, max_retries: int = 3, random_state: int = 42, batch: int = 4) -> List[List[Dict]]

Generate a qa set from the list of contents. You can set the ratio of the prompts that you want to use for generating qa. The contents are randomly distributed across the prompts in proportion to the given ratios.

Parameters:
  • llm – Llama index model

  • contents – List of content strings.

  • prompts_ratio – Dictionary of prompt paths and their ratios. Example: {"prompt/prompt1.txt": 0.5, "prompt/prompt2.txt": 0.5}. The values do not have to sum to 1. Each path must be an absolute path to an existing text file that contains a proper prompt. Each prompt must contain the following placeholders:
      - {{text}}: The content string
      - {{num_questions}}: The number of questions to generate

  • question_num_per_content – Number of questions to generate for each content. Default is 1.

  • max_retries – The maximum number of retries when the number of generated questions does not equal the target number. Default is 3.

  • random_state – Random seed. Default is 42.

  • batch – The batch size to process asynchronously. Default is 4.

Returns:

2-d list of dictionaries containing the query and generation_gt.
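Because the ratio values do not have to sum to 1, they need to be normalized before the contents are split. The following is one possible way to do that, not the library's distribute_list_by_ratio: `split_by_ratio` is a hypothetical helper that shuffles with a seed and then cuts the list at rounded cumulative boundaries.

```python
import random
from typing import Any, List, Sequence


def split_by_ratio(
    items: Sequence[Any], ratios: Sequence[float], random_state: int = 42
) -> List[List[Any]]:
    """Shuffle items with a seed, then split them in proportion to ratios."""
    shuffled = list(items)
    random.Random(random_state).shuffle(shuffled)

    total = sum(ratios)  # normalizing by the total allows any positive sum
    groups: List[List[Any]] = []
    start = 0
    cumulative = 0.0
    for ratio in ratios:
        cumulative += ratio
        # Cut at the rounded cumulative share so every item lands in a group.
        end = round(len(shuffled) * cumulative / total)
        groups.append(shuffled[start:end])
        start = end
    return groups
```

Using cumulative boundaries rather than rounding each group size independently guarantees that the group sizes add up to the number of items, and a fixed random_state makes the assignment reproducible.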

autorag.data.qacreation.llama_index.parse_output(result: str) -> List[Dict]

autorag.data.qacreation.llama_index.validate_llama_index_prompt(prompt: str) -> bool

Validate the prompt for the llama index model. The prompt must include the following placeholders:
  - {{text}}: The content string
  - {{num_questions}}: The number of questions to generate
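A check of this kind can be sketched with plain substring tests. This is an assumption about what such a validation involves, not the body of validate_llama_index_prompt; `has_required_placeholders` is a hypothetical name.

```python
REQUIRED_PLACEHOLDERS = ("{{text}}", "{{num_questions}}")


def has_required_placeholders(prompt: str) -> bool:
    """Return True only if both required placeholders occur in the prompt."""
    return all(placeholder in prompt for placeholder in REQUIRED_PLACEHOLDERS)


example_prompt = (
    "Read the passage below and write {{num_questions}} question(s) "
    "with answers.\n\nPassage:\n{{text}}\n"
)
```

Checking a custom prompt this way before generation starts avoids spending LLM calls on a prompt that never receives the content or the question count.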

autorag.data.qacreation.ragas module

autorag.data.qacreation.ragas.generate_qa_ragas(corpus_df: DataFrame, test_size: int, distributions: dict | None = None, generator_llm: BaseChatModel | None = None, critic_llm: BaseChatModel | None = None, embedding_model: Embeddings | None = None, **kwargs) -> DataFrame

QA dataset generation using RAGAS. Returns qa dataset dataframe.

Parameters:
  • corpus_df – Corpus dataframe.

  • test_size – Number of queries to generate.

  • distributions – Distribution of the question types. The default is simple: 0.5, multi_context: 0.4, and reasoning: 0.1. Each question type refers to a Ragas evolution type.

  • generator_llm – Generator language model from Langchain.

  • critic_llm – Critic language model from Langchain.

  • embedding_model – Embedding model from Langchain.

  • kwargs – Additional options to pass to the ‘generate_with_langchain_docs’ method. You can pass ‘with_debugging_logs’, ‘is_async’, ‘raise_exceptions’, and ‘run_config’.

Returns:

QA dataset dataframe.
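To get a feel for what the default distribution means for a given test_size, the arithmetic can be sketched as below. The string keys are placeholders for illustration only, since the description above says the question types refer to Ragas evolution types; `expected_counts` is a hypothetical helper, and requiring the weights to sum to 1 is an assumption of this sketch.

```python
import math
from typing import Dict

# Placeholder keys: a real distributions argument would refer to
# Ragas evolution types rather than plain strings.
default_distributions: Dict[str, float] = {
    "simple": 0.5,
    "multi_context": 0.4,
    "reasoning": 0.1,
}


def expected_counts(distributions: Dict[str, float], test_size: int) -> Dict[str, int]:
    """Approximate number of questions per type for a given test_size."""
    if not math.isclose(sum(distributions.values()), 1.0):
        # Assumption of this sketch: weights are expected to sum to 1.
        raise ValueError("Distribution weights should sum to 1.")
    return {
        name: round(weight * test_size)
        for name, weight in distributions.items()
    }
```

The counts are only an expectation; the number of questions of each type that a generator actually produces may differ.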

autorag.data.qacreation.simple module

autorag.data.qacreation.simple.generate_qa_row(llm: Model, corpus_data_row)

Sample code that generates a RAG dataset using an OpenAI chat model.

Parameters:
  • llm – The guidance model.

  • corpus_data_row – A row of the corpus data. It needs a “contents” column.

Returns:

A dict that has at least the “query” and “generation_gt” keys.
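The shape of a custom row function can be sketched as follows. To keep the example runnable without guidance or an API key, a plain callable that maps a prompt string to a response string stands in for the guidance model, which is an assumption of this sketch; `generate_row_with_stub` and the prompt wording are hypothetical.

```python
from typing import Callable, Dict


def generate_row_with_stub(
    llm: Callable[[str], str], corpus_data_row: Dict, **kwargs
) -> Dict:
    """Build one QA row from a corpus row that has a 'contents' field."""
    contents = corpus_data_row["contents"]
    # First ask for a question about the passage, then for its answer.
    query = llm(f"Write one question answerable from this passage:\n{contents}")
    answer = llm(f"Passage:\n{contents}\nQuestion: {query}\nAnswer:")
    return {"query": query, "generation_gt": answer}


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in model: returns the last line of the prompt."""
    return prompt.strip().splitlines()[-1]


row = generate_row_with_stub(echo_llm, {"contents": "AutoRAG builds QA datasets."})
```

Whatever model is used, the returned dict needs at least the query and generation_gt entries described above.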

autorag.data.qacreation.simple.generate_simple_qa_dataset(llm: Model, corpus_data: DataFrame, output_filepath: str, generate_row_function: Callable, **kwargs)

Convert corpus_data to a qa_dataset. The qa_dataset will be saved to output_filepath (file_dir/filename).

Parameters:
  • llm – guidance.models.Model

  • corpus_data – A pd.DataFrame that follows the basic corpus data structure.

  • output_filepath – The file directory must exist, and the filepath must not exist. The file extension must be .parquet.

  • generate_row_function – A function that takes (llm, corpus_data_row, kwargs) as input and returns a dict whose keys include “query” and “generation_gt”.

  • kwargs – Additional arguments to pass if generate_row_function requires them.

Returns:

The qa_dataset as a pd.DataFrame.
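The three output_filepath requirements listed above can be checked up front. The sketch below is one way to express them with the standard library; `check_output_filepath` and the chosen exception types are assumptions for illustration, not the function's actual validation code.

```python
import os
import tempfile


def check_output_filepath(output_filepath: str) -> None:
    """Raise if the path breaks any of the documented requirements."""
    file_dir = os.path.dirname(output_filepath) or "."
    if not os.path.isdir(file_dir):
        raise NotADirectoryError(f"Directory does not exist: {file_dir}")
    if not output_filepath.endswith(".parquet"):
        raise ValueError("The file extension must be .parquet.")
    if os.path.exists(output_filepath):
        raise FileExistsError(f"File already exists: {output_filepath}")


# Example: a fresh path inside an existing temporary directory passes.
with tempfile.TemporaryDirectory() as tmp_dir:
    check_output_filepath(os.path.join(tmp_dir, "qa.parquet"))
```

Validating the path before generation starts means a bad path is reported before any LLM calls are made.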

Module contents