autorag.data.beta package

Subpackages

Submodules

autorag.data.beta.extract_evidence module

autorag.data.beta.sample module

autorag.data.beta.sample.random_single_hop(corpus_df: DataFrame, n: int, random_state: int = 42) DataFrame[source]
autorag.data.beta.sample.range_single_hop(corpus_df: DataFrame, idx_range: Iterable)[source]
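A minimal usage sketch of these samplers; the column layout of the toy corpus dataframe below is illustrative, not a documented requirement.

   import pandas as pd
   from autorag.data.beta.sample import random_single_hop, range_single_hop

   # Toy chunked-corpus dataframe; the exact column names are an assumption.
   corpus_df = pd.DataFrame({
       "doc_id": ["c0", "c1", "c2", "c3"],
       "contents": ["passage 0", "passage 1", "passage 2", "passage 3"],
   })

   sampled = random_single_hop(corpus_df, n=2, random_state=42)    # two random passages
   in_range = range_single_hop(corpus_df, idx_range=range(0, 2))   # passages at positional index 0 and 1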

autorag.data.beta.schema module

class autorag.data.beta.schema.Corpus(corpus_df: DataFrame | None = None, linked_raw: Raw | None = None)[source]

Bases: object

The Corpus class stores chunked passages. It can generate a QA set and is linked with a Raw instance.
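A minimal construction sketch, assuming you already have a parsed-document dataframe and a chunked-passage dataframe; the column names used here are illustrative.

   import pandas as pd
   from autorag.data.beta.schema import Raw, Corpus

   raw = Raw(pd.DataFrame({"raw_id": [0], "contents": ["full document text"]}))

   chunked_df = pd.DataFrame({              # hypothetical chunk layout
       "doc_id": ["c0", "c1"],
       "contents": ["first chunk", "second chunk"],
   })
   corpus = Corpus(corpus_df=chunked_df, linked_raw=raw)
   assert corpus.linked_raw is raw          # the linked Raw instance is exposed as a property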

batch_apply(fn: Callable[[Dict, Any], Awaitable[Dict]], batch_size: int = 32, **kwargs) Corpus[source]
property linked_raw: Raw
map(fn: Callable[[DataFrame, Any], DataFrame], **kwargs) Corpus[source]
sample(fn: Callable[[DataFrame, Any], DataFrame], **kwargs) QA[source]

Sample the corpus for making QA. It selects a subset of the corpus and builds a QA set from it, so that questions can then be generated for the selected passages. This is the first step in making a QA set from the corpus. If you select just one passage per question, the result will be a single-hop QA set; if you select multiple passages per question, it will be a multi-hop QA set.

Parameters:

fn – The selection function to perform. It must return a QA dataframe.

Returns:

The selected QA instance. It contains qid and retrieval_gt columns.
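Continuing the construction sketch above, a possible call that pairs sample() with random_single_hop from the sample module; treating the extra keyword arguments as being forwarded to fn is an assumption.

   from autorag.data.beta.sample import random_single_hop

   # Pick two random passages as single-hop retrieval ground truth.
   qa = corpus.sample(random_single_hop, n=2, random_state=42)
   print(qa.data[["qid", "retrieval_gt"]])   # columns promised by the docstring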

to_parquet(save_path: str)[source]

Save the corpus to an AutoRAG-compatible parquet file. This is not for data creation, but for running AutoRAG. If you want to save the underlying dataframe directly, use corpus.data.to_parquet(save_path) instead.

Parameters:

save_path – The path to save the corpus.
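A short usage sketch with arbitrary paths:

   corpus.to_parquet("./corpus.parquet")             # AutoRAG-compatible output for running AutoRAG
   corpus.data.to_parquet("./raw_corpus.parquet")    # plain dataframe dump, as noted in the docstring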

class autorag.data.beta.schema.QA(qa_df: DataFrame | None = None, linked_corpus: Corpus | None = None)[source]

Bases: object

batch_apply(fn: Callable[[Dict, Any], Awaitable[Dict]], batch_size: int = 32, **kwargs) QA[source]
batch_filter(fn: Callable[[Dict, Any], Awaitable[bool]], batch_size: int = 32, **kwargs) QA[source]
filter(fn: Callable[[Dict, Any], bool], **kwargs) QA[source]
property linked_corpus: Corpus
make_retrieval_gt_contents() QA[source]

Make a retrieval_gt_contents column from the retrieval_gt column.

Returns:

The QA instance that has a retrieval_gt_contents column.
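A sketch of chaining the per-row helpers on a QA instance; the async question writer below is hypothetical and only illustrates the row-dict shapes that batch_apply and filter expect.

   # Hypothetical async row transform: adds a 'query' field to each QA row dict.
   async def write_dummy_query(row: dict, **kwargs) -> dict:
       row["query"] = f"What is discussed in passage set {row['qid']}?"
       return row

   qa = (
       qa.make_retrieval_gt_contents()                       # attach passage contents for retrieval_gt
         .batch_apply(write_dummy_query, batch_size=8)       # async, applied in batches
         .filter(lambda row, **kw: bool(row.get("query")))   # drop rows without a query
   )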

map(fn: Callable[[DataFrame, Any], DataFrame], **kwargs) QA[source]
to_parquet(qa_save_path: str, corpus_save_path: str)[source]

Save the QA set and corpus to AutoRAG-compatible parquet files. This is not for data creation, but for running AutoRAG. If you want to save the underlying dataframe directly, use qa.data.to_parquet(save_path) instead.

Parameters:
  • qa_save_path – The path to save the qa dataset.

  • corpus_save_path – The path to save the corpus.
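A short usage sketch with arbitrary paths; both files are written so an AutoRAG run can load a matching QA set and corpus.

   qa.to_parquet("./qa.parquet", "./corpus.parquet")
   qa.data.to_parquet("./raw_qa.parquet")   # plain dataframe dump, as noted in the docstring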

update_corpus(new_corpus: Corpus) QA[source]

Update the linked corpus. It does not simply replace linked_corpus with the new Corpus; it remaps the whole retrieval_gt to the new corpus using linked_raw. The QA data must have a retrieval_gt column.

Parameters:

new_corpus – The Corpus to link in place of the current one. It must have a valid linked_raw and the raw_id, raw_start_idx, and raw_end_idx columns.

Returns:

The QA instance with the updated linked corpus.
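A sketch of remapping an existing QA set onto a re-chunked corpus; the chunk module name and its parameters are placeholders, since the available chunk modules are not listed on this page.

   # Re-chunk the original raw documents (placeholder module name and params),
   # then remap retrieval_gt onto the new chunks via linked_raw.
   new_corpus = qa.linked_corpus.linked_raw.chunk("some_chunk_module", chunk_size=512)
   qa = qa.update_corpus(new_corpus)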

class autorag.data.beta.schema.Raw(raw_df: DataFrame | None = None)[source]

Bases: object

The Raw class stores document parsing results. It can perform chunking. Its dataframe has two columns, ‘raw_id’ and ‘contents’.
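A minimal construction-and-chunking sketch; the chunk module name and its parameters are placeholders.

   import pandas as pd
   from autorag.data.beta.schema import Raw

   raw = Raw(pd.DataFrame({
       "raw_id": [0, 1],
       "contents": ["first parsed document ...", "second parsed document ..."],
   }))
   corpus = raw.chunk("some_chunk_module", chunk_size=512)   # placeholder module name and params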

batch_apply(fn: Callable[[Dict, Any], Awaitable[Dict]], batch_size: int = 32, **kwargs) Raw[source]
chunk(module_name: str, **module_params) Corpus[source]
flatmap(fn: Callable, **kwargs) Raw[source]
map(fn: Callable[[DataFrame, Any], DataFrame], **kwargs) Raw[source]

Module contents