autorag.data.corpus package

Submodules

autorag.data.corpus.langchain module

autorag.data.corpus.langchain.langchain_documents_to_parquet(langchain_documents: List[Document], output_filepath: str | None = None, upsert: bool = False) → DataFrame

Converts LangChain documents to a corpus DataFrame. If output_filepath is given, the corpus DataFrame is also saved to that path (file_dir/filename). The corpus DataFrame is returned whether or not a filepath is given. You can use this function to create corpus.parquet after loading and chunking with LangChain.

Parameters:

langchain_documents – List of LangChain documents.
output_filepath – Optional filepath to save the parquet file. If None, the function returns the processed data as a pd.DataFrame but does not save it as parquet. The file directory must exist, and the file extension must be .parquet.
upsert – If True, the function overwrites the file at output_filepath if it already exists. Default is False.

Returns:

Corpus data as pd.DataFrame
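
The output_filepath and upsert rules described above can be checked before calling the function. The following is a minimal sketch using only the standard library. check_output_filepath is a hypothetical helper, not part of AutoRAG, and the exception types are assumptions chosen for illustration. The documentation does not state what happens when the file already exists and upsert is False; the sketch assumes that case should be rejected.

```python
import os


def check_output_filepath(output_filepath, upsert=False):
    """Hypothetical pre-flight check mirroring the documented path rules.

    Not part of AutoRAG; the exception types are assumptions for illustration.
    """
    # None means "return the DataFrame only", so there is nothing to check.
    if output_filepath is None:
        return None

    # Documented rule: the file extension must be .parquet.
    if not output_filepath.endswith(".parquet"):
        raise ValueError(f"{output_filepath!r} must end with .parquet")

    # Documented rule: the target directory must already exist.
    directory = os.path.dirname(output_filepath) or "."
    if not os.path.isdir(directory):
        raise NotADirectoryError(f"directory {directory!r} does not exist")

    # Assumption: an existing file is only acceptable when upsert=True.
    if os.path.exists(output_filepath) and not upsert:
        raise FileExistsError(
            f"{output_filepath!r} exists; pass upsert=True to overwrite"
        )

    return output_filepath
```

Running such a check first surfaces a bad path before any documents are loaded or chunked. The same rules apply to the output_filepath and upsert parameters of the Llama Index functions in this package.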

autorag.data.corpus.llama_index module

autorag.data.corpus.llama_index.llama_documents_to_parquet(llama_documents: List[Document], output_filepath: str | None = None, upsert: bool = False) → DataFrame

Converts Llama Index documents to a corpus DataFrame. If output_filepath is given, the corpus DataFrame is also saved to that path (file_dir/filename). The corpus DataFrame is returned whether or not a filepath is given. You can use this function to create corpus.parquet after loading and chunking with Llama Index.

Parameters:

llama_documents – List of Llama Index documents.
output_filepath – Optional filepath to save the parquet file. If None, the function returns the processed data as a pd.DataFrame but does not save it as parquet. The file directory must exist, and the file extension must be .parquet.
upsert – If True, the function overwrites the file at output_filepath if it already exists. Default is False.

Returns:

Corpus data as pd.DataFrame

autorag.data.corpus.llama_index.llama_text_node_to_parquet(text_nodes: List[TextNode], output_filepath: str | None = None, upsert: bool = False) → DataFrame

Converts Llama Index text nodes to a corpus DataFrame. If output_filepath is given, the corpus DataFrame is also saved to that path (file_dir/filename). The corpus DataFrame is returned whether or not a filepath is given. You can use this function to create corpus.parquet after loading and chunking with Llama Index.

Parameters:

text_nodes – List of Llama Index text nodes.
output_filepath – Optional filepath to save the parquet file. If None, the function returns the processed data as a pd.DataFrame but does not save it as parquet. The file directory must exist, and the file extension must be .parquet.
upsert – If True, the function overwrites the file at output_filepath if it already exists. Default is False.

Returns:

Corpus data as pd.DataFrame
