autorag.data.corpus package

Submodules

autorag.data.corpus.langchain module

autorag.data.corpus.langchain.langchain_documents_to_parquet(langchain_documents: List[Document], output_filepath: str | None = None, upsert: bool = False) → DataFrame

Convert LangChain documents to a corpus dataframe. If output_filepath is given, the corpus dataframe is saved to that path (file_dir/filename). The corpus dataframe is returned whether or not a filepath is given. You can use this function to create corpus.parquet after loading and chunking documents with LangChain.

Parameters:
  • langchain_documents – List of LangChain documents.

  • output_filepath – Optional filepath at which to save the parquet file. If None, the function returns the processed data as a pd.DataFrame without saving it. The file's directory must already exist, and the file extension must be .parquet.

  • upsert – If True, the function overwrites the output file when it already exists. Default is False.

Returns:

Corpus data as pd.DataFrame
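
The output_filepath and upsert rules described above are shared by every function in this package. They can be sketched with the standard library alone. The helper name check_output_filepath is hypothetical and is not part of AutoRAG. Raising FileExistsError when the file exists and upsert is False is an assumption, since the parameter description only states what happens when upsert is True.

```python
import os


def check_output_filepath(output_filepath, upsert=False):
    """Hypothetical pre-flight check mirroring the documented path rules.

    Returns True if a file would be written, False if nothing would be saved.
    This is an illustration only, not AutoRAG's implementation.
    """
    if output_filepath is None:
        # No path given: the dataframe is returned but not saved.
        return False

    if not output_filepath.endswith(".parquet"):
        raise ValueError("File extension must be .parquet")

    # A bare filename refers to the current working directory.
    directory = os.path.dirname(output_filepath) or "."
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"Directory does not exist: {directory}")

    if os.path.exists(output_filepath) and not upsert:
        # Assumption: the documentation does not say what happens in this case.
        raise FileExistsError(f"File already exists: {output_filepath}")

    return True
```

Running a check like this before converting a large document set avoids doing the conversion work only to fail at the point of saving.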

autorag.data.corpus.llama_index module

autorag.data.corpus.llama_index.llama_documents_to_parquet(llama_documents: List[Document], output_filepath: str | None = None, upsert: bool = False) → DataFrame

Convert Llama Index documents to a corpus dataframe. If output_filepath is given, the corpus dataframe is saved to that path (file_dir/filename). The corpus dataframe is returned whether or not a filepath is given. You can use this function to create corpus.parquet after loading and chunking documents with Llama Index.

Parameters:
  • llama_documents – List of Llama Index documents.

  • output_filepath – Optional filepath at which to save the parquet file. If None, the function returns the processed data as a pd.DataFrame without saving it. The file's directory must already exist, and the file extension must be .parquet.

  • upsert – If True, the function overwrites the output file when it already exists. Default is False.

Returns:

Corpus data as pd.DataFrame

autorag.data.corpus.llama_index.llama_text_node_to_parquet(text_nodes: List[TextNode], output_filepath: str | None = None, upsert: bool = False) → DataFrame

Convert Llama Index text nodes to a corpus dataframe. If output_filepath is given, the corpus dataframe is saved to that path (file_dir/filename). The corpus dataframe is returned whether or not a filepath is given. You can use this function to create corpus.parquet after loading and chunking documents with Llama Index.

Parameters:
  • text_nodes – List of Llama Index text nodes.

  • output_filepath – Optional filepath at which to save the parquet file. If None, the function returns the processed data as a pd.DataFrame without saving it. The file's directory must already exist, and the file extension must be .parquet.

  • upsert – If True, the function overwrites the output file when it already exists. Default is False.

Returns:

Corpus data as pd.DataFrame
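
As a usage sketch, the wrapper below saves a corpus file from text nodes that were produced elsewhere. The wrapper name save_text_node_corpus, its overwrite argument, and the corpus.parquet location inside output_dir are hypothetical choices for illustration. Only the import path and the keyword arguments are taken from the signature above. The import is deferred to call time so the wrapper can be defined without AutoRAG installed.

```python
from pathlib import Path


def save_text_node_corpus(text_nodes, output_dir, overwrite=False):
    """Hypothetical wrapper around llama_text_node_to_parquet."""
    # Deferred import: the autorag package is only needed when this is called.
    from autorag.data.corpus.llama_index import llama_text_node_to_parquet

    out_dir = Path(output_dir)
    # The documented contract requires the directory to exist, so create it first.
    out_dir.mkdir(parents=True, exist_ok=True)

    # output_filepath is typed as str and must end in .parquet.
    output_filepath = str(out_dir / "corpus.parquet")

    # The corpus dataframe is returned whether or not a file is written.
    return llama_text_node_to_parquet(
        text_nodes,
        output_filepath=output_filepath,
        upsert=overwrite,
    )
```

Creating the directory in the wrapper keeps the "file directory must exist" requirement out of the caller's way, while upsert stays an explicit opt-in so an existing corpus file is not replaced by accident.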

Module contents