autorag.data.legacy.corpus package
Submodules
autorag.data.legacy.corpus.langchain module
- autorag.data.legacy.corpus.langchain.langchain_documents_to_parquet(langchain_documents: List[Document], output_filepath: str | None = None, upsert: bool = False) DataFrame
Converts LangChain documents to a corpus dataframe. The corpus dataframe is saved to the filepath (file_dir/filename) if one is given, and it is returned whether or not a filepath is given. You can use this method to create corpus.parquet after loading and chunking with LangChain.
- Parameters:
langchain_documents – List of langchain documents.
output_filepath – Optional filepath to save the parquet file. If None, the function returns the processed data as a pd.DataFrame without saving it as parquet. The file directory must exist, and the file extension must be .parquet.
upsert – If True, the function overwrites the file at output_filepath if it already exists. Default is False.
- Returns:
Corpus data as pd.DataFrame
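As a rough usage sketch of the signature above: the helper name `build_corpus_from_texts` is hypothetical and not part of AutoRAG, and the `langchain_core.documents.Document` import path is an assumption about the installed LangChain version. The imports are deferred into the function body so the snippet loads even when the optional packages are absent.

```python
from typing import List, Optional


def build_corpus_from_texts(texts: List[str], output_filepath: Optional[str] = None):
    """Wrap raw strings in LangChain documents and convert them to a corpus dataframe.

    Illustrative helper only; it is not part of AutoRAG.
    """
    # Assumed import paths; adjust them to the installed package versions.
    from langchain_core.documents import Document
    from autorag.data.legacy.corpus.langchain import langchain_documents_to_parquet

    documents = [
        Document(page_content=text, metadata={"source": f"text-{i}"})
        for i, text in enumerate(texts)
    ]
    # With output_filepath=None nothing is written; the dataframe is only returned.
    return langchain_documents_to_parquet(documents, output_filepath=output_filepath)
```

Calling `build_corpus_from_texts(["first chunk", "second chunk"])` should return the corpus dataframe without touching the disk; passing a path ending in `.parquet` inside an existing directory should also save it.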
autorag.data.legacy.corpus.llama_index module
- autorag.data.legacy.corpus.llama_index.llama_documents_to_parquet(llama_documents: List[Document], output_filepath: str | None = None, upsert: bool = False) DataFrame
Converts Llama Index documents to a corpus dataframe. The corpus dataframe is saved to the filepath (file_dir/filename) if one is given, and it is returned whether or not a filepath is given. You can use this method to create corpus.parquet after loading and chunking with Llama Index.
- Parameters:
llama_documents – List of Llama Index documents.
output_filepath – Optional filepath to save the parquet file. If None, the function returns the processed data as a pd.DataFrame without saving it as parquet. The file directory must exist, and the file extension must be .parquet.
upsert – If True, the function overwrites the file at output_filepath if it already exists. Default is False.
- Returns:
Corpus data as pd.DataFrame
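The `output_filepath` and `upsert` rules above can be checked before any conversion work is done. The sketch below uses only the standard library; `will_save` is a hypothetical helper, not AutoRAG's code. The exception types, and the choice to raise when the file exists and `upsert` is False, are assumptions, since this page does not state what happens in that case.

```python
import os
import tempfile
from typing import Optional


def will_save(output_filepath: Optional[str], upsert: bool = False) -> bool:
    """Return True if a parquet file would be written under the documented rules.

    Hypothetical pre-check; the raised exception types are assumptions.
    """
    if output_filepath is None:
        return False  # the dataframe is returned but not saved
    if not output_filepath.endswith(".parquet"):
        raise ValueError("output_filepath must end with .parquet")
    directory = os.path.dirname(output_filepath) or "."
    if not os.path.isdir(directory):
        raise NotADirectoryError(f"directory {directory} does not exist")
    if os.path.exists(output_filepath) and not upsert:
        # Assumed behavior: refuse to overwrite unless upsert is True.
        raise FileExistsError(f"{output_filepath} exists; pass upsert=True")
    return True


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "corpus.parquet")
    fresh = will_save(target)                    # path is free, directory exists
    open(target, "wb").close()                   # simulate a file from an earlier run
    overwrite = will_save(target, upsert=True)   # overwrite explicitly allowed
    try:
        will_save(target)                        # same path without upsert
        blocked = False
    except FileExistsError:
        blocked = True
```

Running the check first keeps a long load-and-chunk job from failing at the final save step because of a mistyped directory or extension.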
- autorag.data.legacy.corpus.llama_index.llama_text_node_to_parquet(text_nodes: List[TextNode], output_filepath: str | None = None, upsert: bool = False) DataFrame
Converts Llama Index text nodes to a corpus dataframe. The corpus dataframe is saved to the filepath (file_dir/filename) if one is given, and it is returned whether or not a filepath is given. You can use this method to create corpus.parquet after loading and chunking with Llama Index.
- Parameters:
text_nodes – List of llama index text nodes.
output_filepath – Optional filepath to save the parquet file. If None, the function returns the processed data as a pd.DataFrame without saving it as parquet. The file directory must exist, and the file extension must be .parquet.
upsert – If True, the function overwrites the file at output_filepath if it already exists. Default is False.
- Returns:
Corpus data as pd.DataFrame
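A minimal sketch of feeding already-chunked text into the function above. The helper name `build_corpus_from_chunks` is hypothetical, and the `llama_index.core.schema.TextNode` import path is an assumption about the installed Llama Index version. Imports are deferred so the snippet loads without the optional packages.

```python
from typing import List, Optional


def build_corpus_from_chunks(
    chunks: List[str],
    output_filepath: Optional[str] = None,
    upsert: bool = False,
):
    """Wrap text chunks in Llama Index text nodes and convert them to a corpus dataframe.

    Illustrative helper only; it is not part of AutoRAG.
    """
    # Assumed import paths; adjust them to the installed package versions.
    from llama_index.core.schema import TextNode
    from autorag.data.legacy.corpus.llama_index import llama_text_node_to_parquet

    text_nodes = [TextNode(text=chunk) for chunk in chunks]
    # upsert is forwarded unchanged; it only matters when output_filepath is given.
    return llama_text_node_to_parquet(
        text_nodes, output_filepath=output_filepath, upsert=upsert
    )
```

In practice the text nodes would usually come from a Llama Index node parser rather than being built by hand; the manual construction here only keeps the sketch short.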