autorag.data.chunk package¶
Submodules¶
autorag.data.chunk.base module¶
autorag.data.chunk.langchain_chunk module¶
- autorag.data.chunk.langchain_chunk.langchain_chunk(texts: List[str], chunker: TextSplitter, file_name_language: str | None = None, metadata_list: List[Dict[str, str]] | None = None) Tuple[List[str], List[str], List[str], List[Tuple[int, int]], List[Dict[str, Any]]] [source]¶
Chunk texts from the parsed result using the langchain chunk method
- param texts:
The list of texts to chunk from the parsed result
- param chunker:
A langchain TextSplitter(Chunker) instance.
- param file_name_language:
The language to use for the ‘add_file_name’ feature. You must set it to either ‘English’ or ‘Korean’. The ‘add_file_name’ feature adds the file_name to the chunked contents, which is used to prevent hallucination caused by retrieving contents from the wrong document. The default format for ‘English’ is “file_name: {file_name}\ncontents: {content}”.
- param metadata_list:
The list of dict of metadata from the parsed result
- return:
A tuple of lists containing the chunked doc_id, contents, path, (start_idx, end_idx), and metadata
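
A minimal usage sketch, assuming the function is called directly with the signature documented above. The example texts, the chosen RecursiveCharacterTextSplitter settings, and the metadata key (path) are illustrative assumptions, not fixed by this API.

```python
# Depending on your langchain version, the splitter may instead live in
# the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

from autorag.data.chunk.langchain_chunk import langchain_chunk

# Parsed texts and their metadata; the "path" key is an assumed example.
texts = ["Long parsed document text goes here ..."]
metadata_list = [{"path": "docs/example_report.pdf"}]

# Any langchain TextSplitter instance works; this is just one choice.
chunker = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)

doc_ids, contents, paths, start_end_idxs, metadata = langchain_chunk(
    texts,
    chunker,
    file_name_language="English",  # prepend "file_name: ..." to each chunk
    metadata_list=metadata_list,
)
```

The five returned lists are expected to be aligned per chunk, so the i-th doc_id, content, path, (start_idx, end_idx), and metadata all describe the same chunk.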
autorag.data.chunk.llama_index_chunk module¶
- autorag.data.chunk.llama_index_chunk.llama_index_chunk(texts: List[str], chunker: NodeParser, file_name_language: str | None = None, metadata_list: List[Dict[str, str]] | None = None, batch: int = 8) Tuple[List[str], List[str], List[str], List[Tuple[int, int]], List[Dict[str, Any]]] [source]¶
Chunk texts from the parsed result using the llama index chunk method
- param texts:
The list of texts to chunk from the parsed result
- param chunker:
A llama index NodeParser(Chunker) instance.
- param file_name_language:
The language to use for the ‘add_file_name’ feature. You must set it to either ‘English’ or ‘Korean’. The ‘add_file_name’ feature adds the file_name to the chunked contents, which is used to prevent hallucination caused by retrieving contents from the wrong document. The default format for ‘English’ is “file_name: {file_name}\ncontents: {content}”.
- param metadata_list:
The list of dict of metadata from the parsed result
- param batch:
The batch size for chunking texts. Default is 8
- return:
A tuple of lists containing the chunked doc_id, contents, path, (start_idx, end_idx), and metadata
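
A minimal usage sketch, again assuming a direct call with the documented signature. The TokenTextSplitter choice, its chunk sizes, the example texts, and the metadata key (path) are illustrative assumptions.

```python
from llama_index.core.node_parser import TokenTextSplitter

from autorag.data.chunk.llama_index_chunk import llama_index_chunk

# Parsed texts and their metadata; the "path" key is an assumed example.
texts = ["Long parsed document text goes here ..."]
metadata_list = [{"path": "docs/example_report.pdf"}]

# Any llama index NodeParser instance works; this is just one choice.
chunker = TokenTextSplitter(chunk_size=256, chunk_overlap=32)

doc_ids, contents, paths, start_end_idxs, metadata = llama_index_chunk(
    texts,
    chunker,
    file_name_language="Korean",
    metadata_list=metadata_list,
    batch=8,  # batch size for chunking texts (default 8)
)
```

Apart from the extra batch parameter and the NodeParser-based chunker, the inputs and the aligned output lists mirror those of langchain_chunk above.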