autorag.data.chunk package

Submodules

autorag.data.chunk.base module

autorag.data.chunk.base.add_file_name(file_name_language: str, file_names: List[str], chunk_texts: List[str]) → List[str][source]
autorag.data.chunk.base.chunker_node(func)[source]
autorag.data.chunk.base.make_metadata_list(parsed_result: DataFrame) → List[Dict[str, str]][source]

autorag.data.chunk.langchain_chunk module

autorag.data.chunk.langchain_chunk.langchain_chunk(texts: List[str], chunker: TextSplitter, file_name_language: str | None = None, metadata_list: List[Dict[str, str]] | None = None) → Tuple[List[str], List[str], List[str], List[Tuple[int, int]], List[Dict[str, Any]]][source]

Chunk texts from the parsed result using the langchain chunking method.

param texts:

The list of texts to chunk from the parsed result

param chunker:

A langchain TextSplitter (chunker) instance.

param file_name_language:

The language for the ‘add_file_name’ feature. Set it to either ‘English’ or ‘Korean’. The ‘add_file_name’ feature prepends the file name to each chunked content, which helps prevent hallucination caused by retrieving contents from the wrong document. The default ‘English’ form is “file_name: {file_name}

contents: {content}”

param metadata_list:

The list of dict of metadata from the parsed result

return:

a tuple of lists containing, for each chunk, the doc_id, contents, path, (start_idx, end_idx) pair, and metadata
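
Example usage, as a minimal sketch: the sample texts, the ‘path’ metadata key, and the splitter settings below are illustrative assumptions, not values fixed by this API. The splitter import path depends on the installed LangChain version.

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    # on newer versions: from langchain_text_splitters import RecursiveCharacterTextSplitter

    from autorag.data.chunk.langchain_chunk import langchain_chunk

    # Hypothetical inputs; in practice these come from the parsed result.
    texts = ["Long document body ...", "Another document body ..."]
    metadata_list = [{"path": "docs/a.md"}, {"path": "docs/b.md"}]

    chunker = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)

    doc_ids, contents, paths, idx_pairs, metadata = langchain_chunk(
        texts,
        chunker,
        file_name_language="English",  # prepend "file_name: ..." to each chunk
        metadata_list=metadata_list,
    )
    # idx_pairs holds one (start_idx, end_idx) pair per chunk,
    # matching the List[Tuple[int, int]] element of the return annotation.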

autorag.data.chunk.langchain_chunk.langchain_chunk_pure(text: str, chunker: TextSplitter, file_name_language: str | None = None, _metadata: Dict[str, str] | None = None)[source]

autorag.data.chunk.llama_index_chunk module

autorag.data.chunk.llama_index_chunk.llama_index_chunk(texts: List[str], chunker: NodeParser, file_name_language: str | None = None, metadata_list: List[Dict[str, str]] | None = None, batch: int = 8) → Tuple[List[str], List[str], List[str], List[Tuple[int, int]], List[Dict[str, Any]]][source]

Chunk texts from the parsed result using the llama index chunking method.

param texts:

The list of texts to chunk from the parsed result

param chunker:

A llama index NodeParser (chunker) instance.

param file_name_language:

The language for the ‘add_file_name’ feature. Set it to either ‘English’ or ‘Korean’. The ‘add_file_name’ feature prepends the file name to each chunked content, which helps prevent hallucination caused by retrieving contents from the wrong document. The default ‘English’ form is “file_name: {file_name}

contents: {content}”

param metadata_list:

The list of dict of metadata from the parsed result

param batch:

The batch size for chunking texts. Default is 8.

return:

a tuple of lists containing, for each chunk, the doc_id, contents, path, (start_idx, end_idx) pair, and metadata
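
Example usage, as a minimal sketch: the NodeParser choice, its settings, and the ‘path’ metadata key are illustrative assumptions, and the import path assumes llama-index >= 0.10.

    from llama_index.core.node_parser import TokenTextSplitter

    from autorag.data.chunk.llama_index_chunk import llama_index_chunk

    texts = ["Long document body ..."]  # hypothetical parsed texts
    chunker = TokenTextSplitter(chunk_size=256, chunk_overlap=32)

    doc_ids, contents, paths, idx_pairs, metadata = llama_index_chunk(
        texts,
        chunker,
        file_name_language="English",
        metadata_list=[{"path": "docs/a.md"}],  # hypothetical metadata
        batch=4,  # process 4 texts per batch
    )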

async autorag.data.chunk.llama_index_chunk.llama_index_chunk_pure(text: str, chunker: NodeParser, file_name_language: str | None = None, _metadata: Dict[str, str] | None = None)[source]

autorag.data.chunk.run module

autorag.data.chunk.run.run_chunker(modules: List[Callable], module_params: List[Dict], parsed_result: DataFrame, project_dir: str)[source]
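
A hedged invocation sketch: run_chunker is undocumented on this page, so the parsed_result column names (‘texts’, ‘path’) and the ‘chunk_method’ module parameter below are assumptions about how the chunker_node decorator builds the chunker, not guarantees of this API.

    import pandas as pd

    from autorag.data.chunk.llama_index_chunk import llama_index_chunk
    from autorag.data.chunk.run import run_chunker

    # Hypothetical parsed result; real ones come from the AutoRAG parse step.
    parsed_result = pd.DataFrame({
        "texts": ["Document body ..."],
        "path": ["docs/a.md"],
    })

    run_chunker(
        modules=[llama_index_chunk],
        module_params=[{"chunk_method": "Token", "batch": 4}],  # assumed keys
        parsed_result=parsed_result,
        project_dir="./chunk_project",  # outputs are written under this directory
    )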

Module contents