Chunk

In this section, we will cover how to chunk the parsed result.

This is a crucial step: if the parsed result is not chunked well, the RAG pipeline cannot be optimized well.

You can apply the various chunk methods using only a YAML file. The chunked result is saved in the data format used by AutoRAG.

Overview

The sample chunk pipeline looks like this.

from autorag.chunker import Chunker

chunker = Chunker.from_parquet(parsed_data_path="your/parsed/data/path")
chunker.start_chunking("your/path/to/chunk_config.yaml")

Features

1. Add File Name

Set the add_file_name parameter to either 'en' (English) or 'ko' (Korean). The add_file_name feature prepends the file name to the contents of each chunk. This helps prevent hallucination caused by retrieving contents from the wrong document. The default English form is "file_name: {file_name}\n contents: {content}".
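As a rough illustration of the English form, the prefix can be reproduced in plain Python. The helper name add_file_name_en is hypothetical and is not part of AutoRAG; only the template string is taken from the text above.

```python
def add_file_name_en(file_name: str, content: str) -> str:
    """Prefix a chunk with its source file name (English form)."""
    # Template copied from the documentation; note the space after the newline.
    return f"file_name: {file_name}\n contents: {content}"


# Hypothetical file name and chunk text, used only for illustration.
chunk = add_file_name_en("report.pdf", "Revenue grew in the third quarter.")
print(chunk)
```

Because the file name travels with every chunk, a retrieved passage still identifies the document it came from.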

Example YAML

modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24
    add_file_name: en

2. Sentence Splitter

The following chunk methods in the llama_index_chunk module use the sentence splitter.

  • Semantic_llama_index

  • SemanticDoubling

  • SentenceWindow

These methods use PunktSentenceTokenizer as the default sentence splitter.

PunktSentenceTokenizer supports the following languages:

Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Italian, Malayalam, Norwegian, Polish, Portuguese, Russian, Slovenian, Spanish, Swedish, Turkish

If the language you want to use is not in the list, or you want to use a different sentence splitter, set the sentence_splitter parameter.

Available Sentence Splitter

  • kiwi : For Korean 🇰🇷

Example YAML

modules:
  - module_type: llama_index_chunk
    chunk_method: [ SentenceWindow ]
    sentence_splitter: kiwi
    window_size: 3
    add_file_name: en

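Using a sentence splitter that is not in Available Sentence Splitter

To use a sentence splitter that is not listed above, register it in sentence_splitter_modules. The following code shows how kiwi is registered, as an example.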

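from typing import Callable, List

from autorag.data import sentence_splitter_modules, LazyInit


def split_by_sentence_kiwi() -> Callable[[str], List[str]]:
    # Import lazily so kiwipiepy is only required when this splitter is used.
    from kiwipiepy import Kiwi

    kiwi = Kiwi()

    def split(text: str) -> List[str]:
        # split_into_sents returns sentence objects; keep only their text.
        kiwi_result = kiwi.split_into_sents(text)
        return [sentence.text for sentence in kiwi_result]

    return split


# Register the splitter under the name used by the sentence_splitter parameter.
sentence_splitter_modules["kiwi"] = LazyInit(split_by_sentence_kiwi)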
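The kiwi registration follows a factory pattern: a function with no arguments returns a callable that maps a string to a list of sentences. A minimal sketch of another splitter with the same shape is shown here, using a simple regular expression. The function name split_by_sentence_regex and the key "my_regex" are hypothetical, and the regex is a naive stand-in rather than anything AutoRAG ships.

```python
import re
from typing import Callable, List


def split_by_sentence_regex() -> Callable[[str], List[str]]:
    # Naive rule: split on whitespace that follows '.', '!' or '?'.
    pattern = re.compile(r"(?<=[.!?])\s+")

    def split(text: str) -> List[str]:
        # Drop empty strings so blank input yields an empty list.
        return [s for s in pattern.split(text.strip()) if s]

    return split


splitter = split_by_sentence_regex()
sentences = splitter("First sentence. Second one! Is this the third?")
print(sentences)

# Registration would mirror the kiwi example (the key name is an assumption):
# sentence_splitter_modules["my_regex"] = LazyInit(split_by_sentence_regex)
```

A rule this simple breaks on abbreviations and decimal numbers, so a language-aware splitter is usually the better choice for real data.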

Run Chunk Pipeline

1. Set chunker instance

from autorag.chunker import Chunker

chunker = Chunker.from_parquet(parsed_data_path="your/parsed/data/path")

Want to specify the project folder?

You can specify the project directory with the --project_dir option or the project_dir parameter.
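A minimal sketch of passing the project directory from Python follows. It assumes that project_dir is accepted as a keyword argument by Chunker.from_parquet; the text above names the parameter but not the function that takes it. The helpers ensure_project_dir and make_chunker are hypothetical names introduced for illustration.

```python
from pathlib import Path


def ensure_project_dir(project_dir: str) -> Path:
    """Create the project directory if it does not exist and return it."""
    path = Path(project_dir)
    path.mkdir(parents=True, exist_ok=True)
    return path


def make_chunker(parsed_data_path: str, project_dir: str):
    """Build a Chunker whose results are written under project_dir."""
    target = ensure_project_dir(project_dir)

    # Imported lazily so this helper can be defined without AutoRAG installed.
    from autorag.chunker import Chunker

    # Assumption: from_parquet accepts project_dir as a keyword argument.
    return Chunker.from_parquet(
        parsed_data_path=parsed_data_path, project_dir=str(target)
    )


# Usage (paths are placeholders):
# chunker = make_chunker("your/parsed/data/path", "your/project/dir")
```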

2. Set YAML file

Here is an example of how to use the llama_index_chunk module.

modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24

3. Start chunking

Use the start_chunking function to start chunking.

chunker.start_chunking("your/path/to/chunk_config.yaml")

4. Check the result

If you set the project_dir parameter, you can check the result in the project directory. If not, you can check the result in the current directory.

The result is checked in the same way as for the Evaluator and Parser in AutoRAG.

A trial_folder is first created in project_dir.

If chunking completes successfully, the following three types of files are created in the trial_folder.

  1. Chunked Result

  2. Used YAML file

  3. Summary file

For example, if chunking is performed using three chunk methods, the following files are created: 0.parquet, 1.parquet, 2.parquet, chunk_config.yaml, summary.csv.

Finally, the summary.csv file contains information about each chunked result, such as which chunk method was used to produce it.
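To confirm which result files a run produced, the output folder can be inspected with the standard library alone. This is a sketch under the file layout described above (numbered .parquet files plus summary.csv); the function name list_chunk_outputs is hypothetical, and the temporary folder with empty placeholder files only stands in for a real trial folder.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def list_chunk_outputs(trial_dir: str) -> Dict[str, List[str]]:
    """Return the numbered result files and the summary file found in trial_dir."""
    root = Path(trial_dir)
    # Sort numerically so that 10.parquet comes after 2.parquet.
    parquet_files = sorted(
        (p for p in root.glob("*.parquet") if p.stem.isdigit()),
        key=lambda p: int(p.stem),
    )
    return {
        "results": [p.name for p in parquet_files],
        "summary": [p.name for p in root.glob("summary.csv")],
    }


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("0.parquet", "2.parquet", "10.parquet", "summary.csv"):
        (Path(tmp) / name).touch()
    outputs = list_chunk_outputs(tmp)

print(outputs)
```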

Output Columns

  • doc_id: Document ID. The type is string.

  • contents: The contents of the chunked data. The type is string.

  • path: The path of the document. The type is string.

  • start_end_idx:

    • Stores the start and end indices of the chunked string within the original string before chunking.

    • Stored so that the retrieval_gt of the evaluation QA dataset can be mapped across different chunk methods.

  • metadata: Metadata carried over from the parsed result into each chunked passage. The type is dictionary.

    • Following the data format of AutoRAG's parsed result, metadata should have the following keys: page, last_modified_datetime, path.
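The idea behind start_end_idx can be sketched in a few lines: locate the chunk inside the original text and record where it begins and ends. This is an illustration, not AutoRAG's implementation. The function name locate_chunk is hypothetical, and the inclusive end index is an assumption, since the text above does not say whether the end index is inclusive or exclusive.

```python
from typing import Tuple


def locate_chunk(original_str: str, chunked_str: str) -> Tuple[int, int]:
    """Return (start, end) positions of chunked_str inside original_str."""
    start = original_str.find(chunked_str)
    if start == -1:
        raise ValueError("chunk not found in the original text")
    # Assumption: the end index is inclusive (position of the last character).
    return start, start + len(chunked_str) - 1


original = "AutoRAG parses documents. Then it chunks them."
chunk = "Then it chunks them."
start, end = locate_chunk(original, chunk)

# The recorded span recovers exactly the chunk text.
assert original[start : end + 1] == chunk
```

Because the indices refer to the original text, the same passage can be located again after the corpus is re-chunked with a different method.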

Supported Chunk Modules

📌 You can check all of our Chunk modules here.