# Chunk

In this section, we will cover how to chunk the parsed result.

Chunking is a crucial step: if the parsed result is not chunked well, the RAG pipeline cannot be optimized well.
Using only YAML files, you can easily apply various chunk methods.
The chunked result is saved in the data format used by AutoRAG.

## Overview

The sample chunk pipeline looks like this.

```python
from autorag.chunker import Chunker

chunker = Chunker.from_parquet(parsed_data_path="your/parsed/data/path")
chunker.start_chunking("your/path/to/chunk_config.yaml")
```

## Features

### 1. Add File Name

The `add_file_name` feature adds the file name to each chunk's contents.
This helps prevent hallucinations caused by retrieving contents from the wrong document.
You need to set the `add_file_name` parameter to one of `english` or `korean`, as in the example YAML below.

The default English form is `"file_name: {file_name}\n contents: {content}"`.

#### Example YAML

```yaml
modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24
    add_file_name: english
```

### 2. Sentence Splitter

The following chunk methods in the `llama_index_chunk` module use a sentence splitter.

- `Semantic_llama_index`
- `SemanticDoubling`
- `SentenceWindow`

These methods use `PunktSentenceTokenizer` as the default sentence splitter.
`PunktSentenceTokenizer` supports the following languages:
Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Italian, Malayalam, Norwegian, Polish, Portuguese, Russian, Slovenian, Spanish, Swedish, and Turkish.

So if the language you want to use is not in this list, or if you want to use a different sentence splitter, you can set the `sentence_splitter` parameter.

#### Available Sentence Splitter

- [kiwi](https://github.com/bab2min/kiwipiepy): for Korean 🇰🇷

#### Example YAML

```yaml
modules:
  - module_type: llama_index_chunk
    chunk_method: [ SentenceWindow ]
    sentence_splitter: kiwi
    window_size: 3
    add_file_name: english
```

#### Using a sentence splitter that is not in the Available Sentence Splitter list

You can register your own sentence splitter in `sentence_splitter_modules`.
For example, `kiwi` itself can be registered with the following code.

```python
from typing import Callable, List

from autorag.data import sentence_splitter_modules, LazyInit


def split_by_sentence_kiwi() -> Callable[[str], List[str]]:
    from kiwipiepy import Kiwi

    kiwi = Kiwi()

    def split(text: str) -> List[str]:
        kiwi_result = kiwi.split_into_sents(text)
        sentences = list(map(lambda x: x.text, kiwi_result))
        return sentences

    return split


sentence_splitter_modules["kiwi"] = LazyInit(split_by_sentence_kiwi)
```

## Run Chunk Pipeline

### 1. Set chunker instance

```python
from autorag.chunker import Chunker

chunker = Chunker.from_parquet(parsed_data_path="your/parsed/data/path")
```

```{admonition} Want to specify project folder?
You can specify the project directory with the `--project_dir` option or the `project_dir` parameter, as in the sketch below.
```
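A minimal sketch of passing `project_dir` directly (both paths are placeholders):

```python
from autorag.chunker import Chunker

# Placeholder paths. `project_dir` is where the trial folder
# and the chunked results will be written.
chunker = Chunker.from_parquet(
    parsed_data_path="your/parsed/data/path",
    project_dir="your/project/directory",
)
```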
### 2. Set YAML file

Here is an example of how to use the `llama_index_chunk` module.

```yaml
modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24
```

### 3. Start chunking

Use the `start_chunking` function to start chunking.

```python
chunker.start_chunking("your/path/to/chunk_config.yaml")
```

### 4. Check the result

If you set the `project_dir` parameter, you can check the result in the project directory.
If not, you can check the result in the current directory.

The way to check the result is the same as with the `Evaluator` and `Parser` in AutoRAG.
A trial folder is created in `project_dir` first.

If the chunking is completed successfully, the following three types of files are created in the trial folder.

1. Chunked result
2. Used YAML file
3. Summary file

For example, if chunking is performed using three chunk methods, the following files are created:
`0.parquet`, `1.parquet`, `2.parquet`, `chunk_config.yaml`, `summary.csv`.

Finally, in the `summary.csv` file, you can see information about the chunked results, such as which chunk method was used for each one.

## Output Columns

- `doc_id`: Document ID. The type is string.
- `contents`: The contents of the chunked data. The type is string.
- `path`: The path of the document. The type is string.
- `start_end_idx`:
  - Stores the start and end index of the chunked string within the original string before chunking.
  - Stored to map the `retrieval_gt` of the evaluation QA dataset across the various chunk methods.
- `metadata`: Metadata from the parsed result, carried over to each chunked passage. The type is dictionary.
  - Following the data format of AutoRAG's `Parsed Result`, metadata should have the following keys: `page`, `last_modified_datetime`, `path`.

#### Supported Chunk Modules

📌 You can check all our chunk modules [here](https://edai.notion.site/Supporting-Chunk-Modules-8db803dba2ec4cd0a8789659106e86a3?pvs=4)

```{toctree}
---
maxdepth: 1
---
langchain_chunk.md
llama_index_chunk.md
```
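As a quick sanity check after a trial finishes, you can open the outputs with pandas. This is a minimal sketch, not part of the AutoRAG API: the trial folder name `0` and all paths are placeholder assumptions, and the column names follow the Output Columns section above.

```python
import pandas as pd

# Placeholder path: a trial folder inside your project directory.
trial_dir = "your/project/directory/0"

# summary.csv shows which chunk method produced each numbered parquet file.
summary = pd.read_csv(f"{trial_dir}/summary.csv")
print(summary)

# Each chunked result is a parquet file in the AutoRAG chunk data format.
chunked = pd.read_parquet(f"{trial_dir}/0.parquet")
print(chunked.columns)  # doc_id, contents, path, start_end_idx, metadata
```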