Clova¶
Parse raw documents using Naver Clova OCR.
Clova OCR splits each document into pages and parses them page by page.
Table Detection¶
If your raw documents contain tables, set table_detection: true to use the Clova OCR table detection feature.
Key Points¶
1. HTML Parser¶
Clova OCR returns parsed table information in a complex JSON format. AutoRAG converts this JSON representation of the table to HTML, so that the table is stored in a form an LLM can read.
The parser was written by the AutoRAG team. You can find the detailed code in the json_to_html_table function in autorag.data.parse.clova.
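The idea behind this kind of conversion can be sketched as follows. This is a minimal illustration, not the actual json_to_html_table implementation: the function name cells_to_html_table and the simplified cell schema (row, col, row_span, col_span, text) are assumptions made for this example and do not match the real Clova OCR response structure.

```python
from html import escape


def cells_to_html_table(cells):
    """Render simplified table cells as an HTML string.

    Each cell is a dict with hypothetical keys: "row", "col", "text",
    and optional "row_span" / "col_span" (both default to 1).
    """
    # Group the cells by their row index.
    rows = {}
    for cell in cells:
        rows.setdefault(cell["row"], []).append(cell)

    parts = ["<table>"]
    for row_index in sorted(rows):
        parts.append("<tr>")
        # Emit the cells of each row in column order.
        for cell in sorted(rows[row_index], key=lambda c: c["col"]):
            row_span = cell.get("row_span", 1)
            col_span = cell.get("col_span", 1)
            attrs = ""
            if row_span > 1:
                attrs += f' rowspan="{row_span}"'
            if col_span > 1:
                attrs += f' colspan="{col_span}"'
            # Escape the cell text so that characters like "&" stay valid HTML.
            text = escape(cell["text"])
            parts.append(f"<td{attrs}>{text}</td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


example_cells = [
    {"row": 0, "col": 0, "text": "Name"},
    {"row": 0, "col": 1, "text": "Score"},
    {"row": 1, "col": 0, "col_span": 2, "text": "A & B"},
]
example_html = cells_to_html_table(example_cells)
```

Merged cells are expressed with rowspan and colspan attributes, which is one reason HTML suits tables that a flat text format cannot represent.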
2. The text information comes separately from the table information.¶
If your document contains both tables and text, Clova OCR returns the text separately from the table information.
So when table_detection is enabled, the result is saved in the {text}\n\ntable html:\n{table} format.
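As a small illustration of that output format, the sketch below joins a page's text with one table's HTML. The helper name combine_text_and_table is hypothetical and is not part of the AutoRAG API. It only mirrors the template described above.

```python
def combine_text_and_table(text: str, table_html: str) -> str:
    """Join page text and table HTML using the documented template."""
    # Template from the docs: {text}\n\ntable html:\n{table}
    return f"{text}\n\ntable html:\n{table_html}"


combined = combine_text_and_table(
    "Quarterly results are summarized below.",
    "<table><tr><td>Q1</td><td>10</td></tr></table>",
)
```

The fixed "table html:" marker separates the plain text from the table, so downstream code can split the two parts again if needed.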
Example YAML¶
modules:
  - module_type: clova
    table_detection: true