Table Hybrid Parse

Parse raw documents using a combination of text and table parsing modules.

Because OCR models are paid models, it can be expensive to OCR-parse all raw documents.

OCR models are primarily used to parse raw documents that contain tables. Therefore, it is cost-effective to parse raw documents that do not contain tables with non-OCR methods and parse raw documents that do contain tables with OCR.

To accomplish this, the Table Hybrid Parse module performs parsing in the following steps

  1. breaks the raw document into pages.

  2. uses table detection to distinguish between pages that contain tables and pages that do not.

  3. pages that do not contain tables are parsed by the text parsing module.

  4. Parses pages that contain tables with the table parsing module.

  5. merge the parsing results to return the final result.

Table Detection

Use PDFPlumber to split pages with and without tables.

Table Parse Available Modules

  • llama_parse

    • You need to add result_type: markdown to the table_params.

  • clova

    • You need to add table_detection: true to the table_params.

  • langchain_parse

    • You need to add parse_method: upstagelayoutanalysis to the table_params.

Parameters

  • text_parse_module: str

    • The module to use for text parsing.

  • text_params: dict

    • parameters for the text parsing module

  • table_parse_module: str

    • The module to use for table parsing.

  • table_params: dict

    • parameters for the table parsing module

Example YAML

If you want to use the langchain_parse module for text parsing and the clova module for table parsing, you can use the table_hybrid_parse module.

modules:
  - module_type: table_hybrid_parse
    text_parse_module: langchain_parse
    text_params:
      parse_method: pdfplumber
    table_parse_module: clova
    table_params:
      table_detection: true