Langchain Parse

Parse raw documents using LangChain document loaders.

Available Parse Methods by File Type

1. PDF

Example YAML

modules:
  - module_type: langchain_parse
    parse_method: [ pdfminer, pdfplumber ]

2. CSV

Example YAML

modules:
  - module_type: langchain_parse
    parse_method: csv

3. JSON

❗Required Parameter

  • jq_schema: the jq query expression used to extract content from the JSON file.

Example YAML

modules:
  - module_type: langchain_parse
    parse_method: json
    jq_schema: .messages[].content
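
To illustrate what a jq_schema such as .messages[].content selects, here is a minimal sketch that emulates the expression in plain Python. It does not call jq or the loader itself; the sample payload and the helper name extract_messages_content are hypothetical and exist only for this illustration.

```python
import json

# Hypothetical JSON payload shaped to match the `.messages[].content` schema.
raw = """
{
  "messages": [
    {"sender": "alice", "content": "Hello"},
    {"sender": "bob", "content": "Hi there"}
  ]
}
"""


def extract_messages_content(data: dict) -> list:
    """Plain-Python equivalent of the jq expression `.messages[].content`.

    `.messages` selects the "messages" array, `[]` iterates over its items,
    and `.content` picks the "content" field of each item.
    """
    return [message["content"] for message in data["messages"]]


contents = extract_messages_content(json.loads(raw))
print(contents)
```

If the JSON file has a different shape, the jq_schema has to be adjusted so that it points at the field holding the text you want parsed.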

4. Markdown

Example YAML

modules:
  - module_type: langchain_parse
    parse_method: unstructuredmarkdown

5. HTML

Example YAML

modules:
  - module_type: langchain_parse
    parse_method: bshtml

6. XML

Example YAML

modules:
  - module_type: langchain_parse
    parse_method: unstructuredxml

7. All files

📌 API Needed

You need an API key to use the following document loader.

Example YAML

modules:
  - module_type: langchain_parse
    parse_method: upstagelayoutanalysis
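
Before running this parse method, it can help to confirm that the API key is actually visible to the process. The sketch below assumes the key is supplied through an environment variable named UPSTAGE_API_KEY; that variable name and the helper has_upstage_api_key are assumptions for illustration, so check the loader's own documentation for the exact name it expects.

```python
import os


def has_upstage_api_key(env=os.environ) -> bool:
    """Return True if a non-empty API key is present in the given environment.

    ASSUMPTION: the key is read from the UPSTAGE_API_KEY environment variable.
    """
    return bool(env.get("UPSTAGE_API_KEY", "").strip())


if not has_upstage_api_key():
    print("UPSTAGE_API_KEY is not set; the upstagelayoutanalysis method will likely fail.")
```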

Using a Parse Method That Is Not in the Available Parse Methods

You can find more information about the document loaders in the LangChain documentation.

How to Use

To use a loader that is not among the available parse methods, such as PyPDFDirectoryLoader, register it in parse_modules with the following code.

from autorag.data import parse_modules
from langchain_community.document_loaders import PyPDFDirectoryLoader

# Register the loader class under a lowercase key.
parse_modules["pypdfdirectory"] = PyPDFDirectoryLoader

Attention

The key in parse_modules must always be lowercase.
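
One way to avoid accidentally registering a mixed-case key is to normalize it before insertion. The sketch below uses a plain dictionary as a stand-in for parse_modules so that it runs without AutoRAG installed. The helper register_parse_method and the DummyLoader class are hypothetical and not part of AutoRAG or LangChain.

```python
# Stand-in for autorag.data.parse_modules (assumed to behave like a dict
# mapping parse method names to loader classes).
parse_modules = {}


class DummyLoader:
    """Hypothetical placeholder for a LangChain document loader class."""


def register_parse_method(name: str, loader_cls: type) -> str:
    """Register loader_cls under the lowercased name and return the key used."""
    key = name.lower()
    parse_modules[key] = loader_cls
    return key


key = register_parse_method("PyPDFDirectory", DummyLoader)
print(key)
```

With a real setup, the same normalization would be applied to the dictionary imported from autorag.data, and the resulting key is what goes into parse_method in the YAML file.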