Langchain Parse¶

Parse raw documents using LangChain document_loaders.

Available Parse Methods by File Type¶

1. PDF¶

Example YAML¶

modules:
  - module_type: langchain_parse
    parse_method: [ pdfminer, pdfplumber ]

2. CSV¶

Example YAML¶

modules:
  - module_type: langchain_parse
    parse_method: csv

3. JSON¶

JSON

❗Required Parameter¶

jq_schema: the jq expression used to extract the content from the JSON file.

Example YAML¶

modules:
  - module_type: langchain_parse
    parse_method: json
    jq_schema: .messages[].content
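To make the jq_schema above concrete, here is a minimal sketch in plain Python of what the expression `.messages[].content` selects. The sample JSON and the helper name `extract_messages_content` are assumptions made for illustration; the real extraction is performed by the JSON loader, not by this code.

```python
import json

# Hypothetical sample file content, shaped to match the jq_schema above.
raw = '{"messages": [{"content": "Hello"}, {"content": "How are you?"}]}'


def extract_messages_content(data):
    """Plain-Python equivalent of the jq expression `.messages[].content`."""
    # `.messages[]` iterates over the list; `.content` picks one field from each item.
    return [message["content"] for message in data["messages"]]


contents = extract_messages_content(json.loads(raw))
print(contents)
```

If your JSON file is shaped differently, the jq_schema has to change to match it; the expression above only works when the top-level object has a `messages` list whose items carry a `content` field.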

4. Markdown¶

UnstructuredMarkdown

Example YAML¶

modules:
  - module_type: langchain_parse
    parse_method: unstructuredmarkdown

5. HTML¶

BSHTML

Example YAML¶

modules:
  - module_type: langchain_parse
    parse_method: bshtml

6. XML¶

UnstructuredXML

Example YAML¶

modules:
  - module_type: langchain_parse
    parse_method: unstructuredxml

7. All files¶

Directory

📌 API Needed¶

You need to have an API key to use the following document loaders.

Unstructured
- UNSTRUCTURED_API_KEY must be set as an environment variable.
UpstageDocumentParseLoader
- UPSTAGE_API_KEY must be set as an environment variable.

Example YAML¶

modules:
  - module_type: langchain_parse
    parse_method: upstagedocumentparse
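Since these loaders depend on an API key, it can help to check the environment before parsing starts. The following is a minimal sketch, not part of the library: the helper `missing_api_key` and the mapping are hypothetical, and the `unstructured` key is an assumed parse_method name (only `upstagedocumentparse` appears in the example above).

```python
import os

# Environment variable names taken from the list above.
# The "unstructured" key is an assumed parse_method name.
REQUIRED_KEYS = {
    "unstructured": "UNSTRUCTURED_API_KEY",
    "upstagedocumentparse": "UPSTAGE_API_KEY",
}


def missing_api_key(parse_method, environ=os.environ):
    """Return the name of the unset variable, or None if nothing is missing."""
    name = REQUIRED_KEYS.get(parse_method.lower())
    # Methods without an entry need no key; an empty value counts as unset.
    if name is None or environ.get(name):
        return None
    return name


missing = missing_api_key("upstagedocumentparse")
if missing:
    print(f"Set {missing} before running the parse step.")
```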

Using a Parse Method That Is Not Listed Above¶

You can find more information about the available document loaders in the LangChain documentation.

How to Use¶

To use a loader that is not among the available parse methods, such as PyPDFDirectoryLoader, register it in parse_modules with the following code.

from autorag.data import parse_modules
from langchain_community.document_loaders import PyPDFDirectoryLoader

# Register the loader under a lowercase key.
parse_modules["pypdfdirectory"] = PyPDFDirectoryLoader

Attention

The key used in parse_modules must always be lowercase.
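One way to avoid a mixed-case key by accident is to normalize it at registration time. This is a minimal sketch under stated assumptions: `register_loader` and `DummyLoader` are hypothetical names, and a plain dict stands in for parse_modules, which the code above shows only as supporting item assignment.

```python
def register_loader(registry, name, loader_cls):
    """Store loader_cls under a lowercase key and return that key."""
    key = name.lower()
    registry[key] = loader_cls
    return key


class DummyLoader:
    """Hypothetical stand-in for a LangChain document loader class."""


registry = {}  # stand-in for autorag's parse_modules
key = register_loader(registry, "PyPDFDirectory", DummyLoader)
print(key)  # pypdfdirectory
```

The returned key is presumably the value to write as parse_method in the YAML file, following the pattern of the examples in this page.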