Langchain Parse
Parse raw documents using LangChain document_loaders.
Available Parse Methods by File Type
1. PDF
Example YAML
modules:
  - module_type: langchain_parse
    parse_method: [ pdfminer, pdfplumber ]
2. CSV
Example YAML
modules:
  - module_type: langchain_parse
    parse_method: csv
3. JSON
❗Required Parameter
jq_schema: the jq schema used to extract the content from the JSON file.
Example YAML
modules:
  - module_type: langchain_parse
    parse_method: json
    jq_schema: .messages[].content
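To make the jq_schema above concrete, the following sketch reproduces in plain Python what the expression .messages[].content selects. The sample data and the helper name select_message_contents are hypothetical and exist only for illustration; the real extraction is performed by the JSON loader using jq, not by this code.

```python
import json

# Hypothetical sample file content, shaped to match `.messages[].content`.
raw = """
{
  "messages": [
    {"sender": "alice", "content": "Hello"},
    {"sender": "bob", "content": "Hi there"}
  ]
}
"""


def select_message_contents(data: dict) -> list:
    """Plain-Python equivalent of the jq expression `.messages[].content`."""
    # `.messages` picks the "messages" key, `[]` iterates over the array,
    # and `.content` picks the "content" field of each element.
    return [message["content"] for message in data["messages"]]


contents = select_message_contents(json.loads(raw))
print(contents)
```

If your JSON files have a different shape, the jq_schema has to be adapted so that it points at the field holding the text you want to parse.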
4. Markdown
Example YAML
modules:
  - module_type: langchain_parse
    parse_method: unstructuredmarkdown
5. HTML
Example YAML
modules:
  - module_type: langchain_parse
    parse_method: bshtml
6. XML
Example YAML
modules:
  - module_type: langchain_parse
    parse_method: unstructuredxml
7. All files
📌 API Needed
You need an API key to use the following document loaders.
- UNSTRUCTURED_API_KEY should be set as an environment variable.
- UPSTAGE_API_KEY should be set as an environment variable.
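Before starting a parse run, it can help to confirm that the keys are actually visible to the Python process. The sketch below is a minimal check under that assumption; the helper name missing_api_keys is hypothetical and is not part of AutoRAG or LangChain.

```python
import os

# Environment variable names taken from the list above.
REQUIRED_KEYS = ("UNSTRUCTURED_API_KEY", "UPSTAGE_API_KEY")


def missing_api_keys(env=None) -> list:
    """Return the names of the required keys that are unset or empty."""
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All API keys are set.")
```

The function accepts an optional mapping so it can be tried out without touching the real environment, for example missing_api_keys({"UPSTAGE_API_KEY": "dummy"}).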
Example YAML
modules:
  - module_type: langchain_parse
    parse_method: upstagedocumentparse
Using a Parse Method That Is Not in the Available Parse Methods
You can find more information about the document loaders in the LangChain documentation.
How to Use
If you want to use a loader that is not among the available parse methods, such as PyPDFDirectoryLoader, you can register it with the following code.
from autorag.data import parse_modules
from langchain_community.document_loaders import PyPDFDirectoryLoader

# Register the loader class under a lowercase key.
parse_modules["pypdfdirectory"] = PyPDFDirectoryLoader
Attention
The key in parse_modules must always be written in lowercase.
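One way to avoid breaking the lowercase rule by accident is to normalize the key at registration time. The sketch below assumes parse_modules behaves like an ordinary dict; it uses a stand-in dict and a placeholder class so that it runs without AutoRAG or LangChain installed. The helper register_parse_method and the class DummyLoader are hypothetical names, not part of either library.

```python
# Stand-in for autorag.data.parse_modules so the sketch runs on its own.
parse_modules = {}


class DummyLoader:
    """Hypothetical placeholder for a LangChain document loader class."""


def register_parse_method(registry: dict, name: str, loader_cls) -> str:
    """Store loader_cls under a lowercase key and return that key."""
    key = name.lower()
    registry[key] = loader_cls
    return key


key = register_parse_method(parse_modules, "PyPDFDirectory", DummyLoader)
print(key)  # pypdfdirectory
```

With the real registry, the same helper would presumably be called with the imported parse_modules and the actual loader class, and the returned key is the name to write as parse_method in the YAML file.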