Query Generation¶
In this document, we will cover how to generate questions for the QA dataset.
Overview¶
You can use the batch_apply function on a QA instance to generate questions.
Before generating questions, the QA instance must have the qid and retrieval_gt columns. You can get them by calling sample on a Corpus instance.
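For example, here is a minimal sketch of building such a QA from a Corpus (corpus_df is assumed to be your already-chunked corpus DataFrame, and random_single_hop is one of AutoRAG's bundled samplers):
from autorag.data.qa.schema import Corpus
from autorag.data.qa.sample import random_single_hop
corpus = Corpus(corpus_df)
qa = corpus.sample(random_single_hop, n=50)  # fills the qid and retrieval_gt columns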
Attention
In the OpenAI version of data creation, you can use only ‘gpt-4o-2024-08-06’ and ‘gpt-4o-mini-2024-07-18’. If you want to use another model, use the llama_index version instead.
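For example, the llama_index version lets you pass any model that llama_index supports. A short sketch using factoid generation (introduced below); the model name is only an illustration:
from llama_index.llms.openai import OpenAI
from autorag.data.qa.query.llama_gen_query import factoid_query_gen
llm = OpenAI(model="gpt-4o")  # swap in whichever model you want to use
result_qa = qa.batch_apply(factoid_query_gen, llm=llm, lang="en")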
Question types¶
1. Factoid¶
Factoid questions are those seeking brief, factual information that can be easily verified. They typically require a yes or no answer or a brief explanation and often inquire about specific details such as dates, names, places, or events.
It supports the “en”, “ko”, and “ja” languages.
Factoid Example¶
What is the capital of France?
Who invented the light bulb?
When was Wikipedia founded?
Usage¶
OpenAI
from openai import AsyncOpenAI
from autorag.data.qa.schema import QA
from autorag.data.qa.query.openai_gen_query import factoid_query_gen
qa = QA(qa_df)
result_qa = qa.batch_apply(factoid_query_gen, client=AsyncOpenAI(), lang="ko")
LlamaIndex
from llama_index.llms.openai import OpenAI
from autorag.data.qa.schema import QA
from autorag.data.qa.query.llama_gen_query import factoid_query_gen
llm = OpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(factoid_query_gen, llm=llm, lang="ko")
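After batch_apply returns, each row carries its generated question. A short sketch for inspecting the output, assuming the underlying DataFrame is exposed as result_qa.data and the generated questions land in a query column:
print(result_qa.data["query"].head())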
2. Concept Completion¶
A “concept completion” question asks directly about the essence or identity of a concept (for example, “What is a neural network?”).
It supports the “en”, “ko”, and “ja” languages.
Usage¶
OpenAI
from openai import AsyncOpenAI
from autorag.data.qa.schema import QA
from autorag.data.qa.query.openai_gen_query import concept_completion_query_gen
qa = QA(qa_df)
result_qa = qa.batch_apply(concept_completion_query_gen, client=AsyncOpenAI(), lang="ko")
LlamaIndex
from llama_index.llms.openai import OpenAI
from autorag.data.qa.schema import QA
from autorag.data.qa.query.llama_gen_query import concept_completion_query_gen
llm = OpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(concept_completion_query_gen, llm=llm, lang="ko")
3. Two-hop Incremental¶
This query generation method comes from this paper. To make a robust multi-hop question, it first selects what the answer will be, then generates a question from the first document, and finally evolves that question with the second document into a multi-hop question. Because both documents are used, each row’s retrieval_gt needs two passages.
We recommend the openai version because its results are more stable: it uses structured output.
It supports the “en”, “ko”, and “ja” languages.
Example¶
In which Mexican state can one find the Ciudad Deportiva, home to the Tecolotes de Nuevo Laredo?
Which group has more members, New Jeans or Aespa?
What is the name of the first album released by the band that performed at the 2022 Super Bowl halftime show?
Usage¶
OpenAI
from openai import AsyncOpenAI
from autorag.data.qa.schema import QA
from autorag.data.qa.query.openai_gen_query import two_hop_incremental
qa = QA(qa_df)
result_qa = qa.batch_apply(two_hop_incremental, client=AsyncOpenAI())
LlamaIndex
from llama_index.llms.openai import OpenAI
from autorag.data.qa.schema import QA
from autorag.data.qa.query.llama_gen_query import two_hop_incremental
llm = OpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(two_hop_incremental, llm=llm)
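Whichever question type you use, you will usually want to persist the result afterwards. A minimal sketch, assuming QA exposes to_parquet(qa_path, corpus_path) as in AutoRAG's tutorials:
result_qa.to_parquet("./qa.parquet", "./corpus.parquet")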
4. Custom¶
You can generate questions with custom prompts. By creating a list of llama_index ChatMessage objects, you can generate questions in ways other than the ones AutoRAG provides by default, or in languages other than “en”, “ko”, and “ja”.
Usage¶
LlamaIndex
from autorag.data.qa.schema import QA
from autorag.data.qa.query.llama_gen_query import custom_query_gen
from llama_index.core.base.llms.types import ChatMessage, MessageRole
from llama_index.llms.openai import OpenAI
messages = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content="""As an expert AI assistant, create one question that can be directly answered from the provided **Text**.
Ensure the question:
- Is clearly based on information within the **Text** and avoids assumptions or external context.
- Excludes any introductory phrases, such as "In the given text...", "Based on the context...", "Here are some questions based on the context...".
- Includes the question sentence only.
- Is concise, relevant, and encourages specific, direct answers.
- Is written in the same language as this prompt.
""",
    )
]
llm = OpenAI()
qa = QA(qa_df)  # qa_df must already contain the qid and retrieval_gt columns
result_qa = qa.batch_apply(custom_query_gen, llm=llm, messages=messages)
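For example, to generate questions in a language AutoRAG does not support out of the box, write the system prompt in that language. A sketch with an illustrative French prompt:
messages = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content="Créez une seule question à laquelle on peut répondre directement à partir du **Texte** fourni. N'incluez que la phrase interrogative.",
    )
]
result_qa = qa.batch_apply(custom_query_gen, llm=llm, messages=messages)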