Query Generation¶
In this document, we will cover how to generate questions for the QA dataset.
Overview¶
You can use the `batch_apply` function on a `QA` instance to generate questions. Before generating questions, the `QA` instance must have the `qid` and `retrieval_gt` columns. You can obtain these by calling `sample` on the `Corpus` instance.
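As a rough, self-contained sketch of this flow (the function names `fake_query_gen` and `batch_apply_sketch` are illustrative stand-ins, not AutoRAG APIs): `batch_apply` conceptually maps an async row-level function over every row of the QA dataframe, and each row must already carry `qid` and `retrieval_gt` before generation runs.

```python
import asyncio

# Illustrative stand-in for a query-generation function; the real ones
# (e.g. factoid_query_gen) call an LLM. This name is NOT part of AutoRAG.
async def fake_query_gen(row: dict, lang: str = "en") -> dict:
    row["query"] = f"[{lang}] question grounded in {row['retrieval_gt']}"
    return row

# Conceptually what QA.batch_apply does: run `fn` over every row concurrently.
async def batch_apply_sketch(rows: list[dict], fn, **kwargs) -> list[dict]:
    return await asyncio.gather(*(fn(dict(r), **kwargs) for r in rows))

# Rows must already have `qid` and `retrieval_gt` before query generation.
rows = [
    {"qid": "q-0001", "retrieval_gt": [["doc-0001"]]},
    {"qid": "q-0002", "retrieval_gt": [["doc-0007"]]},
]
result = asyncio.run(batch_apply_sketch(rows, fake_query_gen, lang="ko"))
```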
Attention
The OpenAI version of data creation supports only ‘gpt-4o-2024-08-06’ and ‘gpt-4o-mini-2024-07-18’. If you want to use another model, use the llama_index version instead.
Question types¶
1. Factoid¶
Factoid questions are those seeking brief, factual information that can be easily verified. They typically require a yes or no answer or a brief explanation and often inquire about specific details such as dates, names, places, or events.
It supports “en” and “ko” languages.
Factoid Example¶
What is the capital of France?
Who invented the light bulb?
When was Wikipedia founded?
Usage¶
OpenAI
```python
from openai import AsyncOpenAI

from autorag.data.beta.schema import QA
from autorag.data.beta.query.openai_gen_query import factoid_query_gen

openai_client = AsyncOpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(factoid_query_gen, client=openai_client, lang="ko")
```
LlamaIndex
```python
from llama_index.llms.openai import OpenAI

from autorag.data.beta.schema import QA
from autorag.data.beta.query.llama_gen_query import factoid_query_gen

llm = OpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(factoid_query_gen, llm=llm, lang="ko")
```
2. Concept Completion¶
A “concept completion” question asks directly about the essence or identity of a concept.
It supports “en” and “ko” languages.
Usage¶
OpenAI
```python
from openai import AsyncOpenAI

from autorag.data.beta.schema import QA
from autorag.data.beta.query.openai_gen_query import concept_completion_query_gen

openai_client = AsyncOpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(concept_completion_query_gen, client=openai_client, lang="ko")
```
LlamaIndex
```python
from llama_index.llms.openai import OpenAI

from autorag.data.beta.schema import QA
from autorag.data.beta.query.llama_gen_query import concept_completion_query_gen

llm = OpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(concept_completion_query_gen, llm=llm, lang="ko")
```
3. Two-hop Incremental¶
This query generation method comes from this paper. To build a robust multi-hop question, it first selects what the answer will be. It then generates a question from the first document, and finally evolves that question with the second document into a multi-hop question.
We recommend the openai version because its results are more stable; it uses structured output.
It supports “en” and “ko” languages.
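The three stages above can be sketched without an LLM as follows. This is a hedged illustration of the flow only: the real implementation prompts an LLM with structured output, and every function name and string-splitting rule here is made up for demonstration, not part of the AutoRAG API.

```python
# Toy illustration of the two-hop incremental stages (NOT the library's code).
def two_hop_incremental_sketch(doc_a: str, doc_b: str) -> dict:
    # Stage 1: fix the final answer first (taken here from the second document).
    answer = doc_b.split(" is in ")[-1]
    # Stage 2: generate a one-hop question from the first document alone.
    subject_a = doc_a.split(" is ")[0]
    one_hop = f"Where is {subject_a} located?"
    # Stage 3: evolve it with the second document into a multi-hop question.
    subject_b = doc_b.split(" is in ")[0]
    two_hop = f"In which state is the home city of {subject_b}?"
    return {"answer": answer, "one_hop": one_hop, "two_hop": two_hop}

result = two_hop_incremental_sketch(
    "Ciudad Deportiva is the home of the Tecolotes de Nuevo Laredo",
    "The Tecolotes de Nuevo Laredo is in Tamaulipas",
)
```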
Example¶
In which Mexican state can one find the Ciudad Deportiva, home to the Tecolotes de Nuevo Laredo?
Which group has more members, New Jeans or Aespa?
What is the name of the first album released by the band that performed at the 2022 Super Bowl halftime show?
Usage¶
OpenAI
```python
from openai import AsyncOpenAI

from autorag.data.beta.schema import QA
from autorag.data.beta.query.openai_gen_query import two_hop_incremental

openai_client = AsyncOpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(two_hop_incremental, client=openai_client)
```
LlamaIndex
```python
from llama_index.llms.openai import OpenAI

from autorag.data.beta.schema import QA
from autorag.data.beta.query.llama_gen_query import two_hop_incremental

llm = OpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(two_hop_incremental, llm=llm)
```