Query Generation

In this document, we will cover how to generate questions for the QA dataset.

Overview

Use the batch_apply function of a QA instance to generate questions. Before generating questions, the QA data must have the qid and retrieval_gt columns. You can obtain them by calling sample on a Corpus instance.
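
As a pre-flight check for the column requirement above, the sketch below reports which required columns are missing before any LLM call is made. The helper name missing_required_columns is hypothetical and not part of AutoRAG; it only inspects column names, so it works with a pandas DataFrame's columns or a plain list.

```python
# Columns the text above says must exist before query generation.
REQUIRED_COLUMNS = ("qid", "retrieval_gt")


def missing_required_columns(columns):
    """Return the required column names that are absent from `columns`.

    Hypothetical helper for illustration; not an AutoRAG function.
    """
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# With a pandas DataFrame you would pass `qa_df.columns`; a list works the same way.
print(missing_required_columns(["qid", "retrieval_gt"]))  # []
print(missing_required_columns(["qid", "contents"]))      # ['retrieval_gt']
```

Running this check first avoids spending API calls on a dataset that has not been sampled from the corpus yet.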

Attention

The OpenAI version of data creation supports only ‘gpt-4o-2024-08-06’ and ‘gpt-4o-mini-2024-07-18’. To use another model, use the llama_index version instead.
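
The model restriction above can be turned into a small routing rule. This is a minimal sketch: the function pick_query_gen_module is hypothetical, while the two model names and the two module paths are the ones that appear in this document.

```python
# Models the note above lists as usable with the OpenAI version.
OPENAI_SUPPORTED_MODELS = frozenset({"gpt-4o-2024-08-06", "gpt-4o-mini-2024-07-18"})


def pick_query_gen_module(model_name: str) -> str:
    """Return the module path to import query generators from.

    Hypothetical helper for illustration; not an AutoRAG function.
    """
    if model_name in OPENAI_SUPPORTED_MODELS:
        return "autorag.data.beta.query.openai_gen_query"
    # Any other model falls back to the llama_index version.
    return "autorag.data.beta.query.llama_gen_query"


print(pick_query_gen_module("gpt-4o-mini-2024-07-18"))
print(pick_query_gen_module("some-other-model"))
```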

Question types

Factoid
Concept Completion
Two-hop Incremental

1. Factoid

Factoid questions are those seeking brief, factual information that can be easily verified. They typically require a yes or no answer or a brief explanation and often inquire about specific details such as dates, names, places, or events.

It supports the “en” and “ko” languages, selected with the lang parameter.

Factoid Example

What is the capital of France?
Who invented the light bulb?
When was Wikipedia founded?

Usage

OpenAI

from openai import AsyncOpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.openai_gen_query import factoid_query_gen

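# AsyncOpenAI reads the API key from the OPENAI_API_KEY environment variable by default.
openai_client = AsyncOpenAI()
# qa_df must already contain the qid and retrieval_gt columns.
qa = QA(qa_df)
# lang selects the question language: "en" or "ko".
result_qa = qa.batch_apply(factoid_query_gen, client=openai_client, lang="ko")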

LlamaIndex

from llama_index.llms.openai import OpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.llama_gen_query import factoid_query_gen

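llm = OpenAI()
# qa_df must already contain the qid and retrieval_gt columns.
qa = QA(qa_df)
# lang selects the question language: "en" or "ko".
result_qa = qa.batch_apply(factoid_query_gen, llm=llm, lang="ko")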

2. Concept Completion

A “concept completion” question asks directly about the essence or identity of a concept.

It supports the “en” and “ko” languages, selected with the lang parameter.

Usage

OpenAI

from openai import AsyncOpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.openai_gen_query import concept_completion_query_gen

openai_client = AsyncOpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(concept_completion_query_gen, client=openai_client, lang="ko")

LlamaIndex

from llama_index.llms.openai import OpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.llama_gen_query import concept_completion_query_gen

llm = OpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(concept_completion_query_gen, llm=llm, lang="ko")

3. Two-hop Incremental

This query generation method comes from this paper. To build a robust multi-hop question, it first selects the answer. It then generates a one-hop question from the first document. Finally, it evolves that question into a multi-hop question using the second document.
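
The three steps described above can be sketched in plain Python. This is only an illustration of the flow, not AutoRAG's implementation: the function names, the prompt wording, the choice to pick the answer from the first document, and the scripted stand-in for the LLM are all assumptions. The two passages are illustrative inputs written for this sketch.

```python
from typing import Callable, Dict, List, Tuple


def two_hop_incremental_sketch(doc1: str, doc2: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Illustrative flow only; prompts are assumptions, not AutoRAG's prompts."""
    # Step 1: decide the answer first, so both questions target the same fact.
    # (Assumption: the answer is picked from the first document.)
    answer = llm(f"Choose a short answer stated in this passage.\nPassage: {doc1}")
    # Step 2: write a one-hop question that the first document alone can answer.
    one_hop = llm(
        f"Write a question whose answer is '{answer}', using only this passage.\nPassage: {doc1}"
    )
    # Step 3: evolve the question with the second document so both are needed.
    two_hop = llm(
        "Rewrite the question so that answering it also requires this passage, "
        f"keeping the answer '{answer}'.\nQuestion: {one_hop}\nPassage: {doc2}"
    )
    return {"answer": answer, "one_hop_question": one_hop, "two_hop_question": two_hop}


def make_scripted_llm(replies: List[str]) -> Tuple[Callable[[str], str], List[str]]:
    """Hypothetical stand-in for an LLM: returns canned replies in order."""
    remaining = iter(replies)
    prompts: List[str] = []

    def llm(prompt: str) -> str:
        prompts.append(prompt)  # record prompts so the flow can be inspected
        return next(remaining)

    return llm, prompts


llm, prompts = make_scripted_llm([
    "Tamaulipas",
    "In which Mexican state is Nuevo Laredo located?",
    "In which Mexican state can one find the Ciudad Deportiva, home to the Tecolotes de Nuevo Laredo?",
])
result = two_hop_incremental_sketch(
    "Nuevo Laredo is a city in the Mexican state of Tamaulipas.",
    "The Ciudad Deportiva in Nuevo Laredo is home to the Tecolotes de Nuevo Laredo.",
    llm,
)
print(result["two_hop_question"])
```

Fixing the answer before writing either question keeps the final multi-hop question anchored to a fact that the documents actually support.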

We recommend the OpenAI version because it uses structured output, which makes its results more stable.

It supports the “en” and “ko” languages.

Example

In which Mexican state can one find the Ciudad Deportiva, home to the Tecolotes de Nuevo Laredo?
Which group has more members, New Jeans or Aespa?
What is the name of the first album released by the band that performed at the 2022 Super Bowl halftime show?

Usage

OpenAI

from openai import AsyncOpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.openai_gen_query import two_hop_incremental

openai_client = AsyncOpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(two_hop_incremental, client=openai_client)

LlamaIndex

from llama_index.llms.openai import OpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.llama_gen_query import two_hop_incremental

llm = OpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(two_hop_incremental, llm=llm)