Query Generation

This document covers how to generate questions for the QA dataset.

Overview

Use the batch_apply function of a QA instance to generate questions. Before generating questions, the QA instance must have the qid and retrieval_gt columns. You can obtain them by calling sample on a Corpus instance.
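Because question generation depends on the qid and retrieval_gt columns, a quick pre-check can catch a missing column before any LLM call is made. The following is a minimal sketch under stated assumptions: `missing_qa_columns` is a hypothetical helper, not part of AutoRAG, and it only needs an iterable of column names (for example `qa_df.columns`).

```python
from typing import Iterable, List

# Columns that must exist before generating questions.
REQUIRED_COLUMNS = ("qid", "retrieval_gt")


def missing_qa_columns(columns: Iterable[str]) -> List[str]:
    """Return the required columns that are absent, in a stable order.

    Hypothetical helper for illustration; not an AutoRAG function.
    """
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Plain lists stand in for `qa_df.columns` here.
print(missing_qa_columns(["qid", "retrieval_gt"]))  # []
print(missing_qa_columns(["qid", "contents"]))      # ['retrieval_gt']
```

If the returned list is not empty, sample from the Corpus instance first and only then call batch_apply.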

Attention

The OpenAI version of data creation supports only ‘gpt-4o-2024-08-06’ and ‘gpt-4o-mini-2024-07-18’. To use another model, use the llama_index version instead.
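The restriction above can be turned into a small guard that decides which version to import from. This is only a sketch: `pick_query_gen_module` is a hypothetical helper, and the two module paths are the ones that appear in the Usage sections of this document.

```python
# Model names the OpenAI version accepts, as listed above.
OPENAI_VERSION_MODELS = frozenset(
    {"gpt-4o-2024-08-06", "gpt-4o-mini-2024-07-18"}
)


def pick_query_gen_module(model_name: str) -> str:
    """Return the module path to import query generators from.

    Hypothetical helper for illustration; not an AutoRAG function.
    """
    if model_name in OPENAI_VERSION_MODELS:
        return "autorag.data.beta.query.openai_gen_query"
    # Any other model falls back to the llama_index version.
    return "autorag.data.beta.query.llama_gen_query"
```

For example, `pick_query_gen_module("gpt-4o-mini-2024-07-18")` selects the OpenAI module, while any other model name selects the llama_index one.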

Question types

  1. Factoid

  2. Concept Completion

  3. Two-hop Incremental

1. Factoid

Factoid questions are those seeking brief, factual information that can be easily verified. They typically require a yes or no answer or a brief explanation and often inquire about specific details such as dates, names, places, or events.

It supports “en” and “ko” languages.

Factoid Example

  • What is the capital of France?

  • Who invented the light bulb?

  • When was Wikipedia founded?

Usage

  • OpenAI

from openai import AsyncOpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.openai_gen_query import factoid_query_gen

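# AsyncOpenAI() reads the API key from the OPENAI_API_KEY environment variable.
openai_client = AsyncOpenAI()
# qa_df must already contain the qid and retrieval_gt columns.
qa = QA(qa_df)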
result_qa = qa.batch_apply(factoid_query_gen, client=openai_client, lang="ko")

  • LlamaIndex

from llama_index.llms.openai import OpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.llama_gen_query import factoid_query_gen

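# Swap in another llama_index LLM here to use a different model.
llm = OpenAI()
# qa_df must already contain the qid and retrieval_gt columns.
qa = QA(qa_df)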
result_qa = qa.batch_apply(factoid_query_gen, llm=llm, lang="ko")

2. Concept Completion

A “concept completion” question asks directly about the essence or identity of a concept.

It supports “en” and “ko” languages.

Usage

  • OpenAI

from openai import AsyncOpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.openai_gen_query import concept_completion_query_gen

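# AsyncOpenAI() reads the API key from the OPENAI_API_KEY environment variable.
openai_client = AsyncOpenAI()
# qa_df must already contain the qid and retrieval_gt columns.
qa = QA(qa_df)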
result_qa = qa.batch_apply(concept_completion_query_gen, client=openai_client, lang="ko")

  • LlamaIndex

from llama_index.llms.openai import OpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.llama_gen_query import concept_completion_query_gen

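# Swap in another llama_index LLM here to use a different model.
llm = OpenAI()
# qa_df must already contain the qid and retrieval_gt columns.
qa = QA(qa_df)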
result_qa = qa.batch_apply(concept_completion_query_gen, llm=llm, lang="ko")

3. Two-hop Incremental

This query generation method comes from this paper. To build a robust multi-hop question, it first selects what the answer will be. Then it generates a question from the first document. Finally, it uses the second document to evolve that question into a multi-hop question.
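The three steps described above can be sketched as a chain of LLM calls. This is an illustration only, not AutoRAG's implementation: the prompts are placeholders, `two_hop_sketch` is a hypothetical function, and the sketch assumes the answer is picked from the first document. The LLM is modelled as any function that maps a prompt string to a reply string.

```python
from typing import Callable, Dict

# Any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def two_hop_sketch(doc1: str, doc2: str, llm: LLM) -> Dict[str, str]:
    """Illustrate the three steps; the prompts are placeholders."""
    # Step 1: decide what the final answer will be
    # (assumption: it is taken from the first document).
    answer = llm(f"Pick a short answer found in this document:\n{doc1}")

    # Step 2: write a one-hop question from the first document.
    one_hop = llm(
        f"Document:\n{doc1}\nAnswer: {answer}\n"
        "Write a question whose answer is the answer above."
    )

    # Step 3: evolve the one-hop question with the second document so
    # that answering it requires both documents.
    two_hop = llm(
        f"Document:\n{doc2}\nQuestion: {one_hop}\nAnswer: {answer}\n"
        "Rewrite the question so it also requires this document."
    )
    return {
        "answer": answer,
        "one_hop_question": one_hop,
        "two_hop_question": two_hop,
    }
```

Fixing the answer first keeps both questions anchored to the same target, so the evolved two-hop question cannot drift to a different answer.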

We recommend the OpenAI version because it uses structured output, which makes its results more stable.
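Structured output means the model's reply follows a fixed schema, so it can be parsed and validated instead of being scraped out of free text. The sketch below shows that idea with the standard library only. The field names are assumptions chosen for illustration, and this is not how AutoRAG or the OpenAI SDK implements it.

```python
import json
from typing import Dict

# Hypothetical field names; AutoRAG's actual schema may differ.
EXPECTED_FIELDS = ("answer", "one_hop_question", "two_hop_question")


def parse_structured_reply(raw: str) -> Dict[str, str]:
    """Parse a JSON reply and check every expected field is a non-empty string."""
    data = json.loads(raw)  # raises a ValueError subclass on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    for field in EXPECTED_FIELDS:
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {field}")
    return {field: data[field] for field in EXPECTED_FIELDS}
```

A reply that fails validation can be rejected or retried right away, which is harder to do reliably when the fields have to be extracted from unstructured text.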

It supports “en” and “ko” languages.

Example

  • In which Mexican state can one find the Ciudad Deportiva, home to the Tecolotes de Nuevo Laredo?

  • Which group has more members, New Jeans or Aespa?

  • What is the name of the first album released by the band that performed at the 2022 Super Bowl halftime show?

Usage

  • OpenAI

from openai import AsyncOpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.openai_gen_query import two_hop_incremental

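# AsyncOpenAI() reads the API key from the OPENAI_API_KEY environment variable.
openai_client = AsyncOpenAI()
# qa_df must already contain the qid and retrieval_gt columns.
qa = QA(qa_df)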
result_qa = qa.batch_apply(two_hop_incremental, client=openai_client)

  • LlamaIndex

from llama_index.llms.openai import OpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.llama_gen_query import two_hop_incremental

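# Swap in another llama_index LLM here to use a different model.
llm = OpenAI()
# qa_df must already contain the qid and retrieval_gt columns.
qa = QA(qa_df)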
result_qa = qa.batch_apply(two_hop_incremental, llm=llm)