Query Generation

In this document, we will cover how to generate questions for the QA dataset.

Overview

You can use the batch_apply function on a QA instance to generate questions. Before generating questions, the QA instance must have the qid and retrieval_gt columns. You can get those by using sample on a Corpus instance.
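For example, you can build a QA instance with those columns by sampling from a Corpus instance. The sketch below is illustrative only: it assumes the random_single_hop sampler from autorag.data.qa.sample and a corpus parquet file path of your own, so adjust both to your project.

import pandas as pd
from autorag.data.qa.schema import Corpus
from autorag.data.qa.sample import random_single_hop  # assumed sampler; check your AutoRAG version

# Load the corpus DataFrame (path is illustrative)
corpus_df = pd.read_parquet("./corpus.parquet")
corpus = Corpus(corpus_df)

# Sample passages; the resulting QA instance carries the qid and retrieval_gt columns
qa = corpus.sample(random_single_hop, n=50)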

Attention

In the OpenAI version of data creation, you can use only ‘gpt-4o-2024-08-06’ and ‘gpt-4o-mini-2024-07-18’. If you want to use another model, use the llama_index version instead.

Question types

  1. Factoid

  2. Concept Completion

  3. Two-hop Incremental

  4. Custom

1. Factoid

Factoid questions are those seeking brief, factual information that can be easily verified. They typically require a yes or no answer or a brief explanation and often inquire about specific details such as dates, names, places, or events.

It supports the “en”, “ko”, or “ja” languages.

Factoid Example

  • What is the capital of France?

  • Who invented the light bulb?

  • When was Wikipedia founded?

Usage

  • OpenAI

from openai import AsyncOpenAI
from autorag.data.qa.schema import QA
from autorag.data.qa.query.openai_gen_query import factoid_query_gen

qa = QA(qa_df)  # qa_df: pandas DataFrame with the qid and retrieval_gt columns
result_qa = qa.batch_apply(factoid_query_gen, client=AsyncOpenAI(), lang="ko")
  • LlamaIndex

from llama_index.llms.openai import OpenAI
from autorag.data.qa.schema import QA
from autorag.data.qa.query.llama_gen_query import factoid_query_gen

llm = OpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(factoid_query_gen, llm=llm, lang="ko")
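batch_apply returns a new QA instance, so the original is left untouched. Assuming the generated questions land in a query column of the underlying DataFrame (exposed as result_qa.data in recent AutoRAG versions), you can inspect them like this:

# Peek at the generated questions; the "query" column name is assumed
print(result_qa.data["query"].head())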

2. Concept Completion

A “concept completion” question asks directly about the essence or identity of a concept, for example, “What is photosynthesis?”

It supports the “en”, “ko”, or “ja” languages.

Usage

  • OpenAI

from openai import AsyncOpenAI
from autorag.data.qa.schema import QA
from autorag.data.qa.query.openai_gen_query import concept_completion_query_gen

qa = QA(qa_df)
result_qa = qa.batch_apply(concept_completion_query_gen, client=AsyncOpenAI(), lang="ko")
  • LlamaIndex

from llama_index.llms.openai import OpenAI
from autorag.data.qa.schema import QA
from autorag.data.qa.query.llama_gen_query import concept_completion_query_gen

llm = OpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(concept_completion_query_gen, llm=llm, lang="ko")

3. Two-hop Incremental

This query generation method comes from this paper. To make a robust multi-hop question, it first selects what the answer will be. Then it generates a single-hop question from the first document. After that, it evolves that question with the second document into a multi-hop question.
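Since the question must be grounded in two documents, each row’s retrieval_gt should reference two passages before you apply this generator. A minimal sketch of the expected shape is below; the doc ids are hypothetical and the exact nested-list grouping of retrieval_gt is assumed, so verify it against your own dataset.

import pandas as pd

# Hypothetical row: two source passages per question (doc ids are illustrative,
# and the nested-list grouping of retrieval_gt is assumed)
qa_df = pd.DataFrame({
    "qid": ["q-0001"],
    "retrieval_gt": [[["doc-1"], ["doc-2"]]],
})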

We recommend the OpenAI version because its results are more stable; it uses structured output.

You can use the “en”, “ko”, or “ja” language.

Example

  • In which Mexican state can one find the Ciudad Deportiva, home to the Tecolotes de Nuevo Laredo?

  • Which group has more members, New Jeans or Aespa?

  • What is the name of the first album released by the band that performed at the 2022 Super Bowl halftime show?

Usage

  • OpenAI

from openai import AsyncOpenAI
from autorag.data.qa.schema import QA
from autorag.data.qa.query.openai_gen_query import two_hop_incremental

qa = QA(qa_df)
result_qa = qa.batch_apply(two_hop_incremental, client=AsyncOpenAI())
  • LlamaIndex

from llama_index.llms.openai import OpenAI
from autorag.data.qa.schema import QA
from autorag.data.qa.query.llama_gen_query import two_hop_incremental

llm = OpenAI()
qa = QA(qa_df)
result_qa = qa.batch_apply(two_hop_incremental, llm=llm)

4. Custom

You can generate questions with custom prompts. By creating a list of llama_index ChatMessage objects, you can generate questions in ways other than the ones AutoRAG provides by default, or in languages other than “en”, “ko”, and “ja”.

Usage

  • LlamaIndex

from autorag.data.qa.schema import QA
from autorag.data.qa.query.llama_gen_query import custom_query_gen
from llama_index.core.base.llms.types import ChatMessage, MessageRole
from llama_index.llms.openai import OpenAI

messages = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content="""As an expert AI assistant, create one question that can be directly answered from the provided **Text**.
Ensure question:

- Is clearly based on information within the **Text** and avoids assumptions or external context.
- Excludes any introductory phrases, such as "In the given text...", "Based on the context...", "Here are some questions baesd on context...".
- should include question sentence only.
- Is concise, relevant, and encourages specific, direct answers.
- has to be same as system language.
""",
    )
]
llm = OpenAI()
qa = QA(qa_df)  # qa_df: pandas DataFrame with the qid and retrieval_gt columns
result_qa = qa.batch_apply(custom_query_gen, llm=llm, messages=messages)
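Once generation finishes, you can persist the dataset. This assumes the QA schema exposes to_parquet(qa_path, corpus_path), as used elsewhere in AutoRAG’s data-creation docs; the paths are illustrative.

# Save the QA dataset together with its linked corpus
result_qa.to_parquet("./qa.parquet", "./corpus.parquet")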