# Query Generation

In this document, we cover how to generate questions for the QA dataset.

## Overview

You can use the `batch_apply` function of a `QA` instance to generate questions.
Before generating questions, the `QA` instance must have the `qid` and `retrieval_gt` columns.
You can get these columns by calling `sample` on a `Corpus` instance.

```{attention}
In the OpenAI version of data creation, you can use only 'gpt-4o-2024-08-06' and 'gpt-4o-mini-2024-07-18'.
If you want to use another model, use the llama_index version instead.
```

# Question types

1. [Factoid](#1-factoid)
2. [Concept Completion](#2-concept-completion)
3. [Two-hop Incremental](#3-two-hop-incremental)

## 1. Factoid

Factoid questions seek brief, factual information that can be easily verified.
They typically require a yes or no answer or a brief explanation, and often ask about specific details such as dates, names, places, or events.

Factoid generation supports the "en" and "ko" languages.

### Factoid Example

- What is the capital of France?
- Who invented the light bulb?
- When was Wikipedia founded?

### Usage

- OpenAI

```python
from openai import AsyncOpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.openai_gen_query import factoid_query_gen

openai_client = AsyncOpenAI()
# qa_df must already contain the qid and retrieval_gt columns
qa = QA(qa_df)
result_qa = qa.batch_apply(factoid_query_gen, client=openai_client, lang="ko")
```

- LlamaIndex

```python
from llama_index.llms.openai import OpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.llama_gen_query import factoid_query_gen

llm = OpenAI()
# qa_df must already contain the qid and retrieval_gt columns
qa = QA(qa_df)
result_qa = qa.batch_apply(factoid_query_gen, llm=llm, lang="ko")
```

## 2. Concept Completion

A “concept completion” question asks directly about the essence or identity of a concept.

Concept completion generation supports the "en" and "ko" languages.

### Usage

- OpenAI

```python
from openai import AsyncOpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.openai_gen_query import concept_completion_query_gen

openai_client = AsyncOpenAI()
# qa_df must already contain the qid and retrieval_gt columns
qa = QA(qa_df)
result_qa = qa.batch_apply(concept_completion_query_gen, client=openai_client, lang="ko")
```

- LlamaIndex

```python
from llama_index.llms.openai import OpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.llama_gen_query import concept_completion_query_gen

llm = OpenAI()
# qa_df must already contain the qid and retrieval_gt columns
qa = QA(qa_df)
result_qa = qa.batch_apply(concept_completion_query_gen, llm=llm, lang="ko")
```

## 3. Two-hop Incremental

This query generation method comes from [this paper](https://arxiv.org/pdf/2404.00571).
To build a robust multi-hop question, it first selects what the answer will be.
Then it generates a question from the first document.
After that, it uses the second document to evolve that question into a multi-hop question.

We recommend using the `openai` version because its results are more stable.
It uses structured output.

Two-hop incremental generation supports the "en" and "ko" languages.

### Example

- In which Mexican state can one find the Ciudad Deportiva, home to the Tecolotes de Nuevo Laredo?
- Which group has more members, New Jeans or Aespa?
- What is the name of the first album released by the band that performed at the 2022 Super Bowl halftime show?

### Usage

- OpenAI

```python
from openai import AsyncOpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.openai_gen_query import two_hop_incremental

openai_client = AsyncOpenAI()
# qa_df must already contain the qid and retrieval_gt columns
qa = QA(qa_df)
result_qa = qa.batch_apply(two_hop_incremental, client=openai_client)
```

- LlamaIndex

```python
from llama_index.llms.openai import OpenAI
from autorag.data.beta.schema import QA
from autorag.data.beta.query.llama_gen_query import two_hop_incremental

llm = OpenAI()
# qa_df must already contain the qid and retrieval_gt columns
qa = QA(qa_df)
result_qa = qa.batch_apply(two_hop_incremental, llm=llm)
```
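
### Sketch of the two-hop incremental flow

The three steps described for two-hop incremental generation (select the answer, write a one-hop question from the first document, evolve it with the second document) can be illustrated with a minimal sketch. This is not AutoRAG's implementation: `two_hop_sketch`, `canned_llm`, the prompt wording, and the toy documents are all hypothetical, and the sketch assumes the answer is picked from the first document. `canned_llm` stands in for a real model so the example runs without any API key.

```python
from typing import Callable, Dict


def two_hop_sketch(doc1: str, doc2: str, ask: Callable[[str], str]) -> Dict[str, str]:
    """Run three prompts in sequence; `ask` is a hypothetical prompt -> text callable."""
    # Step 1: decide the answer before writing any question.
    # Assumption: the answer is picked from the first document.
    answer = ask(f"STEP 1: pick a short answer from this document.\n{doc1}")
    # Step 2: write a one-hop question that the first document answers.
    one_hop = ask(
        f"STEP 2: write a question whose answer is '{answer}'.\nDocument: {doc1}"
    )
    # Step 3: evolve the one-hop question so that it also needs the second document.
    two_hop = ask(
        "STEP 3: rewrite the question so that it requires this document too.\n"
        f"Question: {one_hop}\nDocument: {doc2}"
    )
    return {
        "answer": answer,
        "one_hop_question": one_hop,
        "two_hop_question": two_hop,
    }


def canned_llm(prompt: str) -> str:
    """Toy stand-in for a real model: returns a fixed reply for each step."""
    replies = {
        "STEP 1": "Tamaulipas",
        "STEP 2": "In which Mexican state is Nuevo Laredo located?",
        "STEP 3": "In which Mexican state can one find the Ciudad Deportiva, "
        "home to the Tecolotes de Nuevo Laredo?",
    }
    # The first six characters of each prompt identify the step.
    return replies[prompt[:6]]


result = two_hop_sketch(
    "Nuevo Laredo is a city in the Mexican state of Tamaulipas.",
    "The Ciudad Deportiva in Nuevo Laredo is home to the Tecolotes de Nuevo Laredo.",
    canned_llm,
)
print(result["two_hop_question"])
```

Fixing the answer first means both questions are written toward a known target, so the final question can only be answered by combining the two documents.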