Filtering

After generating a QA dataset, you may want to filter some of the generation results. Because LLMs are not perfect and make mistakes while generating datasets, it is a good idea to use filtering methods to remove bad results.

The supported filtering methods are below.

  1. Rule-based Don’t know Filter

  2. LLM-based Don’t know Filter

  3. LLM-based Passage Dependent Filter

1. Unanswerable Question Filtering

Sometimes the LLM generates questions that cannot be answered from the given passage. If such unintended unanswerable questions remain, retrieval optimization performance will suffer. So, it is good to filter out unanswerable questions after generating the QA dataset.

Don’t know Filter

In the Don’t know filter, we use generation_gt to classify whether a question is unanswerable. If the question is unanswerable, its generation_gt will be ‘Don’t know.’
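
For illustration, here is a minimal sketch of what a QA dataframe with a ‘Don’t know’ row might look like. The column names below are an assumption based on the AutoRAG QA schema; check the generation docs for the exact layout.

import pandas as pd

# Toy qa_df sketch (assumed columns; generation_gt holds a list of
# ground-truth answers per question).
qa_df = pd.DataFrame({
    "qid": ["q1", "q2"],
    "query": [
        "What is the capital of France?",
        "What color was the author's first car?",  # not answerable from the passage
    ],
    "generation_gt": [["Paris"], ["Don't know"]],  # this row gets filtered out
})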

Rule-based Don’t know Filter

You can use the rule-based Don’t know filter to remove unanswerable questions. It simply matches generation_gt against a pre-made list of don’t-know sentences and filters out the matching questions.

This is not perfect, but it is a simple and fast way to filter unanswerable questions.
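
Conceptually, the rule-based filter drops a row when one of its generation_gt answers matches a pre-made don’t-know sentence. Here is a rough sketch of the idea; the real phrase list and matching logic live in autorag.data.qa.filter.dontknow, so treat the phrases below as assumptions.

# Rough sketch of the rule-based idea, not the library's actual implementation.
DONT_KNOW_PHRASES = ["don't know", "do not know"]  # assumed example phrases

def is_answerable(generation_gt: list[str]) -> bool:
    # Keep a question only if none of its ground-truth answers
    # is a "don't know" sentence.
    return not any(
        phrase in answer.lower()
        for answer in generation_gt
        for phrase in DONT_KNOW_PHRASES
    )

The actual filter is applied like this: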

from autorag.data.qa.schema import QA
from autorag.data.qa.filter.dontknow import dontknow_filter_rule_based

qa = QA(qa_df, corpus)  # qa_df: generated QA dataframe, corpus: the corpus it came from
filtered_qa = qa.filter(dontknow_filter_rule_based, lang="en").map(
    lambda df: df.reset_index(drop=True)  # reset index
)

You can set lang to "en", "ko", or "ja".

LLM-based Don’t know Filter

We can use an LLM to filter unanswerable questions. It can classify even vague questions as unanswerable. However, since it calls the LLM, it is much slower and more expensive than the rule-based Don’t know filter.

  • OpenAI

from openai import AsyncOpenAI
from autorag.data.qa.schema import QA
from autorag.data.qa.filter.dontknow import dontknow_filter_openai

qa = QA(qa_df, corpus)
filtered_qa = qa.batch_filter(dontknow_filter_openai, client=AsyncOpenAI(), lang="en").map(
    lambda df: df.reset_index(drop=True)  # reset index
)
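
The AsyncOpenAI client reads the OPENAI_API_KEY environment variable by default, so set it before running this snippet.
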
  • LlamaIndex

from llama_index.llms.ollama import Ollama
from autorag.data.qa.schema import QA
from autorag.data.qa.filter.dontknow import dontknow_filter_llama_index

llm = Ollama(model="llama3")
qa = QA(qa_df, corpus)
filtered_qa = qa.batch_filter(dontknow_filter_llama_index, llm=llm, lang="en").map(
    lambda df: df.reset_index(drop=True)  # reset index
)
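
To run this example, an Ollama server must be running locally with the llama3 model already pulled (for example, via ollama pull llama3).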

2. Passage Dependent Filtering

Passage-dependent questions are those whose answers vary depending on which passage or context is selected, for example, “What is the main idea of the given passage?” Even with the greatest retrieval system, the system cannot find the exact passage for a passage-dependent question.

Since it is almost impossible to assign a ground-truth passage to a passage-dependent question, keeping such questions decreases the discriminative power of the evaluation dataset.

So, it is good to filter out passage-dependent questions after generating the QA dataset. An LLM is used as the filtering model.

  • OpenAI

from openai import AsyncOpenAI
from autorag.data.qa.schema import QA
from autorag.data.qa.filter.passage_dependency import passage_dependency_filter_openai

en_qa = QA(en_qa_df)
result_en_qa = en_qa.batch_filter(
    passage_dependency_filter_openai, client=AsyncOpenAI(), lang="en"
).map(lambda df: df.reset_index(drop=True))

  • LlamaIndex

from autorag.data.qa.schema import QA
from llama_index.llms.openai import OpenAI
from autorag.data.qa.filter.passage_dependency import passage_dependency_filter_llama_index

llm = OpenAI(temperature=0, model="gpt-4o-mini")
en_qa = QA(en_qa_df)
result_en_qa = en_qa.batch_filter(
    passage_dependency_filter_llama_index, llm=llm, lang="en"
).map(lambda df: df.reset_index(drop=True))
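
After filtering, you can persist the result like any other QA dataset. A minimal sketch, assuming the QA schema’s to_parquet method and placeholder file paths:

# Save the filtered QA dataset and its linked corpus for evaluation.
# Paths are placeholders; filtered_qa is the result of the Don't know
# filter example above, which was constructed with a corpus.
filtered_qa.to_parquet("./qa.parquet", "./corpus.parquet")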