autorag.nodes.passagefilter package

Submodules

autorag.nodes.passagefilter.base module

autorag.nodes.passagefilter.base.passage_filter_node(func)

autorag.nodes.passagefilter.pass_passage_filter module

autorag.nodes.passagefilter.pass_passage_filter.pass_passage_filter(queries: List[str], contents_list: List[List[str]], scores_list: List[List[float]], ids_list: List[List[str]])

Do not perform filtering. Return given passages, scores, and ids as is.

autorag.nodes.passagefilter.percentile_cutoff module

autorag.nodes.passagefilter.percentile_cutoff.percentile_cutoff(queries: List[str], contents_list: List[List[str]], scores_list: List[List[float]], ids_list: List[List[str]], percentile: float, reverse: bool = False) → Tuple[List[List[str]], List[List[str]], List[List[float]]]

Filter out the contents ranked below the cutoff position, which is the number of contents times percentile. This is a filter and does not override scores. If the number of contents times percentile is less than 1, keep only the single best-scoring content.

Parameters:
  • queries – The list of queries to use for filtering

  • contents_list – The list of lists of contents to filter

  • scores_list – The list of lists of scores retrieved

  • ids_list – The list of lists of ids retrieved

  • percentile – The percentile to cut off

  • reverse – If True, the lower the score, the better. Default is False.

Returns:

Tuple of lists containing the filtered contents, ids, and scores
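
The cutoff rule described above can be sketched for a single query as follows. This is a minimal illustration, not the AutoRAG implementation: the name percentile_cutoff_sketch is hypothetical, and truncating the kept count with int() and returning the results in best-first order are both assumptions.

```python
from typing import List, Tuple


def percentile_cutoff_sketch(
    contents: List[str],
    ids: List[str],
    scores: List[float],
    percentile: float,
    reverse: bool = False,
) -> Tuple[List[str], List[str], List[float]]:
    """Hypothetical single-query illustration; not the AutoRAG implementation."""
    # Number of passages to keep: length times percentile, never fewer than one.
    # Truncating with int() is an assumption about the rounding rule.
    keep = max(1, int(len(scores) * percentile))
    # Rank indices best-first: highest score first unless reverse=True.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=not reverse)
    top = order[:keep]
    # Scores are passed through unchanged; only the selection changes.
    return [contents[i] for i in top], [ids[i] for i in top], [scores[i] for i in top]
```

With four passages and percentile=0.5, the sketch keeps the two best-scoring passages; with percentile=0.1 the product is below 1, so exactly one passage survives.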

autorag.nodes.passagefilter.recency module

autorag.nodes.passagefilter.recency.recency_filter(contents_list: List[List[str]], scores_list: List[List[float]], ids_list: List[List[str]], time_list: List[List[datetime]], threshold: datetime | date) → Tuple[List[List[str]], List[List[str]], List[List[float]]]

Filter out the contents that are older than the threshold datetime. If all contents are filtered out, keep only the single most recent content. If the threshold date format is incorrect, return the original contents.

Parameters:
  • contents_list – The list of lists of contents to filter

  • scores_list – The list of lists of scores retrieved

  • ids_list – The list of lists of ids retrieved

  • time_list – The list of lists of datetime retrieved

  • threshold – The threshold to cut off

Returns:

Tuple of lists containing the filtered contents, ids, and scores
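
The recency rule can be sketched for a single query as follows. This is a minimal illustration under stated assumptions, not the AutoRAG implementation: the name recency_filter_sketch is hypothetical, a plain date threshold is assumed to mean midnight of that day, and a content dated exactly at the threshold is assumed to be kept.

```python
from datetime import date, datetime
from typing import List, Tuple, Union


def recency_filter_sketch(
    contents: List[str],
    ids: List[str],
    scores: List[float],
    times: List[datetime],
    threshold: Union[datetime, date],
) -> Tuple[List[str], List[str], List[float]]:
    """Hypothetical single-query illustration; not the AutoRAG implementation."""
    # datetime is a subclass of date, so exclude it explicitly before promoting
    # a plain date to midnight (assumption) for comparison with datetimes.
    if isinstance(threshold, date) and not isinstance(threshold, datetime):
        threshold = datetime.combine(threshold, datetime.min.time())
    # Keep contents dated at or after the threshold (inclusive bound is an assumption).
    keep = [i for i, t in enumerate(times) if t >= threshold]
    if not keep and times:
        # Everything was filtered out: fall back to the single most recent content.
        keep = [max(range(len(times)), key=lambda i: times[i])]
    return [contents[i] for i in keep], [ids[i] for i in keep], [scores[i] for i in keep]
```

The fallback branch mirrors the guarantee that at least one content is returned per query whenever the input is non-empty.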

autorag.nodes.passagefilter.run module

autorag.nodes.passagefilter.run.run_passage_filter_node(modules: List[Callable], module_params: List[Dict], previous_result: DataFrame, node_line_dir: str, strategies: Dict) → DataFrame

Run evaluation and select the best module among passage filter node results.

Parameters:
  • modules – Passage filter modules to run.

  • module_params – Passage filter module parameters.

  • previous_result – Previous result dataframe. It can be the result of a retrieval, reranker, or passage filter module, so it must contain the ‘query’, ‘retrieved_contents’, ‘retrieved_ids’, and ‘retrieve_scores’ columns.

  • node_line_dir – This node line’s directory.

  • strategies – Strategies for the passage filter node. This node uses the ‘retrieval_f1’, ‘retrieval_recall’, and ‘retrieval_precision’ metrics. You can skip evaluation when you use only one module with a single set of module parameters.

Returns:

The best result dataframe with previous result columns.

autorag.nodes.passagefilter.similarity_percentile_cutoff module

autorag.nodes.passagefilter.similarity_percentile_cutoff.similarity_percentile_cutoff(queries: List[str], contents_list: List[List[str]], scores_list: List[List[float]], ids_list: List[List[str]], percentile: float, embedding_model: str | None = None, batch: int = 128) → Tuple[List[List[str]], List[List[str]], List[List[float]]]

Re-calculate each content’s similarity with the query and filter out the contents ranked below the cutoff position, which is the number of contents times percentile. This is a filter and does not override scores: the output scores are the original ones, not the recalculated query-content similarities. If the number of contents times percentile is less than 1, keep only the single content with the highest similarity.

Parameters:
  • queries – The list of queries to use for filtering

  • contents_list – The list of lists of contents to filter

  • scores_list – The list of lists of scores retrieved

  • ids_list – The list of lists of ids retrieved

  • percentile – The percentile to cut off

  • embedding_model – The embedding model to use for calculating similarity. Default is OpenAIEmbedding.

  • batch – The number of queries to be processed in a batch. Default is 128.

Returns:

Tuple of lists containing the filtered contents, ids, and scores

autorag.nodes.passagefilter.similarity_percentile_cutoff.similarity_percentile_cutoff_pure(query_embedding: str, content_embeddings: List[List[float]], content_list: List[str], ids_list: List[str], scores_list: List[float], percentile: float) → Tuple[List[str], List[str], List[float]]

Return a tuple of lists containing the filtered contents, ids, and scores.

Parameters:
  • query_embedding – Query embedding

  • content_embeddings – Each content embedding

  • content_list – Each content

  • ids_list – Each id

  • scores_list – Each score

  • percentile – The percentile to cut off

Returns:

Tuple of lists containing the filtered contents, ids, and scores
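
The per-query step can be sketched as follows. This is a minimal illustration, not the AutoRAG implementation: the names cosine_similarity and similarity_percentile_cutoff_sketch are hypothetical, cosine similarity is an assumed choice of similarity measure, the query embedding is treated as a list of floats, and truncating the kept count with int() is an assumption.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_percentile_cutoff_sketch(
    query_embedding: List[float],
    content_embeddings: List[List[float]],
    content_list: List[str],
    ids_list: List[str],
    scores_list: List[float],
    percentile: float,
) -> Tuple[List[str], List[str], List[float]]:
    """Hypothetical illustration; not the AutoRAG implementation."""
    # Keep length times percentile contents, but never fewer than one.
    keep = max(1, int(len(content_embeddings) * percentile))
    sims = [cosine_similarity(query_embedding, e) for e in content_embeddings]
    # Rank by recalculated similarity, most similar first.
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:keep]
    # Return the original scores, not the recalculated similarities.
    return (
        [content_list[i] for i in order],
        [ids_list[i] for i in order],
        [scores_list[i] for i in order],
    )
```

Note that the similarities only decide which contents survive; the returned scores are the ones that were passed in.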

autorag.nodes.passagefilter.similarity_threshold_cutoff module

autorag.nodes.passagefilter.similarity_threshold_cutoff.similarity_threshold_cutoff(queries: List[str], contents_list: List[List[str]], scores_list: List[List[float]], ids_list: List[List[str]], threshold: float, embedding_model: str | None = None, batch: int = 128) → Tuple[List[List[str]], List[List[str]], List[List[float]]]

Re-calculate each content’s similarity with the query and filter out the contents whose similarity is below the threshold. If all contents are filtered out, keep only the single content with the highest similarity. This is a filter and does not override scores: the output scores are the original ones, not the recalculated query-content similarities.

Parameters:
  • queries – The list of queries to use for filtering

  • contents_list – The list of lists of contents to filter

  • scores_list – The list of lists of scores retrieved

  • ids_list – The list of lists of ids retrieved

  • threshold – The threshold to cut off

  • embedding_model – The embedding model to use for calculating similarity. Default is OpenAIEmbedding.

  • batch – The number of queries to be processed in a batch. Default is 128.

Returns:

Tuple of lists containing the filtered contents, ids, and scores

autorag.nodes.passagefilter.similarity_threshold_cutoff.similarity_threshold_cutoff_pure(query_embedding: str, content_embeddings: List[List[float]], threshold: float) → List[int]

Return the indices of the contents to keep. Always return at least one index, even if no content passes the threshold.

Parameters:
  • query_embedding – Query embedding

  • content_embeddings – Each content embedding

  • threshold – The threshold to cut off

Returns:

Indices of the contents to keep.
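
The index selection can be sketched as follows. This is a minimal illustration, not the AutoRAG implementation: the names cosine_similarity and similarity_threshold_cutoff_sketch are hypothetical, cosine similarity is an assumed choice of similarity measure, the query embedding is treated as a list of floats, and a similarity exactly equal to the threshold is assumed to pass.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_threshold_cutoff_sketch(
    query_embedding: List[float],
    content_embeddings: List[List[float]],
    threshold: float,
) -> List[int]:
    """Hypothetical illustration; not the AutoRAG implementation."""
    sims = [cosine_similarity(query_embedding, e) for e in content_embeddings]
    # Indices whose similarity reaches the threshold (inclusive bound is an assumption).
    keep = [i for i, s in enumerate(sims) if s >= threshold]
    if not keep and sims:
        # Nothing passed: fall back to the single most similar content.
        keep = [max(range(len(sims)), key=lambda i: sims[i])]
    return keep
```

The caller can then use the returned indices to select the matching contents, ids, and original scores.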

autorag.nodes.passagefilter.threshold_cutoff module

autorag.nodes.passagefilter.threshold_cutoff.threshold_cutoff(queries: List[str], contents_list: List[List[str]], scores_list: List[List[float]], ids_list: List[List[str]], threshold: float, reverse: bool = False) → Tuple[List[List[str]], List[List[str]], List[List[float]]]

Filters the contents, scores, and ids based on a previous result’s score. Keeps at least one item per query if all scores are below the threshold.

Parameters:
  • queries – List of query strings (not used in the current implementation).

  • contents_list – List of content strings for each query.

  • scores_list – List of scores for each content.

  • ids_list – List of ids for each content.

  • threshold – The minimum score to keep an item.

  • reverse – If True, the lower the score, the better. Default is False.

Returns:

Filtered lists of contents, ids, and scores.

autorag.nodes.passagefilter.threshold_cutoff.threshold_cutoff_pure(scores_list: List[float], threshold: float, reverse: bool = False) → List[int]

Return the indices of the contents to keep. Always return at least one index, even if no content passes the threshold.

Parameters:
  • scores_list – Each score

  • threshold – The threshold to cut off

  • reverse – If True, the lower the score, the better. Default is False.

Returns:

Indices of the contents to keep.
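
The index selection on existing scores can be sketched as follows. This is a minimal illustration, not the AutoRAG implementation: the name threshold_cutoff_sketch is hypothetical, and a score exactly equal to the threshold is assumed to pass.

```python
from typing import List


def threshold_cutoff_sketch(
    scores: List[float], threshold: float, reverse: bool = False
) -> List[int]:
    """Hypothetical illustration; not the AutoRAG implementation."""
    if not scores:
        return []
    if reverse:
        # Lower is better: keep scores at or below the threshold.
        keep = [i for i, s in enumerate(scores) if s <= threshold]
        best = min(range(len(scores)), key=lambda i: scores[i])
    else:
        # Higher is better: keep scores at or above the threshold.
        keep = [i for i, s in enumerate(scores) if s >= threshold]
        best = max(range(len(scores)), key=lambda i: scores[i])
    # Guarantee at least one index by falling back to the best-scoring item.
    return keep or [best]


# Usage: apply the kept indices to the parallel lists of one query.
scores = [0.2, 0.8, 0.5]
contents = ["a", "b", "c"]
kept = threshold_cutoff_sketch(scores, threshold=0.5)
filtered_contents = [contents[i] for i in kept]
```

Returning indices rather than filtered lists lets the caller apply the same selection to contents, ids, and scores in one pass.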

Module contents