autorag.nodes.passagefilter package
Submodules
autorag.nodes.passagefilter.base module
autorag.nodes.passagefilter.pass_passage_filter module
autorag.nodes.passagefilter.percentile_cutoff module
- autorag.nodes.passagefilter.percentile_cutoff.percentile_cutoff(queries: List[str], contents_list: List[List[str]], scores_list: List[List[float]], ids_list: List[List[str]], percentile: float, reverse: bool = False) -> Tuple[List[List[str]], List[List[str]], List[List[float]]]
Keep only the best-scoring contents, where the number kept is the content list’s length times percentile; everything ranked below that cutoff is filtered out. This is a filter and does not override scores. If the length times percentile is less than 1, keep only the single best-scoring content.
- Parameters:
queries – The list of queries to use for filtering
contents_list – The list of lists of contents to filter
scores_list – The list of lists of scores retrieved
ids_list – The list of lists of ids retrieved
percentile – The percentile to cut off
reverse – If True, the lower the score, the better. Default is False.
- Returns:
Tuple of lists containing the filtered contents, ids, and scores
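The cutoff rule described above can be sketched for a single query as follows. This is a minimal illustration, not the library's implementation: the name `percentile_cutoff_sketch` is hypothetical, and it assumes the cutoff count is the list length times percentile truncated to an integer, with a minimum of one item.

```python
from typing import List, Tuple


def percentile_cutoff_sketch(
    contents: List[str],
    ids: List[str],
    scores: List[float],
    percentile: float,
    reverse: bool = False,
) -> Tuple[List[str], List[str], List[float]]:
    """Hypothetical single-query sketch of a percentile cutoff."""
    # Assumption: the number of items kept is len * percentile, truncated,
    # and never less than one.
    keep = max(1, int(len(scores) * percentile))
    # Rank best first: highest score by default, lowest score when reverse=True.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=not reverse)
    kept = order[:keep]
    # Scores are passed through unchanged; only the selection is filtered.
    return (
        [contents[i] for i in kept],
        [ids[i] for i in kept],
        [scores[i] for i in kept],
    )


contents = ["a", "b", "c", "d"]
ids = ["id-a", "id-b", "id-c", "id-d"]
scores = [0.1, 0.9, 0.5, 0.7]
print(percentile_cutoff_sketch(contents, ids, scores, percentile=0.5))
```

With four contents and a percentile of 0.5, two items remain. With a percentile of 0.1, the product is below 1, so only the best-scoring item remains.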
autorag.nodes.passagefilter.recency module
- autorag.nodes.passagefilter.recency.recency_filter(contents_list: List[List[str]], scores_list: List[List[float]], ids_list: List[List[str]], time_list: List[List[datetime]], threshold: datetime | date) -> Tuple[List[List[str]], List[List[str]], List[List[float]]]
Filter out the contents that are older than the threshold datetime. If all contents are filtered out, keep only the single most recent content. If the threshold date format is incorrect, return the original contents.
- Parameters:
contents_list – The list of lists of contents to filter
scores_list – The list of lists of scores retrieved
ids_list – The list of lists of ids retrieved
time_list – The list of lists of datetime retrieved
threshold – The threshold to cut off
- Returns:
Tuple of lists containing the filtered contents, ids, and scores
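The recency rule above can be sketched for a single query as follows. This is a minimal illustration under stated assumptions, not the library's code: `recency_filter_sketch` is a hypothetical name, a content whose timestamp equals the threshold is assumed to be kept, and both the timestamps and the threshold are assumed to be `datetime` objects in a non-empty list.

```python
from datetime import datetime
from typing import List, Tuple


def recency_filter_sketch(
    contents: List[str],
    ids: List[str],
    scores: List[float],
    times: List[datetime],
    threshold: datetime,
) -> Tuple[List[str], List[str], List[float]]:
    """Hypothetical single-query sketch of a recency filter."""
    # Assumption: a timestamp equal to the threshold is kept.
    kept = [i for i, t in enumerate(times) if t >= threshold]
    if not kept:
        # Everything was older than the threshold: fall back to the most recent item.
        kept = [max(range(len(times)), key=lambda i: times[i])]
    return (
        [contents[i] for i in kept],
        [ids[i] for i in kept],
        [scores[i] for i in kept],
    )


times = [datetime(2020, 1, 1), datetime(2023, 6, 1), datetime(2022, 3, 1)]
result = recency_filter_sketch(
    ["old", "new", "mid"], ["1", "2", "3"], [0.9, 0.4, 0.6],
    times, threshold=datetime(2022, 1, 1),
)
print(result)
```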
autorag.nodes.passagefilter.run module
- autorag.nodes.passagefilter.run.run_passage_filter_node(modules: List[Callable], module_params: List[Dict], previous_result: DataFrame, node_line_dir: str, strategies: Dict) -> DataFrame
Run evaluation and select the best module among passage filter node results.
- Parameters:
modules – Passage filter modules to run.
module_params – Passage filter module parameters.
previous_result – Previous result dataframe. It can be the result of a retrieval, reranker, or passage filter module, so it must contain the ‘query’, ‘retrieved_contents’, ‘retrieved_ids’, and ‘retrieve_scores’ columns.
node_line_dir – This node line’s directory.
strategies – Strategies for the passage filter node. This node uses the ‘retrieval_f1’, ‘retrieval_recall’ and ‘retrieval_precision’ metrics. Evaluation can be skipped when only one module with a single parameter set is used.
- Returns:
The best result dataframe with previous result columns.
autorag.nodes.passagefilter.similarity_percentile_cutoff module
- autorag.nodes.passagefilter.similarity_percentile_cutoff.similarity_percentile_cutoff(queries: List[str], contents_list: List[List[str]], scores_list: List[List[float]], ids_list: List[List[str]], percentile: float, embedding_model: str | None = None, batch: int = 128) -> Tuple[List[List[str]], List[List[str]], List[List[float]]]
Re-calculate each content’s similarity with the query and keep only the most similar contents, where the number kept is the content list’s length times percentile. This is a filter and does not override scores: the returned scores are the original ones, not the re-calculated query-content similarities. If the length times percentile is less than 1, keep only the single highest-similarity content.
- Parameters:
queries – The list of queries to use for filtering
contents_list – The list of lists of contents to filter
scores_list – The list of lists of scores retrieved
ids_list – The list of lists of ids retrieved
percentile – The percentile to cut off
embedding_model – The embedding model to use for calculating similarity. Default is OpenAIEmbedding.
batch – The number of queries to be processed in a batch. Default is 128.
- Returns:
Tuple of lists containing the filtered contents, ids, and scores
- autorag.nodes.passagefilter.similarity_percentile_cutoff.similarity_percentile_cutoff_pure(query_embedding: str, content_embeddings: List[List[float]], content_list: List[str], ids_list: List[str], scores_list: List[float], percentile: float) -> Tuple[List[str], List[str], List[float]]
Return tuple of lists containing the filtered contents, ids, and scores
- Parameters:
query_embedding – Query embedding
content_embeddings – Each content embedding
content_list – Each content
ids_list – Each id
scores_list – Each score
percentile – The percentile to cut off
- Returns:
Tuple of lists containing the filtered contents, ids, and scores
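The distinction between the re-calculated similarities (used only for selection) and the returned scores (passed through unchanged) can be sketched as follows. This is a hypothetical illustration, not the library's code: `similarity_percentile_sketch` takes already-computed similarity values instead of embeddings, and it assumes the cutoff count is the list length times percentile, truncated, with a minimum of one.

```python
from typing import List, Tuple


def similarity_percentile_sketch(
    similarities: List[float],
    contents: List[str],
    ids: List[str],
    scores: List[float],
    percentile: float,
) -> Tuple[List[str], List[str], List[float]]:
    """Hypothetical sketch: select by similarity, return the original scores."""
    # Assumption: keep len * percentile items, truncated, at least one.
    keep = max(1, int(len(contents) * percentile))
    # Rank by the re-calculated similarity, most similar first.
    order = sorted(
        range(len(similarities)), key=lambda i: similarities[i], reverse=True
    )[:keep]
    # The returned scores come from the previous node, not from `similarities`.
    return (
        [contents[i] for i in order],
        [ids[i] for i in order],
        [scores[i] for i in order],
    )


print(
    similarity_percentile_sketch(
        similarities=[0.2, 0.9, 0.6],
        contents=["a", "b", "c"],
        ids=["1", "2", "3"],
        scores=[10.0, 1.0, 5.0],
        percentile=0.67,
    )
)
```

In this example, "a" has the highest original score but the lowest similarity, so it is the one that is dropped, while the scores of the kept items are left as they were.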
autorag.nodes.passagefilter.similarity_threshold_cutoff module
- autorag.nodes.passagefilter.similarity_threshold_cutoff.similarity_threshold_cutoff(queries: List[str], contents_list: List[List[str]], scores_list: List[List[float]], ids_list: List[List[str]], threshold: float, embedding_model: str | None = None, batch: int = 128) -> Tuple[List[List[str]], List[List[str]], List[List[float]]]
Re-calculate each content’s similarity with the query and filter out the contents whose similarity is below the threshold. If all contents are filtered out, keep only the single highest-similarity content. This is a filter and does not override scores: the returned scores are the original ones, not the re-calculated query-content similarities.
- Parameters:
queries – The list of queries to use for filtering
contents_list – The list of lists of contents to filter
scores_list – The list of lists of scores retrieved
ids_list – The list of lists of ids retrieved
threshold – The threshold to cut off
embedding_model – The embedding model to use for calculating similarity. Default is OpenAIEmbedding.
batch – The number of queries to be processed in a batch. Default is 128.
- Returns:
Tuple of lists containing the filtered contents, ids, and scores
- autorag.nodes.passagefilter.similarity_threshold_cutoff.similarity_threshold_cutoff_pure(query_embedding: str, content_embeddings: List[List[float]], threshold: float) -> List[int]
Return the indices of the contents to keep. If no content passes the threshold, return at least one index.
- Parameters:
query_embedding – Query embedding
content_embeddings – Each content embedding
threshold – The threshold to cut off
- Returns:
Indices of the contents to keep
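The index selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: the name `similarity_threshold_indices` is hypothetical, the query embedding is assumed to be a vector of floats, cosine similarity is assumed as the similarity measure, and a similarity equal to the threshold is assumed to pass.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_threshold_indices(
    query_embedding: List[float],
    content_embeddings: List[List[float]],
    threshold: float,
) -> List[int]:
    """Hypothetical sketch: indices whose similarity meets the threshold."""
    sims = [cosine_similarity(query_embedding, e) for e in content_embeddings]
    kept = [i for i, s in enumerate(sims) if s >= threshold]
    if not kept:
        # Nothing passed: keep the single most similar content.
        kept = [max(range(len(sims)), key=lambda i: sims[i])]
    return kept


query = [1.0, 0.0]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(similarity_threshold_indices(query, embeddings, threshold=0.5))
```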
autorag.nodes.passagefilter.threshold_cutoff module
- autorag.nodes.passagefilter.threshold_cutoff.threshold_cutoff(queries: List[str], contents_list: List[List[str]], scores_list: List[List[float]], ids_list: List[List[str]], threshold: float, reverse: bool = False) -> Tuple[List[List[str]], List[List[str]], List[List[float]]]
Filters the contents, scores, and ids based on a previous result’s score. Keeps at least one item per query if all scores are below the threshold.
- Parameters:
queries – List of query strings (not used in the current implementation).
contents_list – List of content strings for each query.
scores_list – List of scores for each content.
ids_list – List of ids for each content.
threshold – The minimum score to keep an item.
reverse – If True, the lower the score, the better. Default is False.
- Returns:
Filtered lists of contents, ids, and scores.
- autorag.nodes.passagefilter.threshold_cutoff.threshold_cutoff_pure(scores_list: List[float], threshold: float, reverse: bool = False) -> List[int]
Return the indices of the contents to keep. If no score passes the threshold, return at least one index.
- Parameters:
scores_list – Each score
threshold – The threshold to cut off
reverse – If True, the lower the score, the better. Default is False.
- Returns:
Indices of the contents to keep
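The score-based index selection, including the effect of `reverse`, can be sketched as follows. This is a hypothetical illustration, not the library's code: `threshold_cutoff_indices` is an assumed name, a score equal to the threshold is assumed to pass, and the fallback is assumed to be the single best-scoring index.

```python
from typing import List


def threshold_cutoff_indices(
    scores: List[float], threshold: float, reverse: bool = False
) -> List[int]:
    """Hypothetical sketch: indices of scores that pass the threshold."""
    if reverse:
        # Lower is better: keep scores at or below the threshold.
        kept = [i for i, s in enumerate(scores) if s <= threshold]
        best = min(range(len(scores)), key=lambda i: scores[i])
    else:
        # Higher is better: keep scores at or above the threshold.
        kept = [i for i, s in enumerate(scores) if s >= threshold]
        best = max(range(len(scores)), key=lambda i: scores[i])
    # Keep at least one index so a query never ends up with no contents.
    return kept if kept else [best]


scores = [0.2, 0.8, 0.5]
print(threshold_cutoff_indices(scores, threshold=0.5))
print(threshold_cutoff_indices(scores, threshold=0.5, reverse=True))
```

The returned indices can then be used to slice the contents, ids, and scores of one query so that all three stay aligned.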