autorag.nodes.retrieval package
Submodules
autorag.nodes.retrieval.base module
- class autorag.nodes.retrieval.base.BaseRetrieval(project_dir: str, *args, **kwargs)
Bases: BaseModule
- class autorag.nodes.retrieval.base.HybridRetrieval(project_dir: str, target_modules, target_module_params, *args, **kwargs)
Bases: BaseRetrieval
autorag.nodes.retrieval.bm25 module
- class autorag.nodes.retrieval.bm25.BM25(project_dir: str, *args, **kwargs)
Bases: BaseRetrieval
- autorag.nodes.retrieval.bm25.bm25_ingest(corpus_path: str, corpus_data: DataFrame, bm25_tokenizer: str = 'porter_stemmer')
- async autorag.nodes.retrieval.bm25.bm25_pure(queries: List[str], top_k: int, tokenizer, bm25_api: BM25Okapi, bm25_corpus: Dict) -> Tuple[List[str], List[float]]
Async BM25 retrieval function. It is intended for asynchronous, row-by-row BM25 retrieval.
- Parameters:
queries – A list of query strings.
top_k – The number of passages to be retrieved.
tokenizer – A tokenizer that will be used to tokenize queries.
bm25_api – A bm25 api instance that will be used to retrieve passages.
bm25_corpus –
A dictionary containing the BM25 corpus: the doc_id values from the corpus and the tokenized corpus. Its data structure looks like this:
{
    "tokens": [],  # 2d list of tokens
    "passage_id": [],  # 2d list of passage_id. Type must be str.
}
- Returns:
A tuple containing the list of passage ids retrieved by BM25 and the list of their scores.
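The following is a minimal, self-contained sketch of what a BM25 top-k lookup over a corpus dictionary of this shape can look like. It is not the AutoRAG implementation, which relies on a BM25Okapi instance. The function name bm25_top_k, the k1 and b defaults, and the Lucene-style IDF are assumptions made for illustration, and passage_id is assumed here to be a flat list with one string id per tokenized passage.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bm25_top_k(
    query_tokens: List[str],
    bm25_corpus: Dict,
    top_k: int,
    k1: float = 1.5,  # assumed term-frequency saturation parameter
    b: float = 0.75,  # assumed length-normalization parameter
) -> Tuple[List[str], List[float]]:
    """Hypothetical helper: score every passage with BM25 and keep the best top_k."""
    docs = bm25_corpus["tokens"]
    ids = bm25_corpus["passage_id"]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: the number of passages that contain each token.
    df = Counter(token for doc in docs for token in set(doc))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for token in query_tokens:
            if token not in tf:
                continue
            # Lucene-style IDF, which is never negative (an assumption of this sketch).
            idf = math.log((n_docs - df[token] + 0.5) / (df[token] + 0.5) + 1.0)
            length_norm = 1.0 - b + b * len(doc) / avg_len
            score += idf * tf[token] * (k1 + 1.0) / (tf[token] + k1 * length_norm)
        scores.append(score)

    # Sort passage indices by descending score and keep the first top_k.
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)[:top_k]
    return [ids[i] for i in ranked], [scores[i] for i in ranked]
```

For example, with three tokenized passages and the single query token "cat", bm25_top_k returns the ids of the two passages that contain "cat", ordered by score, together with those scores.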
- autorag.nodes.retrieval.bm25.get_bm25_scores(queries: List[str], ids: List[str], tokenizer, bm25_api: BM25Okapi, bm25_corpus: Dict) -> List[float]
autorag.nodes.retrieval.hybrid_cc module
- class autorag.nodes.retrieval.hybrid_cc.HybridCC(project_dir: str, target_modules, target_module_params, *args, **kwargs)
Bases: HybridRetrieval
- autorag.nodes.retrieval.hybrid_cc.fuse_per_query(semantic_ids: List[str], lexical_ids: List[str], semantic_scores: List[float], lexical_scores: List[float], normalize_method: str, weight: float, top_k: int, semantic_theoretical_min_value: float, lexical_theoretical_min_value: float)
- autorag.nodes.retrieval.hybrid_cc.hybrid_cc(ids: Tuple, scores: Tuple, top_k: int, weight: float, normalize_method: str = 'mm', semantic_theoretical_min_value: float = -1.0, lexical_theoretical_min_value: float = 0.0) -> Tuple[List[List[str]], List[List[float]]]
Hybrid CC function. CC (convex combination) is a method for fusing lexical and semantic retrieval results. It first normalizes the scores of each retrieval result and then combines them with the given weight. Unlike other retrieval modules, it does not execute retrieval itself; it only fuses the results of other retrieval functions. You therefore have to run at least two retrieval modules before calling this function, collect the ids and scores from each module, and pass them to this function as tuples.
- Parameters:
ids – The tuple of ids that you want to fuse. Its length must be the same as the length of scores. The semantic retrieval ids must be the first element.
scores – The retrieval scores that you want to fuse. Its length must be the same as the length of ids. The semantic retrieval scores must be the first element.
top_k – The number of passages to be retrieved.
normalize_method –
The normalization method to use. AutoRAG supports the following normalization methods for hybrid CC:
mm: Min-max scaling
tmm: Theoretical min-max scaling
z: z-score normalization
dbsf: 3-sigma normalization
weight – The weight given to the semantic module; the lexical module receives the remainder. For example, if the weight is 1.0, the semantic module is weighted 1.0 and the lexical module 0.0.
semantic_theoretical_min_value – This value is used by the tmm normalization method. You can set the theoretical minimum value yourself. Default is -1.
lexical_theoretical_min_value – This value is used by the tmm normalization method. You can set the theoretical minimum value yourself. Default is 0.
- Returns:
The tuple of ids and fused scores produced by CC. In addition, the third element is the selected weight value.
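As an illustration of the fusion described above, here is a minimal sketch of a convex combination for a single query using min-max ("mm") normalization. It is not the AutoRAG implementation. The helper names min_max and cc_fuse_one_query are hypothetical, and two behaviours are assumptions of this sketch: a passage missing from one result list contributes 0.0 for that list, and ties are broken by passage id.

```python
from typing import List, Tuple


def min_max(scores: List[float]) -> List[float]:
    """Scale scores into [0, 1]; a constant list maps to all zeros (an assumption)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def cc_fuse_one_query(
    semantic_ids: List[str],
    lexical_ids: List[str],
    semantic_scores: List[float],
    lexical_scores: List[float],
    weight: float,
    top_k: int,
) -> Tuple[List[str], List[float]]:
    """Hypothetical helper: fuse one query's results with a convex combination."""
    semantic = dict(zip(semantic_ids, min_max(semantic_scores)))
    lexical = dict(zip(lexical_ids, min_max(lexical_scores)))
    fused = {
        # weight goes to the semantic score, (1 - weight) to the lexical score.
        pid: weight * semantic.get(pid, 0.0) + (1.0 - weight) * lexical.get(pid, 0.0)
        for pid in set(semantic) | set(lexical)
    }
    # Highest fused score first; ties are broken by passage id for determinism.
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return [pid for pid, _ in ranked], [score for _, score in ranked]
```

With weight set to 1.0 the fused ranking follows the semantic scores alone, and with 0.0 it follows the lexical scores alone, which matches the description of the weight parameter.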
- autorag.nodes.retrieval.hybrid_cc.normalize_dbsf(scores: List[str], fixed_min_value: float = 0)
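To make the four normalize_method options more concrete, the sketch below shows one common formulation of each. These are illustrative assumptions and may differ in detail from the AutoRAG functions: the *_sketch names are hypothetical, the population standard deviation is used, degenerate inputs map to zeros, and the fixed_min_value parameter of normalize_dbsf is not modelled.

```python
from statistics import mean, pstdev
from typing import List


def mm_sketch(scores: List[float]) -> List[float]:
    """Min-max scaling: map the observed minimum to 0 and the maximum to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def tmm_sketch(scores: List[float], theoretical_min: float) -> List[float]:
    """Theoretical min-max: use a fixed lower bound instead of the observed minimum."""
    hi = max(scores)
    if hi == theoretical_min:
        return [0.0 for _ in scores]
    return [(s - theoretical_min) / (hi - theoretical_min) for s in scores]


def z_sketch(scores: List[float]) -> List[float]:
    """z-score: subtract the mean and divide by the standard deviation."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def dbsf_sketch(scores: List[float]) -> List[float]:
    """3-sigma scaling: min-max with bounds at mean - 3*sigma and mean + 3*sigma."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    lo, hi = mu - 3 * sigma, mu + 3 * sigma
    return [(s - lo) / (hi - lo) for s in scores]
```

Under these formulations, the mean score maps to 0.0 with z_sketch and to 0.5 with dbsf_sketch, while mm_sketch and tmm_sketch both map the highest score to 1.0.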
autorag.nodes.retrieval.hybrid_rrf module
- class autorag.nodes.retrieval.hybrid_rrf.HybridRRF(project_dir: str, target_modules, target_module_params, *args, **kwargs)
Bases: HybridRetrieval
- autorag.nodes.retrieval.hybrid_rrf.hybrid_rrf(ids: Tuple, scores: Tuple, top_k: int, weight: int = 60, rrf_k: int = -1) -> Tuple[List[List[str]], List[List[float]]]
Hybrid RRF function. RRF (Reciprocal Rank Fusion) is a method for fusing multiple retrieval results. It is commonly used to fuse dense retrieval and sparse retrieval results. Unlike other retrieval modules, it does not execute retrieval itself; it only fuses the results of other retrieval functions. You therefore have to run at least two retrieval modules before calling this function, collect the ids and scores from each module, and pass them to this function as tuples.
- Parameters:
ids – The tuple of ids that you want to fuse. The length of this must be the same as the length of scores.
scores – The retrieval scores that you want to fuse. The length of this must be the same as the length of ids.
top_k – The number of passages to be retrieved.
weight – Hyperparameter for RRF. It was originally named rrf_k. Default is 60. For more information, please visit our documentation.
rrf_k – (Deprecated) Hyperparameter for RRF, now passed as weight. It will be removed in a future version.
- Returns:
The tuple of ids and fused scores that are fused by RRF.
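A minimal sketch of reciprocal rank fusion for a single query is shown below. It is not the AutoRAG implementation. The helper name rrf_fuse_one_query is hypothetical, the parameter k plays the role of the weight (formerly rrf_k) hyperparameter, and the sketch assumes that each input list is already ordered from best to worst and that ties are broken by passage id.

```python
from collections import defaultdict
from typing import List, Tuple


def rrf_fuse_one_query(
    ranked_id_lists: List[List[str]],
    top_k: int,
    k: int = 60,
) -> Tuple[List[str], List[float]]:
    """Hypothetical helper: fuse several ranked id lists with reciprocal rank fusion."""
    fused = defaultdict(float)
    for id_list in ranked_id_lists:
        # Ranks start at 1, so the best passage in a list contributes 1 / (k + 1).
        for rank, pid in enumerate(id_list, start=1):
            fused[pid] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by passage id for determinism.
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return [pid for pid, _ in ranked], [score for _, score in ranked]
```

Only ranks enter the formula, so the raw scores of the individual retrieval modules do not need to be normalized before fusion. A larger k flattens the difference between high and low ranks.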