autorag.nodes.retrieval package

Submodules

autorag.nodes.retrieval.base module

class autorag.nodes.retrieval.base.BaseRetrieval(project_dir: str, *args, **kwargs)[source]

Bases: BaseModule

cast_to_run(previous_result: DataFrame, *args, **kwargs)[source]

This function supports the cast function (i.e., the decorator) and applies only to the pure functions within this node.

class autorag.nodes.retrieval.base.HybridRetrieval(project_dir: str, target_modules, target_module_params, *args, **kwargs)[source]

Bases: BaseRetrieval

pure(previous_result: DataFrame, *args, **kwargs)[source]
autorag.nodes.retrieval.base.cast_queries(queries: str | List[str]) List[str][source]
autorag.nodes.retrieval.base.evenly_distribute_passages(ids: List[List[str]], scores: List[List[float]], top_k: int) Tuple[List[str], List[float]][source]
autorag.nodes.retrieval.base.get_bm25_pkl_name(bm25_tokenizer: str)[source]

autorag.nodes.retrieval.bm25 module

class autorag.nodes.retrieval.bm25.BM25(project_dir: str, *args, **kwargs)[source]

Bases: BaseRetrieval

pure(previous_result: DataFrame, *args, **kwargs)[source]
autorag.nodes.retrieval.bm25.bm25_ingest(corpus_path: str, corpus_data: DataFrame, bm25_tokenizer: str = 'porter_stemmer')[source]
async autorag.nodes.retrieval.bm25.bm25_pure(queries: List[str], top_k: int, tokenizer, bm25_api: BM25Okapi, bm25_corpus: Dict) Tuple[List[str], List[float]][source]

Async BM25 retrieval function, intended for asynchronous row-by-row BM25 retrieval.

Parameters:
  • queries – A list of query strings.

  • top_k – The number of passages to be retrieved.

  • tokenizer – A tokenizer that will be used to tokenize queries.

  • bm25_api – A bm25 api instance that will be used to retrieve passages.

  • bm25_corpus

    A dictionary containing the BM25 corpus: the doc_ids from the corpus and the tokenized passages. Its data structure looks like this:

    {
        "tokens": [], # 2d list of tokens
        "passage_id": [], # 2d list of passage_id. Type must be str.
    }
    

Returns:

A tuple containing the list of passage ids retrieved by BM25 and the list of their scores.
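
A minimal usage sketch for a single row. The passage texts and ids are hypothetical, and the index is a rank_bm25 BM25Okapi built over tokens produced by this module's own tokenizers:

import asyncio

from rank_bm25 import BM25Okapi

from autorag.nodes.retrieval.bm25 import (
    bm25_pure,
    select_bm25_tokenizer,
    tokenize_porter_stemmer,
)

# Hypothetical two-passage corpus: 2d token lists plus one passage id per passage.
passages = ["the quick brown fox", "lazy dogs sleep all day"]
tokens = tokenize_porter_stemmer(passages)
bm25_corpus = {"tokens": tokens, "passage_id": ["doc-1", "doc-2"]}
bm25_api = BM25Okapi(bm25_corpus["tokens"])

ids, scores = asyncio.run(
    bm25_pure(
        queries=["quick fox"],
        top_k=2,
        tokenizer=select_bm25_tokenizer("porter_stemmer"),
        bm25_api=bm25_api,
        bm25_corpus=bm25_corpus,
    )
)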

autorag.nodes.retrieval.bm25.get_bm25_scores(queries: List[str], ids: List[str], tokenizer, bm25_api: BM25Okapi, bm25_corpus: Dict) List[float][source]
autorag.nodes.retrieval.bm25.load_bm25_corpus(bm25_path: str) Dict[source]
autorag.nodes.retrieval.bm25.select_bm25_tokenizer(bm25_tokenizer: str) Callable[[str], List[int | str]][source]
autorag.nodes.retrieval.bm25.tokenize(queries: List[str], tokenizer) List[List[int]][source]
autorag.nodes.retrieval.bm25.tokenize_ja_sudachipy(texts: List[str]) List[List[str]][source]
autorag.nodes.retrieval.bm25.tokenize_ko_kiwi(texts: List[str]) List[List[str]][source]
autorag.nodes.retrieval.bm25.tokenize_ko_kkma(texts: List[str]) List[List[str]][source]
autorag.nodes.retrieval.bm25.tokenize_ko_okt(texts: List[str]) List[List[str]][source]
autorag.nodes.retrieval.bm25.tokenize_porter_stemmer(texts: List[str]) List[List[str]][source]
autorag.nodes.retrieval.bm25.tokenize_space(texts: List[str]) List[List[str]][source]

autorag.nodes.retrieval.hybrid_cc module

class autorag.nodes.retrieval.hybrid_cc.HybridCC(project_dir: str, target_modules, target_module_params, *args, **kwargs)[source]

Bases: HybridRetrieval

classmethod run_evaluator(project_dir: str | Path, previous_result: DataFrame, *args, **kwargs)[source]
autorag.nodes.retrieval.hybrid_cc.fuse_per_query(semantic_ids: List[str], lexical_ids: List[str], semantic_scores: List[float], lexical_scores: List[float], normalize_method: str, weight: float, top_k: int, semantic_theoretical_min_value: float, lexical_theoretical_min_value: float)[source]
autorag.nodes.retrieval.hybrid_cc.hybrid_cc(ids: Tuple, scores: Tuple, top_k: int, weight: float, normalize_method: str = 'mm', semantic_theoretical_min_value: float = -1.0, lexical_theoretical_min_value: float = 0.0) Tuple[List[List[str]], List[List[float]]][source]

Hybrid CC function. CC (convex combination) fuses lexical and semantic retrieval results by first normalizing the scores of each result and then combining them with the given weight. Unlike other retrieval modules, it does not execute retrieval itself; it only fuses the results of other retrieval functions. Therefore, you must run at least two retrieval modules first, collect the ids and scores from each, and pass them to this function as tuples.

Parameters:
  • ids – The tuple of ids that you want to fuse. Its length must match the length of scores. The semantic retrieval ids must be at the first index.

  • scores – The retrieval scores that you want to fuse. Its length must match the length of ids. The semantic retrieval scores must be at the first index.

  • top_k – The number of passages to be retrieved.

  • normalize_method

    The normalization method to use. AutoRAG supports the following methods:

    • mm: Min-max scaling

    • tmm: Theoretical min-max scaling

    • z: z-score normalization

    • dbsf: 3-sigma normalization

  • weight – The weight value. A weight of 1.0 gives full weight (1.0) to the semantic module and zero weight (0.0) to the lexical module.

  • semantic_theoretical_min_value – The theoretical minimum value for semantic scores, used by the tmm normalization method. You can set it yourself; the default is -1.

  • lexical_theoretical_min_value – The theoretical minimum value for lexical scores, used by the tmm normalization method. You can set it yourself; the default is 0.

Returns:

A tuple of the ids and the scores fused by CC. The third element is the selected weight value.
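
A minimal sketch of fusing one query's results; every id and score below is a hypothetical placeholder, and the semantic results are placed first as required:

from autorag.nodes.retrieval.hybrid_cc import hybrid_cc

# One query per row; semantic results occupy index 0 of each tuple.
semantic_ids = [["doc-1", "doc-2", "doc-3"]]
lexical_ids = [["doc-2", "doc-4", "doc-1"]]
semantic_scores = [[0.91, 0.85, 0.80]]
lexical_scores = [[12.3, 9.1, 7.7]]

result = hybrid_cc(
    ids=(semantic_ids, lexical_ids),
    scores=(semantic_scores, lexical_scores),
    top_k=3,
    weight=0.6,  # 0.6 to the semantic module, 0.4 to the lexical module
    normalize_method="mm",
)
fused_ids, fused_scores = result[0], result[1]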

autorag.nodes.retrieval.hybrid_cc.normalize_dbsf(scores: List[float], fixed_min_value: float = 0)[source]
autorag.nodes.retrieval.hybrid_cc.normalize_mm(scores: List[float], fixed_min_value: float = 0)[source]
autorag.nodes.retrieval.hybrid_cc.normalize_tmm(scores: List[float], fixed_min_value: float)[source]
autorag.nodes.retrieval.hybrid_cc.normalize_z(scores: List[float], fixed_min_value: float = 0)[source]

autorag.nodes.retrieval.hybrid_rrf module

class autorag.nodes.retrieval.hybrid_rrf.HybridRRF(project_dir: str, target_modules, target_module_params, *args, **kwargs)[source]

Bases: HybridRetrieval

classmethod run_evaluator(project_dir: str | Path, previous_result: DataFrame, *args, **kwargs)[source]
autorag.nodes.retrieval.hybrid_rrf.hybrid_rrf(ids: Tuple, scores: Tuple, top_k: int, weight: int = 60, rrf_k: int = -1) Tuple[List[List[str]], List[List[float]]][source]

Hybrid RRF function. RRF (Reciprocal Rank Fusion) is a method for fusing multiple retrieval results, commonly used to combine dense and sparse retrieval results. Unlike other retrieval modules, it does not execute retrieval itself; it only fuses the results of other retrieval functions. Therefore, you must run at least two retrieval modules first, collect the ids and scores from each, and pass them to this function as tuples.

Parameters:
  • ids – The tuple of ids that you want to fuse. Its length must match the length of scores.

  • scores – The retrieval scores that you want to fuse. Its length must match the length of ids.

  • top_k – The number of passages to be retrieved.

  • weight – Hyperparameter for RRF (originally the rrf_k value). Default is 60. For more information, see the documentation.

  • rrf_k – (Deprecated) Hyperparameter for RRF, replaced by weight. Will be removed in a future version.

Returns:

A tuple of the ids and the scores fused by RRF.
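
A minimal sketch under the same assumptions: the ids and scores below are hypothetical outputs collected from two retrieval modules run beforehand:

from autorag.nodes.retrieval.hybrid_rrf import hybrid_rrf

# Hypothetical per-query results from two prior retrieval modules.
ids = (
    [["doc-1", "doc-2", "doc-3"]],  # e.g., semantic retrieval
    [["doc-2", "doc-4", "doc-1"]],  # e.g., lexical retrieval
)
scores = (
    [[0.91, 0.85, 0.80]],
    [[12.3, 9.1, 7.7]],
)

fused_ids, fused_scores = hybrid_rrf(ids=ids, scores=scores, top_k=3, weight=60)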

autorag.nodes.retrieval.hybrid_rrf.rrf_calculate(row, rrf_k)[source]
autorag.nodes.retrieval.hybrid_rrf.rrf_pure(ids: Tuple, scores: Tuple, rrf_k: int, top_k: int) Tuple[List[str], List[float]][source]

autorag.nodes.retrieval.run module

autorag.nodes.retrieval.run.edit_summary_df_params(summary_df: DataFrame, target_modules, target_module_params) DataFrame[source]
autorag.nodes.retrieval.run.evaluate_retrieval_node(result_df: DataFrame, metric_inputs: List[MetricInput], metrics: List[str] | List[Dict]) DataFrame[source]

Evaluate a retrieval node from its result dataframe.

Parameters:
  • result_df – The result dataframe from a retrieval node.

  • metric_inputs – List of metric input schema for AutoRAG.

  • metrics – Metric list from input strategies.

Returns:

Returns result_df with metric columns added. The columns will be ‘retrieved_contents’, ‘retrieved_ids’, ‘retrieve_scores’, and the metric names.
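
A hedged sketch of a call site. Building the List[MetricInput] is project-specific and omitted here, and the metric names are assumptions about the available retrieval metrics:

from pandas import DataFrame

from autorag.nodes.retrieval.run import evaluate_retrieval_node

def score_retrieval(result_df: DataFrame, metric_inputs) -> DataFrame:
    # metric_inputs: a pre-built List[MetricInput]; construction omitted.
    # metrics may also be a List[Dict] carrying per-metric options.
    return evaluate_retrieval_node(
        result_df,
        metric_inputs,
        metrics=["retrieval_f1", "retrieval_recall"],  # assumed metric names
    )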

autorag.nodes.retrieval.run.find_unique_elems(list1: List[str], list2: List[str]) List[str][source]
autorag.nodes.retrieval.run.get_hybrid_execution_times(lexical_summary, semantic_summary) float[source]
autorag.nodes.retrieval.run.get_ids_and_scores(node_dir: str, filenames: List[str], semantic_summary_df: DataFrame, lexical_summary_df: DataFrame, previous_result) Dict[str, Tuple[List[List[str]], List[List[float]]]][source]
autorag.nodes.retrieval.run.get_scores_by_ids(ids: List[List[str]], module_summary_df: DataFrame, project_dir, previous_result) List[List[float]][source]
autorag.nodes.retrieval.run.optimize_hybrid(hybrid_module_func: Callable, hybrid_module_param: Dict, strategy: Dict, input_metrics: List[MetricInput], project_dir, previous_result)[source]
autorag.nodes.retrieval.run.run_retrieval_node(modules: List, module_params: List[Dict], previous_result: DataFrame, node_line_dir: str, strategies: Dict) DataFrame[source]

Run evaluation and select the best module among retrieval node results.

Parameters:
  • modules – Retrieval modules to run.

  • module_params – Retrieval module parameters.

  • previous_result – Previous result dataframe. It could be the query expansion node’s best result or the QA data.

  • node_line_dir – This node line’s directory.

  • strategies – Strategies for retrieval node.

Returns:

The best result dataframe, containing the previous result columns and the retrieval node’s result columns.
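
A heavily hedged sketch. The module classes and their parameter keys come from this page; the previous-result columns and the strategy keys are assumptions about the wider AutoRAG configuration format:

import pandas as pd

from autorag.nodes.retrieval.bm25 import BM25
from autorag.nodes.retrieval.run import run_retrieval_node
from autorag.nodes.retrieval.vectordb import VectorDB

# Hypothetical QA-style previous result; real runs pass the prior node's output.
previous_result = pd.DataFrame(
    {
        "qid": ["q-1"],
        "query": ["What is hybrid retrieval?"],
        "retrieval_gt": [[["doc-1"]]],  # assumed ground-truth layout
    }
)

best_df = run_retrieval_node(
    modules=[BM25, VectorDB],
    module_params=[{"top_k": 5}, {"top_k": 5, "vectordb": "default"}],
    previous_result=previous_result,
    node_line_dir="./project/node_line_1",  # hypothetical path
    strategies={"metrics": ["retrieval_f1", "retrieval_recall"]},  # assumed keys
)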

autorag.nodes.retrieval.vectordb module

class autorag.nodes.retrieval.vectordb.VectorDB(project_dir: str, vectordb: str = 'default', **kwargs)[source]

Bases: BaseRetrieval

pure(previous_result: DataFrame, *args, **kwargs)[source]
async autorag.nodes.retrieval.vectordb.filter_exist_ids(vectordb: BaseVectorStore, corpus_data: DataFrame) DataFrame[source]
async autorag.nodes.retrieval.vectordb.filter_exist_ids_from_retrieval_gt(vectordb: BaseVectorStore, qa_data: DataFrame, corpus_data: DataFrame) DataFrame[source]
autorag.nodes.retrieval.vectordb.get_id_scores(query_embeddings: List[List[float]], content_embeddings: List[List[float]], similarity_metric: str) List[float][source]

Calculate the highest similarity scores between query embeddings and content embeddings.

Parameters:
  • query_embeddings – A list of lists containing query embeddings.

  • content_embeddings – A list of lists containing content embeddings.

  • similarity_metric – The similarity metric to use (‘l2’, ‘ip’, or ‘cosine’).

Returns:

A list of the highest similarity scores for each content embedding.
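
A minimal sketch with toy 3-dimensional embeddings; real embeddings would come from an embedding model:

from autorag.nodes.retrieval.vectordb import get_id_scores

query_embeddings = [[0.1, 0.3, 0.5]]
content_embeddings = [[0.1, 0.3, 0.5], [0.9, 0.0, 0.1]]

# One score per content embedding: its highest similarity across all queries.
scores = get_id_scores(query_embeddings, content_embeddings, similarity_metric="cosine")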

autorag.nodes.retrieval.vectordb.run_query_embedding_batch(queries: List[str], embedding_model: BaseEmbedding, batch_size: int) List[List[float]][source]
async autorag.nodes.retrieval.vectordb.vectordb_ingest(vectordb: BaseVectorStore, corpus_data: DataFrame)[source]

Ingest the given corpus data into the vectordb. When the embedding model is OpenAIEmbedding, corpus contents are truncated to 8,000 tokens. Contents that are empty (whitespace only) are ignored, and document ids that already exist in the collection are also ignored.

Parameters:
  • vectordb – The vector store instance to ingest into.

  • corpus_data – The corpus data that contains doc_id and contents columns.
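
A minimal sketch, assuming a BaseVectorStore instance configured elsewhere; the doc_id and contents values are hypothetical:

import pandas as pd

from autorag.nodes.retrieval.vectordb import vectordb_ingest

corpus_data = pd.DataFrame(
    {"doc_id": ["doc-1", "doc-2"], "contents": ["first passage", "second passage"]}
)

async def ingest(vectordb):
    # vectordb: an already-configured BaseVectorStore instance.
    await vectordb_ingest(vectordb, corpus_data)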

async autorag.nodes.retrieval.vectordb.vectordb_pure(queries: List[str], top_k: int, vectordb: BaseVectorStore) Tuple[List[str], List[float]][source]

Async VectorDB retrieval function, intended for asynchronous row-by-row vector DB retrieval.

Parameters:
  • queries – A list of query strings.

  • top_k – The number of passages to be retrieved.

  • vectordb – The vector store instance.

Returns:

A tuple containing the list of passage ids retrieved from the vectordb and the list of their scores.
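
A minimal sketch, again assuming a configured BaseVectorStore whose collection already holds the corpus:

from autorag.nodes.retrieval.vectordb import vectordb_pure

async def retrieve(vectordb, query: str, top_k: int = 5):
    # Returns one row's passage ids and scores for a single hypothetical query.
    ids, scores = await vectordb_pure(queries=[query], top_k=top_k, vectordb=vectordb)
    return ids, scores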

Module contents