autorag.nodes.retrieval package

Submodules

autorag.nodes.retrieval.base module

autorag.nodes.retrieval.base.cast_queries(queries: str | List[str]) List[str][source]
autorag.nodes.retrieval.base.evenly_distribute_passages(ids: List[List[str]], scores: List[List[float]], top_k: int) Tuple[List[str], List[float]][source]
autorag.nodes.retrieval.base.get_bm25_pkl_name(bm25_tokenizer: str)[source]
autorag.nodes.retrieval.base.load_bm25_corpus(bm25_path: str) Dict[source]
autorag.nodes.retrieval.base.load_chroma_collection(db_path: str, collection_name: str) Collection[source]
autorag.nodes.retrieval.base.retrieval_node(func)[source]

Load the resources needed for running retrieval_node. For example, it loads the bm25 corpus for bm25 retrieval.

Parameters:

func – Retrieval function that returns a list of ids and a list of scores

Returns:

A pandas DataFrame that contains retrieved contents, retrieved ids, and retrieve scores. The column names will be “retrieved_contents”, “retrieved_ids”, and “retrieve_scores”.

autorag.nodes.retrieval.bm25 module

autorag.nodes.retrieval.bm25.bm25(queries: List[List[str]], top_k: int, bm25_corpus: Dict, bm25_tokenizer: str = 'porter_stemmer', ids: List[List[str]] | None = None) Tuple[List[List[str]], List[List[float]]][source]

BM25 retrieval function. You must load a pickle file that has already been ingested.

Parameters:
  • queries – A 2-d list of query strings. Each element is the list of query strings for one row.

  • top_k – The number of passages to be retrieved.

  • bm25_corpus

    A dictionary containing the bm25 corpus: the doc_ids from the corpus and the tokenized corpus. Its data structure looks like this:

    {
        "tokens": [], # 2d list of tokens
        "passage_id": [], # 2d list of passage_id.
    }
    

  • bm25_tokenizer – The name of the tokenizer to use for BM25. It supports ‘porter_stemmer’, ‘ko_kiwi’, and Hugging Face AutoTokenizer; you can pass a Hugging Face tokenizer name. Default is ‘porter_stemmer’.

  • ids – An optional list of ids that you want to retrieve. You don’t need to specify this in general use cases. Default is None.

Returns:

A 2-d list of passage ids retrieved from bm25 and a 2-d list of their scores. The outer lists have the same length as queries, and each inner list has a length of top_k.
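
The following is a minimal, hypothetical sketch of calling bm25 as documented above. The pickle path and the sample queries are assumptions, and since bm25 is wrapped by the retrieval_node decorator (see the base module), some AutoRAG versions may require calling it through that interface rather than passing bm25_corpus directly.

    import pickle

    from autorag.nodes.retrieval.bm25 import bm25

    # Hypothetical path to a pickle produced by bm25_ingest(); the actual
    # location depends on your project layout.
    bm25_path = "./resources/bm25_porter_stemmer.pkl"
    with open(bm25_path, "rb") as f:
        bm25_corpus = pickle.load(f)  # dict with "tokens" and "passage_id" keys

    # Each row of `queries` can hold several reformulations of one question.
    queries = [["what is bm25?"], ["how does dense retrieval work?"]]

    id_result, score_result = bm25(
        queries,
        top_k=3,
        bm25_corpus=bm25_corpus,
        bm25_tokenizer="porter_stemmer",
    )
    # id_result and score_result are 2-d lists: one inner list per query row,
    # each of length top_k.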

autorag.nodes.retrieval.bm25.bm25_ingest(corpus_path: str, corpus_data: DataFrame, bm25_tokenizer: str = 'porter_stemmer')[source]
async autorag.nodes.retrieval.bm25.bm25_pure(queries: List[str], top_k: int, tokenizer, bm25_api: BM25Okapi, bm25_corpus: Dict) Tuple[List[str], List[float]][source]

Async BM25 retrieval function, used for asynchronous bm25 retrieval row by row.

Parameters:
  • queries – A list of query strings.

  • top_k – The number of passages to be retrieved.

  • tokenizer – A tokenizer that will be used to tokenize queries.

  • bm25_api – A bm25 api instance that will be used to retrieve passages.

  • bm25_corpus

    A dictionary containing the bm25 corpus: the doc_ids from the corpus and the tokenized corpus. Its data structure looks like this:

    {
        "tokens": [], # 2d list of tokens
        "passage_id": [], # 2d list of passage_id. Type must be str.
    }
    

Returns:

A tuple containing a list of passage ids retrieved from bm25 and a list of their scores.

autorag.nodes.retrieval.bm25.get_bm25_scores(queries: List[str], ids: List[str], tokenizer, bm25_api: BM25Okapi, bm25_corpus: Dict) List[float][source]
autorag.nodes.retrieval.bm25.select_bm25_tokenizer(bm25_tokenizer: str) Callable[[str], List[int | str]][source]
autorag.nodes.retrieval.bm25.tokenize(queries: List[str], tokenizer) List[List[int]][source]
autorag.nodes.retrieval.bm25.tokenize_ko_kiwi(texts: List[str]) List[List[str]][source]
autorag.nodes.retrieval.bm25.tokenize_ko_kkma(texts: List[str]) List[List[str]][source]
autorag.nodes.retrieval.bm25.tokenize_ko_okt(texts: List[str]) List[List[str]][source]
autorag.nodes.retrieval.bm25.tokenize_porter_stemmer(texts: List[str]) List[List[str]][source]
autorag.nodes.retrieval.bm25.tokenize_space(texts: List[str]) List[List[str]][source]

autorag.nodes.retrieval.hybrid_cc module

autorag.nodes.retrieval.hybrid_cc.fuse_per_query(semantic_ids: List[str], lexical_ids: List[str], semantic_scores: List[float], lexical_scores: List[float], normalize_method: str, weight: float, top_k: int, semantic_theoretical_min_value: float, lexical_theoretical_min_value: float)[source]
autorag.nodes.retrieval.hybrid_cc.hybrid_cc(ids: Tuple, scores: Tuple, top_k: int, weight: float, normalize_method: str = 'mm', semantic_theoretical_min_value: float = -1.0, lexical_theoretical_min_value: float = 0.0) Tuple[List[List[str]], List[List[float]]][source]

Hybrid CC function. CC (convex combination) is a method for fusing lexical and semantic retrieval results: it first normalizes the scores of each retrieval result and then combines them with the given weights. Unlike other retrieval modules, it does not execute retrieval itself; it only fuses the results of other retrieval functions. So you must run at least two retrieval modules before running this function, collect the ids and scores from each, pack them into tuples, and pass them to this function.

Parameters:
  • ids – The tuple of ids that you want to fuse. Its length must be the same as the length of scores. The semantic retrieval ids must come first.

  • scores – The tuple of retrieve scores that you want to fuse. Its length must be the same as the length of ids. The semantic retrieval scores must come first.

  • top_k – The number of passages to be retrieved.

  • normalize_method

    The normalization method to use. There are several normalization methods that you can use with the hybrid cc method. AutoRAG supports the following:

    • mm: Min-max scaling

    • tmm: Theoretical min-max scaling

    • z: z-score normalization

    • dbsf: 3-sigma normalization

  • weight – The weight value. If the weight is 1.0, the semantic module gets a weight of 1.0 and the lexical module gets a weight of 0.0.

  • semantic_theoretical_min_value – This value is used by the tmm normalization method. You can set the theoretical minimum value yourself. Default is -1.

  • lexical_theoretical_min_value – This value is used by the tmm normalization method. You can set the theoretical minimum value yourself. Default is 0.

Returns:

A tuple of ids and scores fused by CC. The third element is the selected weight value.
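
A minimal sketch of fusing two hypothetical retrieval results with hybrid_cc. All ids and scores below are made up for illustration; the unpacking of a possible third return element (the selected weight) follows the note above.

    from autorag.nodes.retrieval.hybrid_cc import hybrid_cc

    # Hypothetical results from two retrieval modules on the same two query rows.
    # Semantic (e.g. vectordb) results must come first in each tuple.
    semantic_ids = [["doc-1", "doc-2"], ["doc-3", "doc-1"]]
    semantic_scores = [[0.82, 0.41], [0.77, 0.30]]
    lexical_ids = [["doc-2", "doc-5"], ["doc-1", "doc-4"]]
    lexical_scores = [[11.2, 9.8], [14.5, 7.1]]

    result = hybrid_cc(
        ids=(semantic_ids, lexical_ids),
        scores=(semantic_scores, lexical_scores),
        top_k=2,
        weight=0.6,             # 0.6 toward the semantic module, 0.4 toward the lexical one
        normalize_method="mm",  # min-max scaling before combining
    )
    fused_ids, fused_scores = result[0], result[1]
    # Depending on the version, result[2] may hold the selected weight value.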

autorag.nodes.retrieval.hybrid_cc.normalize_dbsf(scores: List[str], fixed_min_value: float = 0)[source]
autorag.nodes.retrieval.hybrid_cc.normalize_mm(scores: List[str], fixed_min_value: float = 0)[source]
autorag.nodes.retrieval.hybrid_cc.normalize_tmm(scores: List[str], fixed_min_value: float)[source]
autorag.nodes.retrieval.hybrid_cc.normalize_z(scores: List[str], fixed_min_value: float = 0)[source]

autorag.nodes.retrieval.hybrid_rrf module

autorag.nodes.retrieval.hybrid_rrf.hybrid_rrf(ids: Tuple, scores: Tuple, top_k: int, weight: int = 60, rrf_k: int = -1) Tuple[List[List[str]], List[List[float]]][source]

Hybrid RRF function. RRF (Reciprocal Rank Fusion) is a method for fusing multiple retrieval results, commonly used to fuse dense and sparse retrieval results. Unlike other retrieval modules, it does not execute retrieval itself; it only fuses the results of other retrieval functions. So you must run at least two retrieval modules before running this function, collect the ids and scores from each, pack them into tuples, and pass them to this function.

Parameters:
  • ids – The tuple of ids that you want to fuse. The length of this must be the same as the length of scores.

  • scores – The retrieve scores that you want to fuse. The length of this must be the same as the length of ids.

  • top_k – The number of passages to be retrieved.

  • weight – Hyperparameter for RRF. It was originally the rrf_k value. Default is 60. For more information, please visit our documentation.

  • rrf_k – (Deprecated) The original rrf_k hyperparameter for RRF. Will be removed in a future version.

Returns:

A tuple of ids and scores fused by RRF.
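
A minimal sketch of hybrid_rrf with made-up inputs. The rank-based formula in the comment is the standard RRF definition and is meant as an illustration, not as a transcript of AutoRAG’s internal implementation.

    from autorag.nodes.retrieval.hybrid_rrf import hybrid_rrf

    # Hypothetical results from two retrieval modules (one query row).
    semantic_ids, semantic_scores = [["doc-1", "doc-2"]], [[0.82, 0.41]]
    lexical_ids, lexical_scores = [["doc-2", "doc-5"]], [[11.2, 9.8]]

    # RRF uses only ranks, so raw score scales do not matter: each document
    # receives roughly the sum over modules of 1 / (rrf_k + rank).
    fused_ids, fused_scores = hybrid_rrf(
        ids=(semantic_ids, lexical_ids),
        scores=(semantic_scores, lexical_scores),
        top_k=2,
        weight=60,  # the classic rrf_k hyperparameter
    )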

autorag.nodes.retrieval.hybrid_rrf.rrf_calculate(row, rrf_k)[source]
autorag.nodes.retrieval.hybrid_rrf.rrf_pure(ids: Tuple, scores: Tuple, rrf_k: int, top_k: int) Tuple[List[str], List[float]][source]

autorag.nodes.retrieval.run module

autorag.nodes.retrieval.run.edit_summary_df_params(summary_df: DataFrame, target_modules, target_module_params) DataFrame[source]
autorag.nodes.retrieval.run.evaluate_retrieval_node(result_df: DataFrame, metric_inputs: List[MetricInput], metrics: List[str] | List[Dict]) DataFrame[source]

Evaluate retrieval node from retrieval node result dataframe.

Parameters:
  • result_df – The result dataframe from a retrieval node.

  • metric_inputs – List of metric input schema for AutoRAG.

  • metrics – Metric list from input strategies.

Returns:

Returns result_df with metric columns. The columns will be ‘retrieved_contents’, ‘retrieved_ids’, ‘retrieve_scores’, and the metric names.

autorag.nodes.retrieval.run.find_unique_elems(list1: List[str], list2: List[str]) List[str][source]
autorag.nodes.retrieval.run.get_hybrid_execution_times(lexical_summary, semantic_summary) float[source]
autorag.nodes.retrieval.run.get_ids_and_scores(node_dir: str, filenames: List[str], semantic_summary_df: DataFrame, lexical_summary_df: DataFrame, previous_result) Dict[str, Tuple[List[List[str]], List[List[float]]]][source]
autorag.nodes.retrieval.run.get_scores_by_ids(ids: List[List[str]], module_summary_df: DataFrame, project_dir, previous_result) List[List[float]][source]
autorag.nodes.retrieval.run.optimize_hybrid(hybrid_module_func: Callable, hybrid_module_param: Dict, strategy: Dict, input_metrics: List[MetricInput], project_dir, previous_result)[source]
autorag.nodes.retrieval.run.run_retrieval_node(modules: List[Callable], module_params: List[Dict], previous_result: DataFrame, node_line_dir: str, strategies: Dict) DataFrame[source]

Run evaluation and select the best module among retrieval node results.

Parameters:
  • modules – Retrieval modules to run.

  • module_params – Retrieval module parameters.

  • previous_result – Previous result dataframe. Could be query expansion’s best result or qa data.

  • node_line_dir – This node line’s directory.

  • strategies – Strategies for retrieval node.

Returns:

The best result dataframe. It contains previous result columns and retrieval node’s result columns.

autorag.nodes.retrieval.vectordb module

autorag.nodes.retrieval.vectordb.get_id_scores(ids: List[str], query_embeddings: List[List[float]], collection: Collection, temp_client: Client) List[float][source]
autorag.nodes.retrieval.vectordb.run_query_embedding_batch(queries: List[str], embedding_model: BaseEmbedding, batch_size: int) List[List[float]][source]
autorag.nodes.retrieval.vectordb.vectordb(queries: List[List[str]], top_k: int, collection: Collection, embedding_model: BaseEmbedding, embedding_batch: int = 128, ids: List[List[str]] | None = None) Tuple[List[List[str]], List[List[float]]][source]

VectorDB retrieval function. You must provide a chroma collection that has already been ingested, along with the embedding model that was used for ingesting.

Parameters:
  • queries – A 2-d list of query strings. Each element is the list of query strings for one row.

  • top_k – The number of passages to be retrieved.

  • collection – A chroma collection instance that will be used to retrieve passages.

  • embedding_model – An embedding model instance that will be used to embed queries.

  • embedding_batch – The number of queries to be processed in parallel. This is used to prevent API errors during query embedding. Default is 128.

  • ids – An optional list of ids that you want to retrieve. You don’t need to specify this in general use cases. Default is None.

Returns:

A 2-d list of passage ids retrieved from vectordb and a 2-d list of their scores. The outer lists have the same length as queries, and each inner list has a length of top_k.
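
A minimal sketch of calling vectordb, assuming a chroma collection named "corpus" that was ingested earlier with the same embedding model. The collection name, persistence path, and embedding class are assumptions, and the embedding import path depends on your llama_index version. As with bm25, vectordb is wrapped by the retrieval_node decorator, so some AutoRAG versions may require calling it through that interface.

    import chromadb
    from llama_index.embeddings.openai import OpenAIEmbedding  # import path varies by llama_index version

    from autorag.nodes.retrieval.vectordb import vectordb

    # Hypothetical, already-ingested collection and the embedding model used for ingesting.
    client = chromadb.PersistentClient(path="./resources/chroma")
    collection = client.get_collection("corpus")
    embedding_model = OpenAIEmbedding()

    queries = [["what is hybrid retrieval?"], ["how are passages embedded?"]]

    id_result, score_result = vectordb(
        queries,
        top_k=3,
        collection=collection,
        embedding_model=embedding_model,
        embedding_batch=128,
    )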

autorag.nodes.retrieval.vectordb.vectordb_ingest(collection: Collection, corpus_data: DataFrame, embedding_model: BaseEmbedding, embedding_batch: int = 128)[source]

Ingest the given corpus data into the chromadb collection. When the embedding model is OpenAIEmbedding, corpus contents are truncated to 8191 tokens. Rows whose contents are empty (whitespace only) are ignored, and document ids that already exist in the collection are skipped. A short ingest sketch follows the parameter list below.

Parameters:
  • collection – Chromadb collection instance to ingest.

  • corpus_data – The corpus data that contains doc_id and contents columns.

  • embedding_model – An embedding model instance that will be used to embed the corpus contents.

  • embedding_batch – The number of chunks that will be processed in parallel.
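
Here is the ingest sketch referenced above, with a hypothetical two-row corpus. The collection name, persistence path, and embedding model are assumptions; the column names follow the corpus_data description above.

    import chromadb
    import pandas as pd
    from llama_index.embeddings.openai import OpenAIEmbedding  # import path varies by llama_index version

    from autorag.nodes.retrieval.vectordb import vectordb_ingest

    # Hypothetical corpus with the documented doc_id and contents columns.
    corpus_df = pd.DataFrame({
        "doc_id": ["doc-1", "doc-2"],
        "contents": [
            "Hybrid retrieval fuses lexical and semantic results.",
            "Chroma stores passage embeddings for dense retrieval.",
        ],
    })

    client = chromadb.PersistentClient(path="./resources/chroma")
    collection = client.get_or_create_collection("corpus")

    # Empty contents and already-ingested doc_ids are skipped, as described above.
    vectordb_ingest(collection, corpus_df, OpenAIEmbedding(), embedding_batch=128)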

async autorag.nodes.retrieval.vectordb.vectordb_pure(query_embeddings: List[List[float]], top_k: int, collection: Collection) Tuple[List[str], List[float]][source]

Async VectorDB retrieval function, used for asynchronous vector_db retrieval row by row.

Parameters:
  • query_embeddings – A list of query embeddings.

  • top_k – The number of passages to be retrieved.

  • collection – A chroma collection instance that will be used to retrieve passages.

Returns:

The tuple contains a list of passage ids retrieved from vectordb and a list of their scores.

Module contents