autorag.evaluation.metric package

Submodules

autorag.evaluation.metric.generation module

async autorag.evaluation.metric.generation.async_g_eval(generation_gt: List[str], pred: str, metrics: List[str] | None = None, model: str = 'gpt-4-0125-preview') float[source]
autorag.evaluation.metric.generation.bert_score(metric_inputs: List[MetricInput], lang: str = 'en', batch: int = 128, n_threads: int = 4) List[float][source]
autorag.evaluation.metric.generation.bleu(metric_inputs: List[MetricInput], tokenize: str | None = None, smooth_method: str = 'exp', smooth_value: float | None = None, max_ngram_order: int = 4, trg_lang: str = '', **kwargs) List[float][source]

Computes the BLEU metric given the prediction and the ground truth.

Parameters:
  • metric_inputs – A list of MetricInput schema (Required Field -> “generation_gt”, “generated_texts”)

  • tokenize – The tokenizer to use. If None, defaults to language-specific tokenizers with ‘13a’ as the fallback default. See https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/metrics/bleu.py

  • smooth_method – The smoothing method to use (‘floor’, ‘add-k’, ‘exp’ or ‘none’).

  • smooth_value – The smoothing value for floor and add-k methods. None falls back to default value.

  • max_ngram_order – If given, it overrides the maximum n-gram order (default: 4) when computing precisions.

  • trg_lang – An optional language code to raise potential tokenizer warnings.
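
A minimal usage sketch. The keyword construction of MetricInput and the import path autorag.schema.metricinput are assumptions based on the Required Field note above; adjust them to the actual schema:

    from autorag.schema.metricinput import MetricInput  # assumed import path
    from autorag.evaluation.metric.generation import bleu

    inputs = [
        MetricInput(
            generation_gt=["Paris is the capital of France."],  # reference answers (assumed field)
            generated_texts="The capital of France is Paris.",  # model output to score (assumed field)
        )
    ]

    scores = bleu(inputs, tokenize="13a")  # one BLEU score per MetricInput
    print(scores)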

autorag.evaluation.metric.generation.g_eval(metric_inputs: List[MetricInput], metrics: List[str] | None = None, model: str = 'gpt-4-0125-preview', batch_size: int = 8) List[float][source]

Calculate the G-Eval score. G-Eval is a metric that uses a high-performance LLM to evaluate generation quality. It scores the generation result on coherence, consistency, fluency, and relevance. Only OpenAI models are supported, and we recommend gpt-4 for evaluation accuracy.

Parameters:
  • metric_inputs – A list of MetricInput schema (Required Field -> “generation_gt”, “generated_texts”)

  • metrics – A list of metrics to use for evaluation. Default is all metrics, which is [‘coherence’, ‘consistency’, ‘fluency’, ‘relevance’].

  • model – OpenAI model name. Default is ‘gpt-4-0125-preview’.

  • batch_size – The batch size for processing. Default is 8.

Returns:

G-Eval score.
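
A hedged sketch showing how to restrict G-Eval to a subset of metrics. It reuses the inputs list from the BLEU sketch above and assumes an OpenAI API key is available in the environment:

    from autorag.evaluation.metric.generation import g_eval

    # Score only coherence and fluency to reduce the number of LLM calls.
    scores = g_eval(
        inputs,  # List[MetricInput], as constructed in the BLEU sketch above
        metrics=["coherence", "fluency"],
        model="gpt-4-0125-preview",
        batch_size=8,
    )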

autorag.evaluation.metric.generation.huggingface_evaluate(instance, key: str, metric_inputs: List[MetricInput], **kwargs) List[float][source]

Compute a metric using a huggingface evaluate instance.

Parameters:
  • instance – The instance of the huggingface evaluate metric.

  • key – The key used to retrieve the score from the huggingface evaluate result.

  • metric_inputs – A list of MetricInput schema

  • kwargs – Additional arguments for the metric function.

Returns:

The list of scores.
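
A sketch under the assumption that instance is a metric loaded with the huggingface evaluate library and key is the name of the score field in its result dict (for example, "meteor" for the meteor metric):

    import evaluate  # huggingface evaluate library
    from autorag.evaluation.metric.generation import huggingface_evaluate

    meteor_instance = evaluate.load("meteor")
    scores = huggingface_evaluate(meteor_instance, "meteor", inputs)  # inputs: List[MetricInput]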

autorag.evaluation.metric.generation.meteor(metric_inputs: List[MetricInput], alpha: float = 0.9, beta: float = 3.0, gamma: float = 0.5) List[float][source]

Compute meteor score for generation.

Parameters:
  • metric_inputs – A list of MetricInput schema (Required Field -> “generation_gt”, “generated_texts”)

  • alpha – Parameter for controlling relative weights of precision and recall. Default is 0.9.

  • beta – Parameter for controlling the shape of the penalty as a function of fragmentation. Default is 3.0.

  • gamma – Relative weight assigned to fragmentation penalty. Default is 0.5.

Returns:

A list of computed metric scores.
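
A short sketch using the default weighting parameters, again reusing the inputs list from the BLEU sketch above:

    from autorag.evaluation.metric.generation import meteor

    scores = meteor(inputs, alpha=0.9, beta=3.0, gamma=0.5)  # defaults written out explicitly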

autorag.evaluation.metric.generation.rouge(metric_inputs: List[MetricInput], rouge_type: str | None = 'rougeL', use_stemmer: bool = False, split_summaries: bool = False, batch: int = 4) List[float][source]

Compute rouge score for generation.

Parameters:
  • metric_inputs – A list of MetricInput schema (Required Field -> “generation_gt”, “generated_texts”)

  • rouge_type – A rouge type to use for evaluation. Default is ‘rougeL’. Choose between rouge1, rouge2, rougeL, and rougeLSum. - rouge1: unigram (1-gram) based scoring. - rouge2: bigram (2-gram) based scoring. - rougeL: Longest Common Subsequence based scoring. - rougeLSum: splits the text by newlines and applies rougeL to each split.

  • use_stemmer – Bool indicating whether Porter stemmer should be used to strip word suffixes to improve matching. This arg is used in the DefaultTokenizer, but other tokenizers might or might not choose to use this. Default is False.

  • split_summaries – Whether to add newlines between sentences for rougeLSum. Default is False.

  • batch – The batch size for processing. Default is 4.

Returns:

A list of computed metric scores.
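
A sketch selecting a specific rouge variant; the inputs list follows the same MetricInput assumption as the BLEU sketch above:

    from autorag.evaluation.metric.generation import rouge

    # Longest Common Subsequence variant with Porter stemming enabled.
    scores = rouge(inputs, rouge_type="rougeL", use_stemmer=True)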

autorag.evaluation.metric.generation.sem_score(metric_inputs: List[MetricInput], embedding_model: BaseEmbedding | None = None, batch: int = 128) List[float][source]

Compute the SemScore between the generation ground truth and the prediction using cosine similarity.

Parameters:
  • metric_inputs – A list of MetricInput schema (Required Field -> “generation_gt”, “generated_texts”)

  • embedding_model – Embedding model to use for computing cosine similarity. Default is the all-mpnet-base-v2 embedding model, which is the model used in the paper.

  • batch – The batch size for processing. Default is 128.

Returns:

A list of computed metric scores.
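
A sketch relying on the default embedding model; passing embedding_model explicitly would require a BaseEmbedding instance. The inputs list follows the BLEU sketch above:

    from autorag.evaluation.metric.generation import sem_score

    scores = sem_score(inputs)  # cosine similarity between embedded ground truth and prediction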

autorag.evaluation.metric.retrieval module

autorag.evaluation.metric.retrieval.retrieval_f1(metric_input: MetricInput)[source]

Compute f1 score for retrieval.

Parameters:

metric_input – The MetricInput schema for AutoRAG metric.

Returns:

The f1 score.
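
A hedged sketch. The field names retrieval_gt (nested list of ground-truth ids) and retrieved_ids (retrieved ids in rank order) are assumptions about the MetricInput schema, since this page does not list the required fields for the retrieval metrics:

    from autorag.schema.metricinput import MetricInput  # assumed import path
    from autorag.evaluation.metric.retrieval import retrieval_f1

    metric_input = MetricInput(
        retrieval_gt=[["doc-1"], ["doc-7"]],        # assumed field: ground-truth id groups
        retrieved_ids=["doc-1", "doc-3", "doc-7"],  # assumed field: retrieved ids in rank order
    )
    score = retrieval_f1(metric_input)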

autorag.evaluation.metric.retrieval.retrieval_map(metric_input: MetricInput) float[source]

Mean Average Precision (MAP) is the mean of Average Precision (AP) for all queries.
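
An illustrative Average Precision computation for a single query (plain Python, independent of the AutoRAG API). With relevant items retrieved at ranks 1 and 3, AP = (1/1 + 2/3) / 2:

    relevant_ranks = [1, 3]  # ranks at which relevant items were retrieved
    ap = sum((i + 1) / rank for i, rank in enumerate(relevant_ranks)) / len(relevant_ranks)
    print(ap)  # 0.8333...; MAP is the mean of AP over all queries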

autorag.evaluation.metric.retrieval.retrieval_mrr(metric_input: MetricInput) float[source]

Reciprocal Rank (RR) is the reciprocal of the rank of the first relevant item. The mean of RR over all queries is the MRR (Mean Reciprocal Rank).
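
An illustrative RR computation (plain Python): if the first relevant item appears at rank 4, RR is 1/4, and MRR averages these values over all queries:

    first_relevant_ranks = [4, 1, 2]  # rank of the first relevant item for each query
    mrr = sum(1 / rank for rank in first_relevant_ranks) / len(first_relevant_ranks)
    print(mrr)  # (0.25 + 1.0 + 0.5) / 3 = 0.5833...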

autorag.evaluation.metric.retrieval.retrieval_ndcg(metric_input: MetricInput) float[source]
autorag.evaluation.metric.retrieval.retrieval_precision(metric_input: MetricInput) float[source]
autorag.evaluation.metric.retrieval.retrieval_recall(metric_input: MetricInput) float[source]

autorag.evaluation.metric.retrieval_contents module

This file contains the retrieval contents metrics, which are calculated based on the contents of the retrieved items.

autorag.evaluation.metric.retrieval_contents.retrieval_token_f1(metric_input: MetricInput)[source]
autorag.evaluation.metric.retrieval_contents.retrieval_token_precision(metric_input: MetricInput)[source]
autorag.evaluation.metric.retrieval_contents.retrieval_token_recall(metric_input: MetricInput)[source]
autorag.evaluation.metric.retrieval_contents.single_token_f1(ground_truth: str, prediction: str)[source]
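
An illustrative token-level F1 computation (plain Python, not the library call itself) showing the idea behind these functions: precision and recall are computed over the overlapping tokens of the ground truth and the prediction:

    from collections import Counter

    gt_tokens = "the capital of france is paris".split()
    pred_tokens = "paris is the capital".split()
    overlap = sum((Counter(gt_tokens) & Counter(pred_tokens)).values())
    precision = overlap / len(pred_tokens)  # 4 / 4 = 1.0
    recall = overlap / len(gt_tokens)       # 4 / 6 = 0.6667
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    print(precision, recall, f1)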

autorag.evaluation.metric.util module

autorag.evaluation.metric.util.autorag_metric(fields_to_check: List[str])[source]
autorag.evaluation.metric.util.autorag_metric_loop(fields_to_check: List[str])[source]
autorag.evaluation.metric.util.calculate_cosine_similarity(a, b)[source]
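
A short sketch assuming a and b are array-like embedding vectors; the equivalent numpy expression is shown for reference:

    import numpy as np
    from autorag.evaluation.metric.util import calculate_cosine_similarity

    a = np.array([1.0, 0.0, 1.0])
    b = np.array([1.0, 1.0, 0.0])
    score = calculate_cosine_similarity(a, b)
    # Reference: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) == 0.5
    print(score)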

Module contents