autorag.evaluation.metric package¶
Submodules¶
autorag.evaluation.metric.deepeval_prompt module¶
autorag.evaluation.metric.generation module¶
- async autorag.evaluation.metric.generation.async_g_eval(generation_gt: List[str], pred: str, metrics: List[str] | None = None, model: str = 'gpt-4-0125-preview') float [source]¶
- autorag.evaluation.metric.generation.bert_score(metric_inputs: List[MetricInput], lang: str = 'en', batch: int = 128, n_threads: int = 4) List[float] [source]¶
- autorag.evaluation.metric.generation.bleu(metric_inputs: List[MetricInput], tokenize: str | None = None, smooth_method: str = 'exp', smooth_value: float | None = None, max_ngram_order: int = 4, trg_lang: str = '', effective_order: bool = True, **kwargs) List[float] [source]¶
Computes the BLEU metric given pred and ground-truth.
- Parameters:
metric_inputs – A list of MetricInput schema (Required Field -> “generation_gt”, “generated_texts”)
tokenize – The tokenizer to use. If None, defaults to language-specific tokenizers with ‘13a’ as the fallback default. See https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/metrics/bleu.py for the available tokenizers.
smooth_method – The smoothing method to use (‘floor’, ‘add-k’, ‘exp’ or ‘none’).
smooth_value – The smoothing value for floor and add-k methods. None falls back to default value.
max_ngram_order – If given, it overrides the maximum n-gram order (default: 4) when computing precisions.
trg_lang – An optional language code to raise potential tokenizer warnings.
effective_order – If True, stop including n-gram orders for which precision is 0. This should be True if sentence-level BLEU will be computed.
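A minimal usage sketch, following the signature above; the MetricInput import path and direct keyword construction with the “generation_gt” and “generated_texts” fields are assumptions to check against your AutoRAG version.

```python
from autorag.evaluation.metric.generation import bleu
from autorag.schema.metricinput import MetricInput  # import path assumed; check your AutoRAG version

# One MetricInput per evaluated answer: reference answers plus the generated text.
inputs = [
    MetricInput(
        generation_gt=["Paris is the capital of France."],
        generated_texts="The capital of France is Paris.",
    ),
]

# Sentence-level BLEU with the documented defaults (exp smoothing, effective_order=True).
scores = bleu(metric_inputs=inputs, tokenize="13a")
print(scores)  # one float per MetricInput
```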
- autorag.evaluation.metric.generation.deepeval_faithfulness(metric_inputs: List[MetricInput], generator_module_type: str = 'openai_llm', lang: str = 'en', llm: str = 'gpt-4o-2024-08-06', batch: int = 16, **kwargs) List[float] [source]¶
Compute the DeepEval faithfulness metric. The default model is gpt-4o-2024-08-06; since it calls an OpenAI model, be aware of the API cost.
- Parameters:
metric_inputs – The list of MetricInput schema (Required Field -> “generation_gt”, “generated_texts”)
generator_module_type – Generator module type. The default is “openai_llm”. You can also use “llama_index_llm” or “vllm”.
lang – The prompt language that you want to use. “en”, “ko” and “ja” are supported. The Korean prompt is not officially supported by DeepEval, but it was translated by the AutoRAG developers. Default is “en”.
llm – The model name to use for generation, or the llm type if using “llama_index_llm”. The default is “gpt-4o-2024-08-06”.
batch – The batch size for processing. Default is 16.
kwargs – The extra parameters for initializing the llm instance.
- Returns:
The metric scores.
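A hedged sketch of calling the metric with the Korean prompt and the default OpenAI-backed generator; the module type strings come from the parameter list above, and the MetricInput construction is the same assumption as in the BLEU sketch.

```python
from autorag.evaluation.metric.generation import deepeval_faithfulness
from autorag.schema.metricinput import MetricInput  # import path assumed

inputs = [MetricInput(generation_gt=["Paris is the capital of France."],
                      generated_texts="The capital of France is Paris.")]

# OPENAI_API_KEY must be set in the environment for the default "openai_llm" generator.
scores = deepeval_faithfulness(
    metric_inputs=inputs,
    generator_module_type="openai_llm",  # or "llama_index_llm" / "vllm", per the parameter list
    lang="ko",                           # "en", "ko", or "ja"
    llm="gpt-4o-2024-08-06",
    batch=16,
)
```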
- autorag.evaluation.metric.generation.g_eval(metric_inputs: List[MetricInput], metrics: List[str] | None = None, model: str = 'gpt-4-0125-preview', batch_size: int = 8) List[float] [source]¶
Calculate the G-Eval score. G-Eval is a metric that uses a high-performance LLM to evaluate generation quality. It scores the generation result on coherence, consistency, fluency, and relevance. It supports only OpenAI models, and we recommend GPT-4 for evaluation accuracy.
- Parameters:
metric_inputs – A list of MetricInput schema (Required Field -> “generation_gt”, “generated_texts”)
metrics – A list of metrics to use for evaluation. Default is all metrics, which is [‘coherence’, ‘consistency’, ‘fluency’, ‘relevance’].
model – OpenAI model name. Default is ‘gpt-4-0125-preview’.
batch_size – The batch size for processing. Default is 8.
- Returns:
G-Eval score.
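A sketch that restricts evaluation to a subset of the criteria; the MetricInput construction is the same assumption as in the BLEU sketch, and an OpenAI API key is expected in the environment.

```python
from autorag.evaluation.metric.generation import g_eval
from autorag.schema.metricinput import MetricInput  # import path assumed

inputs = [MetricInput(generation_gt=["Paris is the capital of France."],
                      generated_texts="The capital of France is Paris.")]

# Score only two of the four criteria; OPENAI_API_KEY must be set in the environment.
scores = g_eval(
    metric_inputs=inputs,
    metrics=["coherence", "fluency"],
    model="gpt-4-0125-preview",
    batch_size=8,
)
```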
- autorag.evaluation.metric.generation.huggingface_evaluate(instance, key: str, metric_inputs: List[MetricInput], **kwargs) List[float] [source]¶
Compute a Hugging Face evaluate metric.
- Parameters:
instance – The instance of a Hugging Face evaluate metric.
key – The key to retrieve result score from huggingface evaluate result.
metric_inputs – A list of MetricInput schema
kwargs – The additional arguments for metric function.
- Returns:
The list of scores.
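This helper is mainly used internally by the lexical metrics above. The sketch below shows how it might be called directly, assuming instance is a metric loaded with the Hugging Face evaluate package and key is the name under which that metric reports its score.

```python
import evaluate  # the Hugging Face `evaluate` package

from autorag.evaluation.metric.generation import huggingface_evaluate
from autorag.schema.metricinput import MetricInput  # import path assumed

inputs = [MetricInput(generation_gt=["Paris is the capital of France."],
                      generated_texts="The capital of France is Paris.")]

# Load a Hugging Face metric and name the key under which it reports its score.
meteor_instance = evaluate.load("meteor")
scores = huggingface_evaluate(meteor_instance, "meteor", metric_inputs=inputs)
```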
- autorag.evaluation.metric.generation.make_generator_instance(generator_module_type: str, llm: str, **kwargs)[source]¶
- autorag.evaluation.metric.generation.meteor(metric_inputs: List[MetricInput], alpha: float = 0.9, beta: float = 3.0, gamma: float = 0.5) List[float] [source]¶
Compute meteor score for generation.
- Parameters:
metric_inputs – A list of MetricInput schema (Required Field -> “generation_gt”, “generated_texts”)
alpha – Parameter for controlling relative weights of precision and recall. Default is 0.9.
beta – Parameter for controlling the shape of the penalty as a function of fragmentation. Default is 3.0.
gamma – Relative weight assigned to fragmentation penalty. Default is 0.5.
- Returns:
A list of computed metric scores.
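A sketch with the default weighting written out explicitly; the MetricInput construction is the same assumption as in the BLEU sketch.

```python
from autorag.evaluation.metric.generation import meteor
from autorag.schema.metricinput import MetricInput  # import path assumed

inputs = [MetricInput(generation_gt=["Paris is the capital of France."],
                      generated_texts="The capital of France is Paris.")]

# Defaults written out explicitly: alpha balances precision and recall,
# beta and gamma shape the fragmentation penalty.
scores = meteor(metric_inputs=inputs, alpha=0.9, beta=3.0, gamma=0.5)
```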
- autorag.evaluation.metric.generation.rouge(metric_inputs: List[MetricInput], rouge_type: str | None = 'rougeL', use_stemmer: bool = False, split_summaries: bool = False, batch: int = 4) List[float] [source]¶
Compute rouge score for generation.
- Parameters:
metric_inputs – A list of MetricInput schema (Required Field -> “generation_gt”, “generated_texts”)
rouge_type – The rouge type to use for evaluation. Default is ‘rougeL’. Choose between rouge1, rouge2, rougeL, and rougeLsum. - rouge1: unigram (1-gram) based scoring. - rouge2: bigram (2-gram) based scoring. - rougeL: Longest Common Subsequence based scoring. - rougeLsum: splits the text on newlines (“\n”) and applies rougeL at the summary level.
use_stemmer – Bool indicating whether a Porter stemmer should be used to strip word suffixes to improve matching. This arg is used in the DefaultTokenizer, but other tokenizers might or might not choose to use this. Default is False.
split_summaries – Whether to add newlines between sentences for rougeLsum. Default is False.
batch – The batch size for processing. Default is your CPU count.
- Returns:
A list of computed metric scores.
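A sketch selecting a non-default rouge type; the MetricInput construction is the same assumption as in the BLEU sketch.

```python
from autorag.evaluation.metric.generation import rouge
from autorag.schema.metricinput import MetricInput  # import path assumed

inputs = [MetricInput(generation_gt=["Paris is the capital of France."],
                      generated_texts="The capital of France is Paris.")]

# rougeLsum with stemming; rouge1, rouge2, and rougeL are also accepted.
scores = rouge(metric_inputs=inputs, rouge_type="rougeLsum",
               use_stemmer=True, split_summaries=True)
```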
- autorag.evaluation.metric.generation.sem_score(metric_inputs: List[MetricInput], embedding_model: BaseEmbedding | None = None, batch: int = 128) List[float] [source]¶
Compute sem score between the generation ground truth and the prediction with cosine similarity.
- Parameters:
metric_inputs – A list of MetricInput schema (Required Field -> “generation_gt”, “generated_texts”)
embedding_model – Embedding model to use to compute cosine similarity. Default is the all-mpnet-base-v2 embedding model, which is the model used in the paper.
batch – The batch size for processing. Default is 128.
- Returns:
A list of computed metric scores.
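A sketch using the default embedding model; the MetricInput construction is the same assumption as in the BLEU sketch, and passing a custom llama_index BaseEmbedding via embedding_model follows from the signature but is not shown.

```python
from autorag.evaluation.metric.generation import sem_score
from autorag.schema.metricinput import MetricInput  # import path assumed

inputs = [MetricInput(generation_gt=["Paris is the capital of France."],
                      generated_texts="The capital of France is Paris.")]

# Uses the default all-mpnet-base-v2 embedding model; any llama_index
# BaseEmbedding instance can be passed via `embedding_model` instead.
scores = sem_score(metric_inputs=inputs, batch=128)
```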
autorag.evaluation.metric.retrieval module¶
- autorag.evaluation.metric.retrieval.retrieval_f1(metric_input: MetricInput)[source]¶
Compute f1 score for retrieval.
- Parameters:
metric_input – The MetricInput schema for AutoRAG metric.
- Returns:
The f1 score.
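A hedged sketch of the retrieval metrics. This page does not list the required MetricInput fields for retrieval, so the field names retrieval_gt and retrieved_ids below are assumptions to check against your AutoRAG version.

```python
from autorag.evaluation.metric.retrieval import retrieval_f1, retrieval_precision, retrieval_recall
from autorag.schema.metricinput import MetricInput  # import path assumed

# Field names (retrieval_gt, retrieved_ids) are assumptions about the schema;
# check MetricInput in your AutoRAG version for the required retrieval fields.
metric_input = MetricInput(
    retrieval_gt=[["doc-1"], ["doc-7"]],        # ground-truth ids
    retrieved_ids=["doc-1", "doc-3", "doc-7"],  # ids returned by the retriever
)
print(retrieval_f1(metric_input), retrieval_precision(metric_input), retrieval_recall(metric_input))
```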
- autorag.evaluation.metric.retrieval.retrieval_map(metric_input: MetricInput) float [source]¶
Mean Average Precision (MAP) is the mean of Average Precision (AP) for all queries.
- autorag.evaluation.metric.retrieval.retrieval_mrr(metric_input: MetricInput) float [source]¶
Reciprocal Rank (RR) is the reciprocal of the rank of the first relevant item. The mean of RR over all queries is MRR.
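A small self-contained illustration of the RR and AP definitions above, independent of AutoRAG:

```python
# Relevance of each retrieved item, in rank order (1 = relevant).
relevance = [0, 1, 0, 1, 1]

# Reciprocal Rank: 1 / rank of the first relevant item -> 1/2 here.
rr = next(1.0 / (i + 1) for i, rel in enumerate(relevance) if rel)

# Average Precision: mean of precision@k over the ranks k of relevant items.
hits, precisions = 0, []
for k, rel in enumerate(relevance, start=1):
    if rel:
        hits += 1
        precisions.append(hits / k)
ap = sum(precisions) / len(precisions)  # (1/2 + 2/4 + 3/5) / 3 = 0.533...

print(rr, ap)  # MRR and MAP are the means of RR and AP over all queries.
```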
- autorag.evaluation.metric.retrieval.retrieval_ndcg(metric_input: MetricInput) float [source]¶
- autorag.evaluation.metric.retrieval.retrieval_precision(metric_input: MetricInput) float [source]¶
- autorag.evaluation.metric.retrieval.retrieval_recall(metric_input: MetricInput) float [source]¶
autorag.evaluation.metric.retrieval_contents module¶
This module contains the retrieval contents metrics, which are calculated from the contents of the retrieved items.
- autorag.evaluation.metric.retrieval_contents.retrieval_token_f1(metric_input: MetricInput)[source]¶
- autorag.evaluation.metric.retrieval_contents.retrieval_token_precision(metric_input: MetricInput)[source]¶
- autorag.evaluation.metric.retrieval_contents.retrieval_token_recall(metric_input: MetricInput)[source]¶
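These functions are listed without docstrings. The sketch below is a conceptual illustration of token-level precision, recall, and F1 over retrieved contents, in the spirit of the module note above, not the exact AutoRAG implementation.

```python
# Conceptual illustration only; the exact tokenization and aggregation in AutoRAG may differ.
from collections import Counter

gt_tokens = Counter("the cat sat on the mat".split())
retrieved_tokens = Counter("a cat sat on a mat".split())

overlap = sum((gt_tokens & retrieved_tokens).values())  # 4 shared tokens
precision = overlap / sum(retrieved_tokens.values())    # 4 / 6
recall = overlap / sum(gt_tokens.values())              # 4 / 6
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```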