Generation Metrics

1. BLEU

BLEU (Bilingual Evaluation Understudy)

📌Definition

An n-gram-based metric.

It measures the extent to which the words (n-grams) of the generated sentence appear in the reference sentence. In AutoRAG, this means measuring how much of the LLM-generated result appears in the ground-truth answer (Answer gt).
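For illustration, here is a minimal BLEU computation using NLTK. This is only a sketch of the underlying metric with made-up sentences, not AutoRAG's internal code.

# A minimal BLEU sketch with NLTK (pip install nltk); not AutoRAG's internal code.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the capital of france is paris".split()   # ground-truth answer (Answer gt)
candidate = "paris is the capital of france".split()   # LLM-generated result

# BLEU counts how many n-grams of the candidate also appear in the reference.
score = sentence_bleu(
    [reference],                                        # list of reference token lists
    candidate,
    weights=(0.5, 0.5),                                 # BLEU-2: unigrams and bigrams
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")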

2. ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

📌Definition

An n-gram-based metric.

It measures the extent to which the words (n-grams) of the reference sentence appear in the generated sentence. In AutoRAG, this means measuring how much of the ground-truth answer (Answer gt) appears in the LLM-generated result.
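For illustration, here is a minimal ROUGE computation using the rouge-score package; again, this is a sketch of the metric itself, not AutoRAG's internal code.

# A minimal ROUGE sketch with the rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

answer_gt = "the capital of france is paris"      # ground-truth answer (Answer gt)
generated = "paris is the capital of france"      # LLM-generated result

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(answer_gt, generated)       # score(target, prediction)

# Recall reflects how much of the reference is covered by the generation.
print(scores["rouge1"].recall, scores["rougeL"].fmeasure)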

3. METEOR

METEOR (Metric for Evaluation of Translation with Explicit ORdering)

📌Definition

Here is the paper link that introduced METEOR.

The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.

It also has several features that are not found in other metrics, such as stemming and synonym matching, along with the standard exact word matching.

The metric was designed to fix some of the problems found in the more popular BLEU metric, and also produce good correlation with human judgment at the sentence or segment level.

This differs from the BLEU metric in that BLEU seeks correlation at the corpus level.
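For illustration, here is a minimal METEOR computation with NLTK, which shows the stemming and synonym matching described above; it is not AutoRAG's internal code.

# A minimal METEOR sketch with NLTK; the WordNet corpus is needed for synonym matching.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

reference = "the cat sat on the mat".split()        # ground-truth answer, tokenized
candidate = "a cat was sitting on the mat".split()  # LLM-generated result, tokenized

# Unigrams are aligned by exact, stem, and synonym matches, then precision and
# recall are combined with recall weighted higher than precision.
score = meteor_score([reference], candidate)
print(f"METEOR: {score:.3f}")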

4. SemScore

📌Definition

Here is the paper link that introduced SemScore.

The concept of SemScore is quite simple.

It measures semantic similarity between ground truth and the model’s generation using an embedding model.

You can find more detailed information here.
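For illustration, here is a minimal sketch of the SemScore idea using sentence-transformers; the embedding model name below is just an example and may differ from the one AutoRAG uses.

# A minimal SemScore-style sketch (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")    # example embedding model

answer_gt = "Paris is the capital of France."       # ground-truth answer
generated = "The capital city of France is Paris."  # LLM-generated result

emb_gt, emb_gen = model.encode([answer_gt, generated], convert_to_tensor=True)
sem_score = util.cos_sim(emb_gt, emb_gen).item()    # cosine similarity of the two embeddings
print(f"SemScore: {sem_score:.3f}")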

5. G-Eval

📌 Definition

Here is the link that introduced G-Eval.

G-Eval is a framework that uses large language models with chain-of-thought (CoT) reasoning and a form-filling paradigm to assess the quality of NLG outputs.

The paper reports that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with humans on a summarization task, outperforming all previous methods by a large margin.

For this reason, AutoRAG uses G-Eval with GPT-4.
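To make the idea concrete, here is a minimal sketch of CoT-style, form-filling evaluation with the OpenAI SDK; the prompt wording and criterion are illustrative only and do not reproduce AutoRAG's actual G-Eval prompts.

# A minimal G-Eval-style sketch with the OpenAI SDK; the prompt is illustrative only.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

prompt = """You will be given a question and an answer.
Your task is to rate the answer for consistency on a scale of 1 to 5.

Evaluation steps:
1. Read the question and note the information it asks for.
2. Check whether the answer contradicts the question or the facts it presents.
3. Output a single score from 1 (inconsistent) to 5 (fully consistent).

Question: What is the capital of France?
Answer: The capital of France is Paris.

Score (number only):"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)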


5-1. Coherence

  • Evaluate whether the answer is logically consistent and flows naturally.

  • Evaluate the connections between sentences and how they fit into the overall context.


5-2. Consistency

  • Evaluate whether the answer is consistent with and does not contradict the question asked or the information presented.

  • An answer should provide information that does not conflict with the requirements of the question or the data presented.


5-3. Fluency

  • Evaluate whether the answer is fluent, i.e., grammatically correct and natural to read.


5-4. Relevance

  • Evaluate how well the answer meets the question’s requirements.

  • A highly relevant answer should be directly related to the question’s core topic or keyword.

❗How to use specific G-Eval metrics

You can select specific G-Eval sub-metrics with the metrics parameter.

Here is an example YAML file that uses the G-Eval consistency metric.

- metric_name: g_eval
  metrics: [ consistency ]

6. BERT Score

📌Definition

Here is the link that introduced BERT Score.

A metric that measures the similarity between two sentences using BERT’s contextual embeddings.

It obtains the contextual embeddings of the ground-truth answer and the LLM-generated result with BERT, computes the cosine similarity for each token pair, and weights each token with IDF.
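For illustration, here is a minimal BERT Score computation with the bert-score package; it is a sketch of the metric, not AutoRAG's internal code.

# A minimal BERTScore sketch with the bert-score package (pip install bert-score).
from bert_score import score

generated = ["Paris is the capital of France."]     # LLM-generated results
answer_gt = ["The capital of France is Paris."]     # ground-truth answers

# Token-level contextual embeddings are matched by cosine similarity and
# aggregated into precision, recall, and F1, with optional IDF weighting.
P, R, F1 = score(generated, answer_gt, lang="en", idf=True)
print(f"BERTScore F1: {F1.mean().item():.3f}")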