Retrieval Token Metrics

0. Retrieval token metric in AutoRAG

In AutoRAG, the retrieval token metrics are currently used only by the Passage Compressor node. They measure performance by comparing each compressed passage to the answer gt (the ground-truth answer).

The passage and the answer gt are compared token by token, as the following example shows.

✅Basic Example

answer gt = ['Do you want to buy some?']

result = ['Do you want to buy some?', 'I want to buy some', 'I want to buy some water']

First, let’s break gt and each result into tokens (lowercased, with punctuation removed):

  • GT has a total of 6 tokens: ['do', 'you', 'want', 'to', 'buy', 'some']

  • The three results have 6, 5, and 6 tokens, respectively: ['do', 'you', 'want', 'to', 'buy', 'some'], ['i', 'want', 'to', 'buy', 'some'], ['i', 'want', 'to', 'buy', 'some', 'water']

Next, let’s count the tokens that each result shares with gt:

  • In the first result, all six tokens overlap with GT, so the number of overlapping tokens is 6.

  • In the second result, every token except ‘i’ overlaps, so the number of overlapping tokens is 4.

  • In the third result, every token except ‘i’ and ‘water’ overlaps, so the number of overlapping tokens is 4.
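The tokenizing and counting steps above can be sketched in Python. This is a minimal illustration, not AutoRAG's actual code: the helper names `tokenize` and `overlap_count` are hypothetical, and the normalization (lowercase, strip punctuation, split on whitespace) and the multiset-style overlap are assumptions chosen to match the example.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumed normalization: lowercase, replace punctuation with spaces,
    # then split on whitespace.
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def overlap_count(gt: str, result: str) -> int:
    # Multiset intersection: a repeated token counts only as many times
    # as it appears in both the gt and the result.
    common = Counter(tokenize(gt)) & Counter(tokenize(result))
    return sum(common.values())


answer_gt = "Do you want to buy some?"
results = [
    "Do you want to buy some?",
    "I want to buy some",
    "I want to buy some water",
]

gt_tokens = tokenize(answer_gt)
result_token_counts = [len(tokenize(r)) for r in results]
overlaps = [overlap_count(answer_gt, r) for r in results]

print(gt_tokens)            # ['do', 'you', 'want', 'to', 'buy', 'some']
print(result_token_counts)  # [6, 5, 6]
print(overlaps)             # [6, 4, 4]
```

Using a `Counter` intersection rather than a plain set keeps duplicate tokens from being over-counted, which matters once passages get longer than this example.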

1. Token Precision

📌Definition

Number of overlapping tokens / number of tokens in the result

✅Apply Basic Example

First, 6/6 = 1

Second, 4/5 = 0.8

Third, 4/6 = 2/3 = 0.666…

Therefore, token precision is 0.822..., the average of the three.
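As a rough sketch of the calculation above, the following Python computes precision per result and then averages. The function name `token_precision` and the normalization inside `tokenize` are assumptions made for illustration, not AutoRAG's API.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumed normalization: lowercase, drop punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def token_precision(gt: str, result: str) -> float:
    result_tokens = tokenize(result)
    if not result_tokens:
        return 0.0  # assumed convention for an empty result
    overlap = sum((Counter(tokenize(gt)) & Counter(result_tokens)).values())
    # Overlapping tokens divided by the number of tokens in the result.
    return overlap / len(result_tokens)


answer_gt = "Do you want to buy some?"
results = [
    "Do you want to buy some?",
    "I want to buy some",
    "I want to buy some water",
]

precisions = [token_precision(answer_gt, r) for r in results]
mean_precision = sum(precisions) / len(precisions)
print(f"{mean_precision:.3f}")  # 0.822
```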

2. Token Recall

📌Definition

Number of overlapping tokens / number of tokens in gt

✅Apply Basic Example

First, 6/6 = 1

Second, 4/6 = 0.666…

Third, 4/6 = 2/3 = 0.666…

Therefore, Token Recall is 0.777…, the average of the three.
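The recall calculation can be sketched the same way; only the denominator changes, from the result's token count to the gt's token count. As before, `token_recall` and `tokenize` are hypothetical names and the normalization is an assumption.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumed normalization: lowercase, drop punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def token_recall(gt: str, result: str) -> float:
    gt_tokens = tokenize(gt)
    if not gt_tokens:
        return 0.0  # assumed convention for an empty gt
    overlap = sum((Counter(gt_tokens) & Counter(tokenize(result))).values())
    # Overlapping tokens divided by the number of tokens in the gt.
    return overlap / len(gt_tokens)


answer_gt = "Do you want to buy some?"
results = [
    "Do you want to buy some?",
    "I want to buy some",
    "I want to buy some water",
]

recalls = [token_recall(answer_gt, r) for r in results]
mean_recall = sum(recalls) / len(recalls)
print(f"{mean_recall:.3f}")  # 0.778
```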

3. Token F1

📌Definition

F1 score is the harmonic mean of Precision and Recall.

F1 = 2 × Precision × Recall / (Precision + Recall)

✅Apply Basic Example

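F1 is computed for each result from that result’s own precision and recall, and the three scores are then averaged.

First, 2 × 1 × 1 / (1 + 1) = 1

Second, 2 × 0.8 × 0.666… / (0.8 + 0.666…) = 8/11 = 0.727…

Third, 2 × 0.666… × 0.666… / (0.666… + 0.666…) = 2/3 = 0.666…

Therefore, Token F1 is 0.797979…, the average of the three. Note that this differs slightly from the harmonic mean of the averaged Precision (0.822…) and the averaged Recall (0.777…), which would be 0.799…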
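The 0.797979… figure can be reproduced by computing F1 separately for each result and averaging the three scores. The sketch below does that and, for comparison, also takes the harmonic mean of the averaged precision and recall. The helper `token_prf` is a hypothetical name and the tokenization is an assumption, so treat this as an illustration rather than AutoRAG's implementation.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumed normalization: lowercase, drop punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def token_prf(gt: str, result: str) -> tuple[float, float, float]:
    gt_tokens, result_tokens = tokenize(gt), tokenize(result)
    overlap = sum((Counter(gt_tokens) & Counter(result_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0  # avoids dividing by zero when nothing overlaps
    precision = overlap / len(result_tokens)
    recall = overlap / len(gt_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


answer_gt = "Do you want to buy some?"
results = [
    "Do you want to buy some?",
    "I want to buy some",
    "I want to buy some water",
]

per_result = [token_prf(answer_gt, r) for r in results]
n = len(per_result)
mean_precision = sum(p for p, _, _ in per_result) / n
mean_recall = sum(r for _, r, _ in per_result) / n
mean_f1 = sum(f for _, _, f in per_result) / n

# Harmonic mean of the averaged precision and recall, for comparison only.
f1_of_means = 2 * mean_precision * mean_recall / (mean_precision + mean_recall)

print(f"{mean_f1:.4f}")      # 0.7980
print(f"{f1_of_means:.4f}")  # 0.7994
```

The two numbers are close but not equal, so it is worth being explicit about which aggregation a reported F1 uses.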