Evaluation Metrics
Quantify the performance of an evaluation.
Available Metrics
When creating an evaluation run, these five metrics are available out of the box:
- Bleu - Measure the quality of a translation
- Rouge - Measure the quality of a summary or translation
- Meteor - Measure the quality of a translation using semantic matching
- Cosine Similarity - Assess the similarity of two texts by measuring the distance between their embeddings in vector space
- F1 Score - Measure token-level precision and recall
The available fields to compare for each metric are defined by the schema of the dataset. For example, summarization datasets will have `document`, `summary`, and `expected_summary` as choices for comparison.
Bleu
Library: `nltk.sentence_bleu`
Non-configurable parameters:
- `weights` - `0.25` for all n-grams
- `tokenizer` - `nltk.tokenize.word_tokenize`
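For reference, the computation is roughly equivalent to the sketch below, using the `sentence_bleu` import path from recent NLTK releases. The helper name and example strings are illustrative, and any extra preprocessing applied before scoring is not specified here.

```python
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu

# Requires the "punkt" tokenizer data: nltk.download("punkt")

def bleu(expected: str, generated: str) -> float:
    """Sentence-level BLEU with equal 0.25 weights over 1- to 4-grams."""
    reference_tokens = word_tokenize(expected)
    candidate_tokens = word_tokenize(generated)
    return sentence_bleu(
        [reference_tokens],                # list of reference token lists
        candidate_tokens,
        weights=(0.25, 0.25, 0.25, 0.25),  # the non-configurable default above
    )
```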
Rouge
Library: `rouge_score.rouge_scorer`
Configurable parameters:
- `score_types: List[str]` - defines which ROUGE variants are reported. Defaults to `["rouge1", "rouge2", "rougeL"]`
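A minimal sketch of how these scores can be produced with the `rouge_score` package, using the default `score_types` above; the example strings are placeholders, and any stemming or other preprocessing done before scoring is not specified here.

```python
from rouge_score import rouge_scorer

# Each requested score type yields precision, recall, and F-measure.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
scores = scorer.score(
    target="the expected summary text",       # ground truth
    prediction="the generated summary text",  # model output
)
print(scores["rougeL"].fmeasure)
```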
Meteor
Library: `nltk.translate.meteor_score`
Non-configurable parameters:
- `stemmer` - `PorterStemmer`
- `wordnet` - `nltk.corpus.wordnet`
- `alpha=0.9`, `beta=3.0`, `gamma=0.5`
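A sketch using NLTK's `meteor_score` with the fixed parameters above. Recent NLTK versions expect pre-tokenized inputs; the use of `word_tokenize` here is an assumption, and the example strings are placeholders.

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score

# METEOR needs WordNet data (and "punkt" for word_tokenize).
nltk.download("wordnet")
nltk.download("punkt")

# PorterStemmer and nltk.corpus.wordnet are the function's defaults,
# matching the non-configurable parameters listed above.
score = meteor_score(
    [word_tokenize("the expected translation")],  # one or more references
    word_tokenize("the generated translation"),
    alpha=0.9, beta=3.0, gamma=0.5,               # fixed values listed above
)
```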
Cosine Similarity
Library: `sklearn.metrics.pairwise.cosine_similarity`
Non-configurable parameters:
- embedding model - `sentence-transformers/all-MiniLM-L12-v2`
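A sketch of the computation, assuming the two texts are embedded with the `sentence-transformers` package and then compared pairwise; the example strings are placeholders.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
embeddings = model.encode(["the generated answer", "the expected answer"])

# cosine_similarity expects 2D arrays; take the single pairwise value.
score = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
```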
F1 Score
Matching algorithm: lowercase and tokenize the ground truth and the predicted answer, then match tokens exactly, without considering token order.
Non-configurable parameters:
- `tokenizer` - `nltk.tokenize.word_tokenize`
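One common way to implement this kind of order-independent token F1 is the SQuAD-style computation sketched below; the helper name is illustrative and the exact edge-case handling may differ.

```python
from collections import Counter
from nltk.tokenize import word_tokenize

def token_f1(ground_truth: str, prediction: str) -> float:
    gt_tokens = word_tokenize(ground_truth.lower())
    pred_tokens = word_tokenize(prediction.lower())

    # Multiset intersection: exact token matches, ignoring order.
    common = Counter(gt_tokens) & Counter(pred_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0

    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```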