Evaluation Metrics
Quantify the performance of an evaluation.
Available Metrics
When creating an evaluation run, these four metrics are available out of the box:
- Bleu - Measures the quality of a translation
- Rouge - Measures the quality of a summary or translation
- Meteor - Measures the quality of a translation using semantic matching
- Cosine Similarity - Assesses the similarity of two pieces of text by measuring their distance in vector space
The available fields to compare for the metrics are defined by the schema of the dataset. For example, summarization datasets will have document, summary, and expected_summary as choices for comparison.
Bleu
Library: nltk.translate.bleu_score.sentence_bleu
Non-configurable parameters:
- weights - 0.25 for all n-grams
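For reference, here is a minimal sketch of the underlying NLTK call with those fixed weights. The sentences and whitespace tokenization are placeholders; the platform's exact preprocessing may differ.

```python
from nltk.translate.bleu_score import sentence_bleu

# Placeholder texts, split on whitespace for illustration.
reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

# Equal weights of 0.25 for 1- to 4-grams, matching the non-configurable default above.
score = sentence_bleu([reference], candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(f"BLEU: {score:.4f}")
```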
Rouge
Library: rouge_score.rouge_scorer
Configurable parameters:
- score_types: List[str] - defines which ROUGE-N metrics will be output. Defaults to ["rouge1", "rouge2", "rougeL"]
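A minimal sketch of the underlying rouge_score call with the default score_types; the target and prediction strings are placeholders, and any stemming or other preprocessing applied by the platform is not shown.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
scores = scorer.score(
    target="the expected summary",       # reference text
    prediction="the generated summary",  # model output
)
for name, result in scores.items():
    print(name, f"F1={result.fmeasure:.3f}")
```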
Meteor
Library: nltk.translate.meteor_score
Non-configurable parameters:
- stemmer - PorterStemmer
- wordnet - nltk.corpus.wordnet
- alpha=0.9, beta=3.0, gamma=0.5
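A minimal sketch of the underlying NLTK call with the fixed parameters above. It assumes whitespace tokenization (recent NLTK versions expect pre-tokenized input) and that the WordNet corpus has been downloaded; the example sentences are placeholders.

```python
from nltk.corpus import wordnet
from nltk.stem.porter import PorterStemmer
from nltk.translate.meteor_score import meteor_score

# Requires: nltk.download("wordnet")
reference = "the cat sat on the mat".split()
hypothesis = "the cat is on the mat".split()

score = meteor_score(
    [reference],
    hypothesis,
    stemmer=PorterStemmer(),
    wordnet=wordnet,
    alpha=0.9,
    beta=3.0,
    gamma=0.5,
)
print(f"METEOR: {score:.4f}")
```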
Cosine Similarity
Library: sklearn.metrics.pairwise.cosine_similarity
Non-configurable parameters:
- embedding model - sentence-transformers/all-MiniLM-L12-v2
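A minimal sketch of how the two pieces are typically combined: embed both texts with the sentence-transformers model named above, then compare the vectors with scikit-learn's cosine_similarity. The input strings are placeholders, and the platform may batch or normalize differently.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
embeddings = model.encode(["the generated answer", "the expected answer"])

# cosine_similarity expects 2D arrays; wrap each 1D embedding in a list.
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
print(f"Cosine similarity: {similarity:.4f}")
```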