Available Metrics

When creating an evaluation run, the following five metrics are available out of the box:
  • Bleu - Measure quality of translation
  • Rouge - Measure quality of summary or translation
  • Meteor - Measure quality of translation using semantic matching
  • Cosine Similarity - Assess the similarity of two texts by measuring the distance between their embeddings in vector space
  • F1 Score - Measure token-level precision and recall

The fields available for comparison are defined by the schema of the dataset. For example, summarization datasets offer document, summary, and expected_summary as choices for comparison.

Bleu

Library: nltk.translate.bleu_score.sentence_bleu
Non-configurable parameters:
  • weights - 0.25 for each of the four n-gram orders (1- to 4-grams)
  • tokenizer - nltk.tokenize.word_tokenize
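
For reference, a minimal sketch of an equivalent call with these fixed settings; the example sentences are placeholders and this is not the service's own implementation:

```python
# Minimal sketch of a BLEU computation with the fixed settings above.
# nltk.download("punkt") may be needed once for word_tokenize.
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu

reference = "the cat sat on the mat"   # e.g. the expected (ground-truth) text
candidate = "a cat sat on the mat"     # e.g. the generated text

score = sentence_bleu(
    [word_tokenize(reference)],        # list of tokenized references
    word_tokenize(candidate),          # tokenized candidate
    weights=(0.25, 0.25, 0.25, 0.25),  # uniform weight over 1- to 4-grams
)
print(f"BLEU: {score:.3f}")
```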

Rouge

Library: rouge_score.rouge_scorer
Configurable parameters:
  • score_types: List[str] - defines which ROUGE metrics are reported. Defaults to ["rouge1", "rouge2", "rougeL"]
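
A minimal sketch of driving the underlying scorer with the default score_types; the input strings are placeholders, not part of the service:

```python
# Minimal sketch of ROUGE scoring with the default score_types.
from rouge_score import rouge_scorer

score_types = ["rouge1", "rouge2", "rougeL"]  # the default configuration
scorer = rouge_scorer.RougeScorer(score_types)

scores = scorer.score(
    "the quick brown fox jumps over the lazy dog",  # reference text
    "a quick brown fox jumped over a lazy dog",     # generated text
)
for name, result in scores.items():
    print(name, f"precision={result.precision:.3f}", f"f1={result.fmeasure:.3f}")
```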

Meteor

Library: nltk.translate.meteor_score
Non-configurable parameters:
  • stemmer - PorterStemmer
  • wordnet - nltk.corpus.wordnet
  • alpha=0.9, beta=3.0, gamma=0.5
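
A minimal sketch of the underlying call; alpha, beta, and gamma are passed explicitly to mirror the fixed values above (the Porter stemmer and WordNet are already nltk's defaults), and the example sentences are placeholders:

```python
# Minimal sketch of a METEOR computation (recent nltk versions require pre-tokenized input).
# nltk.download("wordnet") and nltk.download("punkt") may be needed on first use.
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score

reference = "the cat sat on the mat"
candidate = "a cat sat on the mat"

score = meteor_score(
    [word_tokenize(reference)],      # one or more tokenized references
    word_tokenize(candidate),        # tokenized hypothesis
    alpha=0.9, beta=3.0, gamma=0.5,  # matches the fixed configuration above
)
print(f"METEOR: {score:.3f}")
```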

Cosine Similarity

Library: sklearn.metrics.pairwise.cosine_similarity
Non-configurable parameters:
  • embedding model - sentence-transformers/all-MiniLM-L12-v2
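
A minimal sketch combining the stated embedding model with scikit-learn's cosine_similarity; the two strings stand in for whichever dataset fields are being compared:

```python
# Minimal sketch: embed both texts, then compare them with cosine similarity.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

embeddings = model.encode([
    "the cat sat on the mat",        # e.g. the expected text
    "a cat was sitting on the mat",  # e.g. the generated text
])
score = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
print(f"Cosine similarity: {score:.3f}")
```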

F1 Score

Matching algorithm: the ground truth and the predicted answer are tokenized case-insensitively, then tokens are matched exactly without regard to order; precision and recall over the matched tokens are combined into F1.
Non-configurable parameters:
  • tokenizer - nltk.tokenize.word_tokenize
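
A minimal sketch of the matching algorithm described above; the token_f1 helper is illustrative, not a library function:

```python
# Minimal sketch of token-level F1: case-insensitive, order-independent exact matching.
from collections import Counter
from nltk.tokenize import word_tokenize

def token_f1(ground_truth: str, prediction: str) -> float:
    truth_tokens = word_tokenize(ground_truth.lower())
    pred_tokens = word_tokenize(prediction.lower())
    # Multiset intersection counts shared tokens regardless of order.
    overlap = sum((Counter(truth_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("The cat sat on the mat", "cat on the mat"), 3))
```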