Available Metrics

When creating an evaluation run, these five metrics are available out of the box:

  • Bleu - Measures the quality of a translation

  • Rouge - Measures the quality of a summary or translation

  • Meteor - Measures the quality of a translation using semantic matching

  • Cosine Similarity - Assesses similarity by measuring the distance between text embeddings in vector space

  • F1 Score - Measures token-level precision and recall

The fields available for comparison are defined by the schema of the dataset. For example, summarization datasets offer document, summary, and expected_summary as comparison choices.

Bleu

Library: nltk.translate.bleu_score.sentence_bleu

Non-configurable parameters:

  • weights - 0.25 for each of the 1- to 4-gram orders

  • tokenizer - nltk.tokenize.word_tokenize
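
A minimal sketch of a BLEU computation with these settings, assuming a single reference per prediction; the example strings are illustrative only:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu

nltk.download("punkt", quiet=True)  # tokenizer data required by word_tokenize

reference = "The cat sat on the mat."
prediction = "A cat was sitting on the mat."

score = sentence_bleu(
    [word_tokenize(reference)],        # list of tokenized references
    word_tokenize(prediction),         # tokenized candidate
    weights=(0.25, 0.25, 0.25, 0.25),  # equal weight for 1- to 4-grams
)
print(score)
```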

Rouge

Library: rouge_score.rouge_scorer

Configurable parameters:

  • score_types: List[str] - defines which ROUGE variants are reported. Defaults to ["rouge1", "rouge2", "rougeL"]
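
A minimal sketch of how these scores could be produced with rouge_scorer, using the default score_types listed above (whether stemming is applied is not specified here, so it is left at the library default):

```python
from rouge_score import rouge_scorer

# The list passed here corresponds to the configurable score_types parameter.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])

scores = scorer.score(
    target="The quick brown fox jumps over the lazy dog.",   # reference text
    prediction="A quick brown fox jumped over a lazy dog.",  # generated text
)
for name, result in scores.items():
    print(name, result.precision, result.recall, result.fmeasure)
```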

Meteor

Library: nltk.translate.meteor_score

Non-configurable parameters:

  • stemmer - PorterStemmer

  • wordnet - nltk.corpus.wordnet

  • alpha=0.9, beta=3.0, gamma=0.5
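
A minimal sketch of a METEOR call with the fixed parameters above; the use of word_tokenize for input tokenization and the single-reference setup are assumptions:

```python
import nltk
from nltk.corpus import wordnet
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score

nltk.download("punkt", quiet=True)    # tokenizer data
nltk.download("wordnet", quiet=True)  # corpus used for synonym matching

reference = "The cat sat on the mat."
prediction = "A cat was sitting on the mat."

score = meteor_score(
    [word_tokenize(reference)],  # list of tokenized references
    word_tokenize(prediction),   # tokenized hypothesis
    stemmer=PorterStemmer(),
    wordnet=wordnet,
    alpha=0.9,
    beta=3.0,
    gamma=0.5,
)
print(score)
```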

Cosine Similarity

Library: sklearn.metrics.pairwise.cosine_similarity

Non-configurable parameters:

  • embedding model - sentence-transformers/all-MiniLM-L12-v2
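
A minimal sketch of the computation, assuming the sentence-transformers package is used to embed the two texts before sklearn's cosine_similarity is applied:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the embedding model named above.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

# Embed the two texts to compare, then measure their distance in vector space.
embeddings = model.encode([
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
])
score = cosine_similarity(embeddings[0:1], embeddings[1:2])[0, 0]
print(score)
```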

F1 Score

Matching algorithm: lowercase and tokenize both the ground truth and the predicted answer, then count exact token matches, ignoring token order.

Non-configurable parameters:

  • tokenizer - nltk.tokenize.word_tokenize
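
A minimal sketch of the described matching; the count-based (multiset) token overlap and the helper name token_f1 are assumptions for illustration:

```python
from collections import Counter

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer data required by word_tokenize


def token_f1(ground_truth: str, prediction: str) -> float:
    """F1 over case-insensitive tokens, ignoring token order."""
    truth_tokens = word_tokenize(ground_truth.lower())
    pred_tokens = word_tokenize(prediction.lower())
    # Multiset intersection counts each token at most as often as it
    # appears in both texts, regardless of position.
    common = Counter(truth_tokens) & Counter(pred_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("The cat sat on the mat.", "A cat was sitting on the mat."))
```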