Available Metrics

When creating an evaluation run, these four metrics are available out of the box:

  • Bleu - Measures translation quality

  • Rouge - Measures the quality of a summary or translation

  • Meteor - Measures translation quality using semantic matching

  • Cosine Similarity - Assesses similarity between two texts by measuring the distance between their embeddings in vector space

The fields available for comparison are defined by the schema of the dataset. For example, summarization datasets offer document, summary, and expected_summary as choices for comparison.

Bleu

Library: nltk.translate.bleu_score.sentence_bleu

Non-configurable parameters:

  • weights - fixed at 0.25 for each of the 1- to 4-gram scores (see the example below)
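
A minimal sketch of how this score can be reproduced with NLTK's sentence_bleu; the sample sentences and the smoothing function are illustrative assumptions, not part of the documented configuration:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "the cat sat on the mat".split()
    candidate = "the cat is on the mat".split()

    # weights fixed at 0.25 for the 1- to 4-gram scores, as listed above
    score = sentence_bleu(
        [reference],
        candidate,
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=SmoothingFunction().method1,  # illustrative assumption
    )
    print(score)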

Rouge

Library: rouge_score.rouge_scorer

Configurable parameters:

  • score_types: List[str] - defines which ROUGE score types are output, as shown in the example below. Defaults to ["rouge1", "rouge2", "rougeL"]
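
A minimal sketch of the Rouge computation with rouge_score, using the default score_types; the sample texts and the use_stemmer flag are illustrative assumptions:

    from rouge_score import rouge_scorer

    # score_types are passed straight through to RougeScorer
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"],
        use_stemmer=True,  # illustrative assumption, not a documented parameter
    )
    scores = scorer.score(
        target="the quick brown fox jumps over the lazy dog",
        prediction="a quick brown fox jumped over a lazy dog",
    )
    for score_type, score in scores.items():
        print(score_type, score.fmeasure)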

Meteor

Library: nltk.translate.meteor_score

Non-configurable parameters:

  • stemmer - PorterStemmer

  • wordnet - nltk.corpus.wordnet

  • alpha = 0.9, beta = 3.0, gamma = 0.5 (see the example below)
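
A minimal sketch of the Meteor computation with NLTK, wiring in the fixed stemmer, wordnet, and alpha/beta/gamma values listed above; the sample sentences are illustrative, and the WordNet corpus must be downloaded first:

    import nltk
    from nltk.corpus import wordnet
    from nltk.stem.porter import PorterStemmer
    from nltk.translate.meteor_score import meteor_score

    nltk.download("wordnet", quiet=True)

    reference = "the cat sat on the mat".split()
    hypothesis = "the cat is sitting on the mat".split()

    score = meteor_score(
        [reference],              # list of tokenized references
        hypothesis,               # tokenized hypothesis
        stemmer=PorterStemmer(),
        wordnet=wordnet,
        alpha=0.9,
        beta=3.0,
        gamma=0.5,
    )
    print(score)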

Cosine Similarity

Library: sklearn.metrics.pairwise.cosine_similarity

Non-configurable parameters:

  • embedding model - sentence-transformers/all-MiniLM-L12-v2 (see the example below)
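
A minimal sketch of the Cosine Similarity computation, assuming the texts are embedded with the sentence-transformers library before being compared with scikit-learn; the sample sentences are illustrative:

    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    # non-configurable embedding model listed above
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

    embeddings = model.encode([
        "The report was summarized accurately.",
        "The summary reflects the report well.",
    ])

    # cosine_similarity expects 2D inputs; the result is a 1x1 matrix
    score = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    print(score)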