When you create an evaluation run, these five metrics are available out of the box (a brief illustration follows the list):
Bleu - Measures the quality of a translation
Rouge - Measures the quality of a summary or translation
Meteor - Measures the quality of a translation using semantic matching
Cosine Similarity - Assesses the similarity of two texts by measuring their distance in vector space
F1 Score - Measures token-level precision and recall
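To make the first two concrete, the sketch below scores a made-up prediction against a reference using the open-source nltk and rouge-score packages; this only illustrates what Bleu and Rouge measure and is not how the evaluation service computes them.

```python
# Illustrative only: common open-source BLEU and ROUGE implementations
# (pip install nltk rouge-score); the evaluation service may tokenize and
# score differently.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the quick brown fox jumps over the lazy dog"
prediction = "a quick brown fox jumped over the lazy dog"

# BLEU expects tokenized input: a list of reference token lists plus a hypothesis.
bleu = sentence_bleu(
    [reference.split()],
    prediction.split(),
    smoothing_function=SmoothingFunction().method1,  # avoids zero scores on short texts
)

# ROUGE works on raw strings and reports precision/recall/F-measure per variant.
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True).score(
    reference, prediction
)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```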
The fields available for comparison are defined by the dataset's schema. For example, a summarization dataset offers document, summary, and expected_summary as choices for comparison.
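As a hypothetical example of such a schema, a summarization record might look like the sketch below; only the field names come from the text above, while the structure and values are invented for illustration.

```python
# Hypothetical summarization record; only the field names (document, summary,
# expected_summary) come from the dataset schema described above.
record = {
    "document": "The city council met on Tuesday to debate next year's transit budget...",
    "summary": "Council debated the transit budget on Tuesday.",                # generated output (assumed)
    "expected_summary": "The council met Tuesday to discuss transit funding.",  # ground truth (assumed)
}

# A metric such as Rouge could then be configured to compare, for instance,
# record["summary"] against record["expected_summary"].
```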
Matching algorithm: tokenize the ground truth and the predicted answer case-insensitively, then match the tokens exactly, without considering their order.
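A minimal sketch of this kind of matching, expressed as a token-level F1 score, might look like the following; whitespace tokenization and counting shared tokens with multiplicity are assumptions, and the service's actual implementation may differ.

```python
from collections import Counter

def token_f1(ground_truth: str, prediction: str) -> float:
    """Token-level F1 with case-insensitive, order-insensitive exact matching.

    Whitespace tokenization and multiset overlap are assumptions of this sketch.
    """
    truth_tokens = ground_truth.lower().split()
    pred_tokens = prediction.lower().split()
    if not truth_tokens or not pred_tokens:
        return float(truth_tokens == pred_tokens)

    # Count tokens that appear in both strings, regardless of position.
    overlap = sum((Counter(truth_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("The cat sat on the mat", "the cat is on the mat"))  # ~0.833
```

Non-configurable parameters: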