What is the Scale confidence score?

The Scale confidence score is based on four areas we have found to be critical for most enterprise use cases:

  • Accuracy - How close your application gets to a correct answer that addresses the user’s prompt
  • Retrieval (only for applications using RAG) - How well your retrieval system surfaces relevant content, and how well your application uses that content in its response
  • Quality - Whether your application produces professional, well-written outputs
  • Safety - Whether your application is prone to producing harmful or offensive content on standard or adversarial inputs

While accuracy is the bottom-line metric for many applications, we have found that retrieval is also critical for understanding why an application is underperforming or hallucinating, and that quality and safety are critical for building user trust. By including all of these together, the Scale confidence score provides a holistic view of how well your application serves users.

We use two metrics per category to measure how well your application does. Each metric is run on every test case in the evaluation dataset you provide, and the results are averaged to produce a score:

Accuracy

  • Answer correctness: How close outputs are to ground truths
  • Answer relevance: How well outputs address the user’s query

Retrieval

  • Faithfulness: How much output content is attributed to the context
  • Context recall: How much of the ground truth is in the retrieved context

Quality

  • Grammar: How grammatically correct the output is
  • Coherence: How well the statements flow together

Safety

  • Safety: Whether outputs are harmful on adversarial inputs
  • Moderation: Whether outputs are harmful on the evaluation dataset

The scores for the two metrics in a category are averaged to produce four category scores, which are then combined into the Scale confidence score.
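
To make the averaging concrete, here is a minimal Python sketch; the dictionary keys and function names are hypothetical, and each metric is assumed to be scored per test case on a 0-100 scale:

    # Hypothetical sketch: each metric is scored per test case on a 0-100 scale,
    # averaged into a metric score, and the two metric scores are averaged per category.
    CATEGORY_METRICS = {
        "accuracy": ["answer_correctness", "answer_relevance"],
        "retrieval": ["faithfulness", "context_recall"],
        "quality": ["grammar", "coherence"],
        "safety": ["safety", "moderation"],
    }

    def mean(values):
        return sum(values) / len(values)

    def category_scores(per_test_case_scores):
        """per_test_case_scores maps each metric name to its list of per-test-case scores."""
        metric_scores = {m: mean(s) for m, s in per_test_case_scores.items()}
        return {
            category: mean([metric_scores[m] for m in metrics])
            for category, metrics in CATEGORY_METRICS.items()
        }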

How is the Scale confidence score calculated?

We calculate the Scale confidence score in two steps:

1. Normalization: First, we normalize the category scores, penalizing low scores more heavily and boosting higher ones. This reweighting is based on the simple intuition that an app that is 20% accurate likely has almost the same utility as an app that is 10% accurate, whereas an app that is 80% accurate has almost the same utility as an app that is 90% accurate.

To achieve this normalization, we apply a sigmoid function (with k set to 5) to each category score.
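
A minimal sketch of this step, assuming a logistic curve centered at the midpoint of the 0-100 score range and rescaled so that 0 maps to 0 and 100 maps to 100 (the centering and rescaling are assumptions):

    import math

    K = 5  # steepness: larger k penalizes low scores and rewards high scores more sharply

    def normalize(score, k=K):
        """Sigmoid-normalize a 0-100 category score (assumed logistic form)."""
        def sigmoid(x):
            return 1 / (1 + math.exp(-k * (x - 0.5)))
        # Rescale so that a raw score of 0 stays 0 and a raw score of 100 stays 100.
        lo, hi = sigmoid(0.0), sigmoid(1.0)
        return 100 * (sigmoid(score / 100) - lo) / (hi - lo)

Under these assumptions, a raw category score of 10 normalizes to roughly 5, while a raw score of 80 normalizes to roughly 87, which is what makes a single weak category so costly in the composite.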

2. Aggregation: After normalization, we combine the four category scores into a composite Scale confidence score based on their distance from an ideal application that scores 100 in every category. For example:

  • An application that received a 10 in one category but 100s in all others would receive a Scale confidence score of 52
  • An application that received 80s in all categories would receive a Scale confidence score of 87

As you can see, being deficient in one category hurts the overall score significantly, while having no weak points gives the score a slight boost.
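
To illustrate, here is a minimal end-to-end sketch that combines the assumed normalization above with a Euclidean distance from the ideal application, divided by the largest possible distance so the result stays on a 0-100 scale (this scaling is also an assumption):

    import math

    K = 5  # same steepness assumption as in the normalization sketch above

    def normalize(score, k=K):
        """Rescaled logistic normalization of a 0-100 category score (assumed form)."""
        def sigmoid(x):
            return 1 / (1 + math.exp(-k * (x - 0.5)))
        lo, hi = sigmoid(0.0), sigmoid(1.0)
        return 100 * (sigmoid(score / 100) - lo) / (hi - lo)

    def confidence_score(raw_category_scores):
        """Combine four 0-100 category scores into a single composite score."""
        normalized = [normalize(s) for s in raw_category_scores]
        # Euclidean distance from the ideal application (100 in every category),
        # divided by the largest possible distance so the result stays on a 0-100 scale.
        distance = math.sqrt(sum((100 - n) ** 2 for n in normalized))
        max_distance = 100 * math.sqrt(len(normalized))
        return 100 * (1 - distance / max_distance)

    print(confidence_score([10, 100, 100, 100]))  # ~52.6 (documented example: 52)
    print(confidence_score([80, 80, 80, 80]))     # ~87.4 (documented example: 87)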

You can see more examples of what this looks like below (each point is a different combination of category scores):

Overall Scale confidence score plotted against different category score combinations (3 out of 4 category scores were chosen for visualization purposes)
