
Version 25.2.1


Bug Fixes

Text Overflow Issues

The “Inline LaTeX” toggle on the test case results page now works as expected. Additionally, test cases containing long words break in-line instead of overflowing in the test case results page and annotation UI.

Table CSV Export

The task dashboard and contributor metrics tables now export properly formatted headers. Additionally, the “Time Spent” columns are now exported in HH:MM:SS format instead of seconds.
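
As an illustration of the new “Time Spent” format, the conversion from raw seconds to HH:MM:SS can be sketched as follows. The function name `format_hms` is hypothetical and is not part of the platform. The sketch also assumes hours are not wrapped at 24, which the release note does not specify.

```python
def format_hms(total_seconds: int) -> str:
    """Format a non-negative duration in seconds as HH:MM:SS."""
    if total_seconds < 0:
        raise ValueError("duration must be non-negative")
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    # Assumption: hours are not capped at 24, so long durations stay in one field.
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(format_hms(3725))  # 01:02:05
```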

Application Variant Creation

External variants are now filtered out from selection in the “Start from existing variant” dropdown when creating a new variant from the application page. This option is completely hidden if all variants are external.

Annotation UI Configuration

Users can now configure the annotation UI for flexible datasets without selecting “Contributor” as an evaluator.

Empty Columns

Columns that previously had an empty title in the “Customize Columns” popovers now display the actual column header.


Version 25.2.0


New Features

Flexible Evaluations Through the UI

Users can now configure and run flexible evaluations through the UI. This also means that all evaluations can be done through the UI. See below for how to run a flexible evaluation through the UI.

Upload Flexible Datasets and Outputs

First, navigate to the Evaluation Dataset page in order to create a dataset.

Next, use the modal to upload a flexible dataset. Be sure to name the dataset and select the type as Flexible.

Next, navigate to the Application Variant page to upload outputs for the dataset. Select Upload Outputs.

Choose the flexible dataset that was uploaded and upload the corresponding outputs for that dataset. The modal will automatically show the matching columns between your flexible dataset and your outputs. You need to have at least one matching column header in order to upload outputs through the UI.
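
To check an outputs file before uploading, the matching-column requirement can be approximated locally. This is a minimal sketch that assumes headers match by exact, case-sensitive name. The platform's actual matching rules are not described here, and the function names are hypothetical.

```python
def matching_columns(dataset_headers, output_headers):
    """Return the headers present in both lists, in dataset order."""
    output_set = set(output_headers)
    # Assumption: a match means an identical, case-sensitive header string.
    return [header for header in dataset_headers if header in output_set]


def can_upload(dataset_headers, output_headers):
    """At least one matching column header is required for upload."""
    return len(matching_columns(dataset_headers, output_headers)) >= 1


dataset = ["id", "question", "context"]
outputs = ["id", "answer"]
print(matching_columns(dataset, outputs))  # ['id']
print(can_upload(dataset, outputs))        # True
```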

The outputs will show up here.

Configure the Annotation Configuration and Run Evaluation

After uploading the dataset and outputs, users can run an evaluation for flexible datasets through the UI.

Navigate to the Evaluations UI and select the application, dataset, and question set. Note that this step is mandatory for flexible evaluation dataset schemas.

If you choose to run a Contributor evaluation, you can now configure the layout of what the human annotator will see when they run the evaluation.

Select Multiple Metrics Calculated with Different Columns from the Dataset

Users can now select the same evaluation metric several times in one evaluation run and compare the results. For example, users can select multiple BLEU scores and indicate which data columns are used to compute each one.
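
Conceptually, each selected metric pairs a scoring function with the columns it reads. The sketch below illustrates that idea only. `MetricSpec`, the column names, and the simple `token_overlap` score (a stand-in used instead of BLEU to keep the example self-contained) are all assumptions, not the platform's configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    label: str          # display name for this metric instance
    candidate_col: str  # column holding the generated text
    reference_col: str  # column holding the reference text


def token_overlap(candidate: str, reference: str) -> float:
    """Stand-in score (not BLEU): share of candidate tokens found in the reference."""
    candidate_tokens = candidate.lower().split()
    reference_tokens = set(reference.lower().split())
    if not candidate_tokens:
        return 0.0
    hits = sum(token in reference_tokens for token in candidate_tokens)
    return hits / len(candidate_tokens)


def score_row(row: dict, specs: list) -> dict:
    """Compute the same metric once per spec, each on its own column pair."""
    return {
        spec.label: token_overlap(row[spec.candidate_col], row[spec.reference_col])
        for spec in specs
    }


row = {
    "summary": "the cat sat",
    "expected_summary": "the cat sat down",
    "translation": "le chat",
    "expected_translation": "le chien",
}
specs = [
    MetricSpec("overlap_summary", "summary", "expected_summary"),
    MetricSpec("overlap_translation", "translation", "expected_translation"),
]
scores = score_row(row, specs)
print(scores)  # {'overlap_summary': 1.0, 'overlap_translation': 0.5}
```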

Minor Features

All Columns in the Dataset Schema Are Available on the Task Dashboard as Hidden Columns

Users can now search for task-related data and filter the rows on the task dashboard based on the data in the test case of the relevant dataset. This simplifies the process of finding test cases based on their content in the annotation task dashboard.

Task Queue and Contributor Metrics Can be Exported as CSV

Users can now export the Task Queue and Contributor Metrics tables as CSV.

Customizable Filters For Columns in Contributor Metrics

The Contributor Metrics table in the contributor task dashboard now has customizable column filters.

Bug Fixes

Can now filter by “Needs Review” column in the Task Dashboard

Fixed an issue where the “Needs Review” column filter in the task dashboard was not working. Users can now filter by that column.

Can now clear numeric values in annotation view

Fixed a bug where numerical question types could not be cleared by annotators.

Fixed Test Case Results Page for Hybrid Evals

Fixed an issue where Autoevaluation results were not showing up on the test case results view for Hybrid Evals if the Manual evaluation was not yet done.

Test Case Results View Works with LaTeX

Inline LaTeX formatting now renders in the test case results view.

Reasoning and Prompt Hidden for Human Evaluations

Since reasoning and prompt only apply to autoevaluations, they now show up only for autoevaluations. Previously, they showed up as empty for Human Evaluations.


Version 25.1.3


Bug Fix

Flexible Evaluations

Variables now properly show up in both the auto evaluation prompt editor and the metrics configuration dropdown selectors for fully flexible datasets in ReBAC-enabled deployments.


Version 25.1.2


Bug Fix

PII Attestation

Fixed a state-saving issue with the PII attestation confirmation modal.


Version 25.1.1


New Features

PII Attestation

Deployments can enable an acknowledgement that requires users to confirm that the data they are uploading does not contain PII or MNPI.


Version 25.1.0


New Features

Select Metrics to Calculate When Configuring Evaluation Runs

Users can now select preconfigured metrics to calculate during an evaluation in addition to the questions answered on the rubric.

  1. Currently, users can select from BLEU, ROUGE, METEOR, and Cosine Similarity.

  2. After running an evaluation, the selected metrics will show up in the metrics tab for evaluation results.

  3. The selected metrics will also show up in the table.
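
For readers unfamiliar with the listed metrics, cosine similarity can be illustrated on simple word-count vectors. This is only a sketch of the general technique. How the platform vectorizes text for this metric is not stated in these notes, so whitespace tokenization and lowercasing are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(round(cosine_similarity("the quick fox", "the slow fox"), 3))  # 0.667
```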

Expanded Auto Evaluation Capabilities

Auto Evaluation Support for all Flexible Evaluations

Users can now run auto evaluations for any flexible dataset. The prompt template is fully customizable and allows users to insert any variables defined in the dataset. Variables for the dataset will show up on the side of the prompt template.
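
The idea of inserting dataset variables into a prompt template can be sketched with Python's standard `string.Template`. The `$variable` placeholder syntax and the column names (`question`, `expected_output`, `output`) are assumptions for illustration. The platform's own template syntax may differ.

```python
from string import Template

# Hypothetical prompt; each $name refers to a column defined in the dataset.
PROMPT_TEMPLATE = Template(
    "You are grading a model answer.\n"
    "Question: $question\n"
    "Expected answer: $expected_output\n"
    "Model answer: $output\n"
    "Reply with a score from 1 to 5."
)


def render_prompt(template: Template, test_case: dict) -> str:
    """Fill the template from one test case.

    substitute() raises KeyError if the template references a variable
    that the test case does not define.
    """
    return template.substitute(test_case)


test_case = {
    "question": "What is 2+2?",
    "expected_output": "4",
    "output": "four",
}
print(render_prompt(PROMPT_TEMPLATE, test_case))
```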

Modify Batch Job Size and Hyperparameters

Users can now modify the batch job size and hyperparameters when configuring autoevaluations.

Other Improvements to the evaluations workflow

Tooltip explaining human versus LLM evaluators

Added a tooltip that shows the difference between human and LLM evaluators.

Show tooltip for search matches in a hidden column

When a user searches for a keyword that has a match in a hidden column, a tooltip will let the user know that the search result is hidden.

Show Modal with LLM Used and Prompt

Clicking on LLM for completed auto evaluation results will open a modal that shows the LLM used and the prompt used for the autoevaluation.

Maintain Consistent use of Test Case ID

The term “test case ID” is now used consistently across the annotation task dashboard and the evaluation results page, so IDs can easily be used to compare and match tasks.

Other Minor UI Fixes

Ability to edit application name, application description, variant name and description on the UI

Users are able to edit the variant name and description in the UI by clicking on the edit button.

Tasking Improvements

Bulk Assign Users to Tasks

The platform now has the ability to bulk assign tasks to labelers.

Filter Task Dashboard Based On Responses to Evaluation Questions

Users now have the ability to see the responses to evaluation questions on the task dashboard. These columns are hidden by default, but can be shown and used to filter the table.

Bug Fixes

Wrong timestamps for “Updated At” in task dashboard

Fixed an issue that populated “updated at” for all tasks of an auto evaluation with the time the overall auto-evaluation run finished. Now, each task shows the correct time it was last updated during an auto-evaluation run.

Contributor metrics aggregation shows correct amount of time for auto evaluation to be run

Fixed an issue where the time taken by the LLM evaluator was aggregated incorrectly. Because the LLM evaluator executes tasks in parallel, the total time for the evaluation run equals the actual runtime of the model rather than the sum of the individual task times (which is the case for contributor evaluators).
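
The difference between the two aggregations can be shown with a small sketch. It assumes each task is recorded as a (start, end) pair and that the "actual runtime" of a parallel run is the span from the earliest start to the latest end. Both function names are hypothetical.

```python
from datetime import datetime, timedelta


def sequential_total(tasks):
    """Sum of per-task durations, as used for contributor evaluators."""
    return sum((end - start for start, end in tasks), timedelta())


def parallel_wall_clock(tasks):
    """Elapsed time from the earliest start to the latest end of a parallel run."""
    if not tasks:
        return timedelta()
    earliest_start = min(start for start, _ in tasks)
    latest_end = max(end for _, end in tasks)
    return latest_end - earliest_start


t0 = datetime(2025, 1, 1, 12, 0, 0)
tasks = [
    (t0, t0 + timedelta(seconds=30)),
    (t0, t0 + timedelta(seconds=45)),
    (t0 + timedelta(seconds=10), t0 + timedelta(seconds=40)),
]
print(sequential_total(tasks).total_seconds())     # 105.0
print(parallel_wall_clock(tasks).total_seconds())  # 45.0
```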

Show number of rows in the evaluation set

Fixed an issue where the number of test cases in a given dataset was showing 0 for all datasets.

Consistent sorting algorithm on tables

The content of tables is now sorted by special characters first and alphabetically second.
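
One way to picture this ordering is a two-part sort key. The sketch assumes that a "special character" entry is one whose first character is not alphanumeric and that alphabetical comparison ignores case. Neither detail is specified in this release note.

```python
def table_sort_key(value: str):
    """Sort key: entries starting with a special character first, then alphabetical."""
    starts_alphanumeric = value[:1].isalnum()
    # False sorts before True, so special-character entries come first.
    return (starts_alphanumeric, value.casefold())


rows = ["beta", "_draft", "Alpha", "#tmp"]
print(sorted(rows, key=table_sort_key))  # ['#tmp', '_draft', 'Alpha', 'beta']
```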

Table filters are displayed properly (overflow issue fix)

Fixed an overflow issue in the CSS so that table filters now show up on top of the table properly.

Custom Prompts are Properly Executed

Fixed a caching issue in our custom prompt store for autoevaluations. Now all custom prompts are saved upon configuration and executed for autoevaluations.


Version 24.12.0


New Features

Auto Evaluation Support for Summarization and Translation Use Cases

Auto Evaluations are now available for summarization and translation use cases on the platform! Users can kick off an auto evaluation for summarization and translation evaluations through either the UI or the SDK. Users can modify the prompt for autoevaluations, and then see the prompt and reasoning in the results page for the autoevaluation.

  1. Kick off an auto evaluation.

  2. Configure the auto evaluation. Choose model and edit prompt.

  3. View Autoevaluation results in the test case results view.

  4. View Autoevaluation results in the annotations view.

  5. For more information on how to kick off Auto-evaluations, check out our documentation.

Save Progress While Doing Annotations

Users can now save their progress on a task and come back to it. The timer will only count the time they worked on the task towards the total time of the task. When the user resumes a task they have saved, they will be able to see the questions they’ve already answered. If another user picks up the task, they will not be able to see any of the original user’s progress.

Task Dashboard Improvements

  • On the task dashboard, users now have the ability to search for tasks.
  • Users can also filter by the user, task ID, date, and time for each column.
  • Users can download and export contributor metrics.
  • Users can filter the contributor metrics table by submissions, contributor, tasks fixed, time spent, and efficiency.

Application Variant Page Improvements

The Variant Description now shows up on the Application Variant Page.

Annotation UI Improvements

We launched a series of improvements to the annotations UI for our annotators.

  • Checkmark icons will now only show up for completed questions. Previously, green and gray checkmarks were shown on all questions, with green indicating that the question was completed. That UX was challenging for color-blind users.

  • The UX for the previous and next questions has been improved. The “Previous” button will only show up if the current question is not the first question. The “Next” button will only show up if the current question is not the last question.

  • URLs in the annotation view can now be clicked on.

Alerting and Paging

We have set up a healthcheck alert in SGP Azure VPC environments to enable quick turnaround for any service downtime. Upon failure of the healthcheck, an alert rule will trigger a PagerDuty action group, which can be connected to both Scale and customer PagerDuty services.

Bug Fixes

Application Variant Shows Up After Creation

After creating an application variant, it now shows up immediately without refreshing.

Search works for Raw View in Annotations View

Search now works in the Raw View of the Annotations view.


Version 24.11.3


New Features

OpenAI MSAL support

VPC deployments can now use the MSAL confidential client application authorization flow for OpenAI proxy/gateway credentials.

Bug Fixes

Audit view state updates

Tasks completed by the current user are now auditable through the Task Dashboard without requiring a browser refresh.


Version 24.11.2


Bug Fixes

Task Dashboard state updates

Updates to the current user’s in-progress and completed tasks are now properly synced on the Task Dashboard without requiring a browser refresh.

Evaluation run comparison

The “Evaluation Comparison” page for an evaluation run no longer surfaces itself as a comparison option.


Version 24.11.1


Bug Fixes

Evaluation run table view

Evaluations that use rubrics with a “Free Text” question type now render the table view without error.

Task ID rendering

The ID rendered on the annotation view (from the “Start Labeling” flow) now properly reflects the value in the “Test Case ID” column on the task dashboard. Additionally, the truncated IDs on the task dashboard are now searchable using the fully qualified ID in browser search.


Version 24.11.0


New Features

Aggregate Metrics View

Users now have the ability to compare application performance across all variants inside the SGP platform through the Aggregate Metrics View.

Users can access this feature by navigating to an application and selecting “Show Metrics” on the actions tab above.

This view allows you to see and compare the performance of all variant runs that were run across the same dataset and rubric.

Users can also choose to view in a Table View by toggling the button from Graph to Table. In the table view, users can export the data by selecting the “Export as CSV” Button.

Assign Tasks to Specific Labelers in UI

After an evaluation is kicked off, users with admin permissions will be able to assign tasks to contributors through the UI on the task dashboard!

Removed Unused Filter Options from Annotation Project Overview

Removed unused options for “Studio” and “Scale GP” to filter Annotation projects, which were artifacts of previous versions of the platform. This avoids user confusion.

Version Numbering

SGP will now display the software version in the platform UI. You can see the version by opening the bottom left context menu.

Bug Fixes

Claimed Tasks not refreshed

Previously, a claimed task did not show as claimed in the task dashboard until the page was refreshed. With this fix, the task shows as claimed immediately after the button is clicked.

Task Dashboard Loading Time

The Task Dashboard previously had long loading times if more than a few hundred tasks were included in a project. After optimizing frontend queries and introducing pagination, loading times have been reduced significantly.


Version 0.9.1


New Features

1. End-to-End Annotation Functionality in Scale GP

Users now have the ability to conduct human-expert evaluations entirely within Scale GP, eliminating the need to switch to Scale Studio as previously required. This capability is made possible through the implementation of a new task queuing backend and a user interface designed to facilitate the viewing and initiation of evaluation tasks. Features allowing users to prioritize annotation projects and assign annotators to those projects are scheduled for release in the near future.

To facilitate seamless annotation workflows, Scale GP offers role-based access control with three user roles: Admin, Manager and Annotator. Managers and Admins will be able to interact with the entire Scale GenAI Platform and manage evaluation runs, including assigning annotation projects and priorities, while annotators (typically SMEs) can focus entirely on the evaluation tasks.
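
The role split described above can be pictured as a role-to-permission mapping. This is a conceptual sketch only. The permission names are hypothetical, and because these notes do not state how Admin and Manager differ, both roles are given the same set here.

```python
from enum import Enum


class Role(Enum):
    ADMIN = "Admin"
    MANAGER = "Manager"
    ANNOTATOR = "Annotator"


# Hypothetical permission names; the platform's real permission model is not documented here.
PLATFORM_WIDE = frozenset(
    {"manage_evaluation_runs", "assign_annotation_projects", "set_priorities", "annotate"}
)

PERMISSIONS = {
    Role.ADMIN: PLATFORM_WIDE,
    Role.MANAGER: PLATFORM_WIDE,  # assumption: same as Admin for this sketch
    Role.ANNOTATOR: frozenset({"annotate"}),
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if the role's permission set contains the action."""
    return action in PERMISSIONS[role]


print(is_allowed(Role.MANAGER, "assign_annotation_projects"))    # True
print(is_allowed(Role.ANNOTATOR, "assign_annotation_projects"))  # False
```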

How to Use

To get started, users can switch to the new “Annotation” tab on the sidebar and view currently active annotation projects that are ready for labeling. The view also allows users to view legacy annotation projects, which are executed using the external Scale Studio queuing system.

2. Knowledge Base Management

Users can now create and manage their Scale GP knowledge bases via the UI. This makes it significantly easier to monitor the assets uploaded in a knowledge base and to track longer uploads as they progress. It also enables users to quickly find and use a relevant knowledge base in the AI playground or when building a custom application with our SDK.

How to Use

To get started, navigate to the “Knowledge Bases” tab on the left to see your existing resources. If there are none, you can create a new knowledge base via the API or directly in the UI by uploading assets from your local device. We will soon follow up with automated ingest from data sources like S3 and Azure Blob storage.

AI Playground

Users can now seamlessly build sample GenAI applications in Scale GP without writing a single line of code. The AI playground provides application templates for typical GenAI use cases such as Retrieval Augmented Generation (RAG), which can be configured in a no-code setup flow, selecting any of the knowledge bases and models that are available on the platform. After completing a setup, users can easily export the sample application for further customization or deployment.

How to Use

To build your first application, switch to the “Playground” tab and begin by selecting a template to configure, then follow the instructions and experiment with components until the desired outcome is achieved.

Self-Service Model Fine Tuning

How to Use

To get started with fine tuning your own model, you can click “Create New Fine-Tune” on the “Models” page and follow the steps, or you can use our API or SDK. Note that you will need a correctly formatted training dataset to create a new fine-tune. Please refer to the API documentation for further guidance on dataset format. After creating a new fine-tuning job, you can monitor the progress on the “Fine Tuning Jobs” page. When a fine tune is completed, it will automatically appear as a new model on the “Models” page.


More Efficient Annotation Interface

Restructured the annotation view to show generated and expected output, as well as context side by side, which helps annotators compare results to ground truth more efficiently. We will also soon be adding functionality to better navigate retrieved context chunks and a simplified flow to copy sections from context into the evaluation form.

How to Use

To try the new evaluation layout, you can go to “Annotation” and start labeling any “Scale GP” project.

Reports for Evaluation Results and Annotator Performance

Users are now able to see metrics and charts for evaluation runs, which makes it significantly easier to analyze results, track progress and identify regressions. We currently support charts and metrics for each individual evaluation run, as well as for the collection of all evaluation runs of a single application over time. Admin and manager users can also download reports for further analysis and record keeping.

How to Use

To find evaluation reports, you can either go to a single “Evaluation Run” and switch to the “Metrics” tab, or go to an “Application Detail” page and switch to the “Metrics” tab.

Other Updates

Cloud Platform Deployment

Scale GP can now be fully privately deployed in single-tenant environments for both Microsoft Azure and Amazon Web Services.