When datasets are uploaded, the data they contain is parsed for input type. This enables tables to handle sorting based on the type of input. For example, numerical inputs can now be sorted by numeric value instead of string value.
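As a quick illustration of why type-aware parsing matters for sorting (hypothetical values, not platform code):

```python
# Minimal sketch: string sorting vs. numeric sorting of the same column values.
values = ["9", "100", "23"]

print(sorted(values))             # ['100', '23', '9']  (lexicographic order)
print(sorted(values, key=float))  # ['9', '23', '100']  (numeric order)
```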
Annotators can now edit tasks that have been completed prior to the tasks being audited. Annotators will not be able to edit tasks that have been audited.
Click Edit on a completed task. This will take users to a page where they can edit previously completed tasks.
The platform now supports two stages of auditing. Each audit stage must be done by different users. A user will only be able to perform a first stage audit after a task is complete. A user will only be able to perform a second stage audit after another auditor has completed the first stage audit.
Managers can now assign users to be auditors for a certain task. A user cannot be assigned as both an L1 and an L2 auditor for a specific task. Just like the task queue for annotators, there is a general audit queue and a queue for each individual auditor. When a user is assigned a task for auditing, the task is added to their personal queue. However, a task is only eligible for an L1 audit once it has been labeled, and for an L2 audit once it has been audited once already. When a user selects “Start Auditing”, eligible tasks are taken from their personal queue first; if their personal queue is empty, eligible tasks are taken from the general queue.
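A minimal sketch of the queueing rules described above, assuming illustrative task statuses and queue structures rather than the platform's actual data model:

```python
# Hypothetical sketch of the audit-queue selection rules; names are illustrative.
def next_task_for_auditor(personal_queue, general_queue, auditor_level):
    """Return the next auditable task, preferring the auditor's personal queue."""
    def eligible(task):
        if auditor_level == "L1":
            return task["status"] == "labeled"        # L1 audit needs a completed label
        return task["status"] == "audited_once"       # L2 audit needs a completed L1 audit

    for queue in (personal_queue, general_queue):
        for task in queue:
            if eligible(task):
                queue.remove(task)
                return task
    return None  # nothing eligible right now
```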
The aggregate metrics view for multiple evaluation runs can now aggregate rubrics and metrics across many evaluation runs at once for a given time period. It also lets users view numerical metric aggregates from multiple evaluation runs in several ways: users can select between Mean, Median, and Aggregate to determine how the result of each metric is shown across all evaluations.
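As a hedged illustration of the aggregation options, here is how mean and median differ over hypothetical per-run metric values, using Python's statistics module as a stand-in for the platform's aggregation:

```python
import statistics

runs = {"run_a": 0.82, "run_b": 0.74, "run_c": 0.91}  # hypothetical per-run scores

print(statistics.mean(runs.values()))    # 0.823...
print(statistics.median(runs.values()))  # 0.82
```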
When configuring a question, users can set a default value for it. If the human or LLM annotator leaves the question blank, the default value is used. This default value is also used in all aggregate displays and calculations.
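A small sketch, with hypothetical answer values, of how a configured default could be substituted for blank answers before aggregation:

```python
# Illustrative only: None represents a question the annotator left blank.
DEFAULT_VALUE = 3  # e.g. a neutral rating configured for the question

answers = [5, None, 4, None, 2]
filled = [a if a is not None else DEFAULT_VALUE for a in answers]

print(sum(filled) / len(filled))  # aggregates include the default, here 3.4
```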
Users can set a character limit for text questions when configuring the question. Annotators will not be able to type more than this character limit when answering the question.
The annotation task list is now automatically refreshed after a task is completed, so the status of tasks is updated without the user having to refresh manually.
The “Inline LaTeX” toggle on the test case results page now works as expected. Additionally, test cases containing long words break in-line instead of overflowing in the test case results page and annotation UI.
The task dashboard and contributor metrics tables now export properly formatted headers. Additionally, the “Time Spent” columns are now exported in HH:MM:SS format instead of seconds.
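For reference, a minimal sketch of the seconds-to-HH:MM:SS conversion now applied to exported “Time Spent” columns (illustrative only):

```python
def format_time_spent(seconds: int) -> str:
    """Convert a duration in seconds to HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

print(format_time_spent(4271))  # "01:11:11"
```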
External variants are now filtered out from selection in the “Start from existing variant” dropdown when creating a new variant from the application page. This option is completely hidden if all variants are external.
Users can now configure and run flexible evaluations through the UI. This also means that all evaluations can be done through the UI. See below for how to run a flexible evaluation through the UI.
1. Navigate to the Evaluation Dataset page to create a dataset.
2. Use the modal to upload a flexible dataset. Be sure to name the dataset and select the type as Flexible.
3. Navigate to the Application Variant page to upload outputs for the dataset. Select Upload Outputs.
4. Choose the flexible dataset that was uploaded and upload the corresponding outputs for that dataset. The modal will automatically show the matching columns between your flexible dataset and your outputs. You need at least one matching column header to upload outputs through the UI (see the sketch below).
5. The outputs will then show up on the Application Variant page.
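The column-matching requirement in step 4 can be illustrated with a small sketch (the column names are hypothetical, not required headers):

```python
# Illustrative check: outputs can only be uploaded if at least one column header
# matches between the flexible dataset and the outputs file.
dataset_columns = {"prompt", "context", "expected_answer"}
output_columns = {"prompt", "generated_answer"}

matching = dataset_columns & output_columns
if not matching:
    raise ValueError("Outputs must share at least one column header with the dataset.")
print(f"Matching columns used to join outputs: {sorted(matching)}")
```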
Configure the Annotation Configuration and Run Evaluation
After uploading the dataset and outputs, users can now also run an evaluation through the UI for flexible datasets. Navigate to the Evaluations UI and select the application, dataset, and question set. Note that this step is mandatory for flexible evaluation dataset schemas. If you choose to run a Contributor evaluation, you can now configure the layout of what the human annotator will see when they run the evaluation.
Select Multiple Metrics Calculated with Different Columns from the Dataset
Users can now select several instances of the same evaluation metric to compare. For example, users can select multiple BLEU scores for each evaluation run and indicate which data columns are used to compute each BLEU score.
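A hedged sketch of computing the same metric from different column pairs, using sacrebleu as a stand-in for the platform's BLEU implementation and hypothetical column names:

```python
import sacrebleu

rows = [
    {"reference_a": "the cat sat on the mat",
     "reference_b": "a cat is sitting on the mat",
     "model_output": "the cat is on the mat"},
]

# Compute BLEU twice, once per reference column, against the same model outputs.
for ref_col in ("reference_a", "reference_b"):
    refs = [[row[ref_col] for row in rows]]
    hyps = [row["model_output"] for row in rows]
    score = sacrebleu.corpus_bleu(hyps, refs)
    print(f"BLEU vs {ref_col}: {score.score:.1f}")
```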
All Columns in the Dataset Schema Are Available on the Task Dashboard as Hidden Columns
Users can now search for task related data and filter the rows on the task dashboard based on the data in the test case of the relevant dataset. This simplifies the process of finding test cases based on their content in the annotation task dashboard.
Fixed an issue where Autoevaluation results were not showing up on the test case results view for Hybrid Evals if the Manual evaluation was not yet done.
Since reasoning and prompt only apply to autoevaluations, they now only show up for autoevaluations. Previously, they showed up as empty fields for Human Evaluations.
Variables now properly show up in both the auto evaluation prompt editor and metrics configuration drop down selectors for fully flexible datasets in ReBAC enabled deployments.
Auto Evaluation Support for all Flexible Evaluations
Users can now run auto evaluations for any flexible dataset. The prompt template is fully customizable and allows users to insert any variables defined in the dataset. Variables for the dataset will show up on the side of the prompt template.
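As a rough illustration (the platform's actual template syntax may differ), substituting dataset variables into a customizable prompt template might look like this:

```python
from string import Template

# Hypothetical grading prompt; $document and $summary stand in for dataset variables.
template = Template(
    "You are grading a summary.\n"
    "Source document: $document\n"
    "Candidate summary: $summary\n"
    "Rate the summary from 1 to 5 and explain your reasoning."
)

row = {"document": "…full source text…", "summary": "…model summary…"}
print(template.substitute(row))
```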
“Test case ID” is now used consistently across the annotation task dashboard and the evaluation results page, so IDs can easily be used to compare and match tasks.
Filter Task Dashboard Based On Responses to Evaluation Questions
Users now have the ability to see the responses to evaluation questions on the task dashboard. These columns are hidden by default, but can be shown and used to filter the table.
Wrong timestamps for “Updated At” in task dashboard
Fixed an issue that populated “updated at” for all tasks of an auto evaluation with the time the overall auto-evaluation run finished. Now, each task shows the correct time it was last updated during an auto-evaluation run.
Contributor metrics aggregation shows the correct amount of time for auto evaluation runs
Fixed an issue where the time taken by the LLM evaluator was aggregated incorrectly. Because the LLM evaluator executes tasks in parallel, the total time for the evaluation run now equals the actual wall-clock runtime of the model instead of the sum of the individual tasks (which is how contributor evaluations are aggregated).
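To illustrate the distinction with hypothetical timestamps:

```python
# Hypothetical (start, end) timestamps in seconds for tasks run in parallel.
tasks = [
    (0.0, 4.0),
    (0.5, 3.5),
    (1.0, 5.0),
]

sum_of_tasks = sum(end - start for start, end in tasks)                        # 11.0 s (contributor-style sum)
wall_clock = max(end for _, end in tasks) - min(start for start, _ in tasks)   # 5.0 s (parallel LLM evaluator)
print(sum_of_tasks, wall_clock)
```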
Fixed a caching issue in our custom prompt store for autoevaluations. Now all custom prompts are saved upon configuration and executed for autoevaluations.
Auto Evaluation Support for Summarization and Translation Use Cases
Auto Evaluations are now available for summarization and translation use cases on the platform! Users can kick off these auto evaluations through either the UI or the SDK.
Users can modify the prompt for autoevaluations, and then see the prompt and reasoning in the results page for the autoevaluation.
Kick off an auto evaluation.
Configure the auto evaluation. Choose model and edit prompt.
View Autoevaluation results in the test case results view
View Autoevaluation results in the annotations view
For more information on how to kick off Auto-evaluations, check out our documentation.
Users can now save their progress on a task and come back to it. The timer will only count the time they worked on the task towards the total time of the task. When the user resumes a task they have saved, they will be able to see the questions they’ve already answered.
If another user picks up the task, they will not be able to see any of the original user’s progress.
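A simplified sketch of the save-and-resume behavior, assuming illustrative session records rather than the platform's actual data model:

```python
from collections import defaultdict

# (user, seconds of active work on the task); only active time counts toward the total,
# and progress is tracked per user, so a different annotator starts fresh.
sessions = [
    ("alice", 300),  # first session, then saved
    ("alice", 120),  # resumed later
    ("bob", 200),    # another user picking up the task sees none of alice's answers
]

time_per_user = defaultdict(int)
for user, seconds in sessions:
    time_per_user[user] += seconds

print(dict(time_per_user))  # {'alice': 420, 'bob': 200}
```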
We launched a series of improvements to the annotations UI for our annotators.
Checkmark icons now only show up for completed questions. Previously, we showed green vs. gray checkmarks on all questions, with green indicating that the question was completed, which was challenging for color-blind users.
The UX for the previous and next questions has been improved. The “Previous” button will only show up if the current question is not the first question. The “Next” button will only show up if the current question is not the last question.
URLs in the annotation view are now clickable.
We have set up a healthcheck alert in SGP Azure VPC environments to enable quick turnaround for any service downtime. Upon failure of the healthcheck, an alert rule will trigger a PagerDuty action group, which can be connected to both Scale and customer PagerDuty services.
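For context, a minimal sketch of the kind of HTTP health probe such an alert rule depends on; the endpoint below is hypothetical, and the actual alerting is configured in Azure (alert rule to action group to PagerDuty) rather than in application code:

```python
import requests

def is_healthy(url: str = "https://sgp.example.internal/health") -> bool:
    """Return True if the service health endpoint responds with HTTP 200."""
    try:
        return requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        return False

print(is_healthy())
```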
The ID rendered on the annotation view (from the “Start Labeling” flow) now properly reflects the value in the “Test Case ID” column on the task dashboard. Additionally, the truncated IDs on the task dashboard are now searchable using the fully qualified ID in browser search.
Users now have the ability to compare application performance across all variants inside the SGP platform through the Aggregate Metrics View.
Users can access this feature by navigating to an application and selecting “Show Metrics” on the actions tab above.
This view allows you to see and compare the performance of all variant runs that were run against the same dataset and rubric. Users can also choose a Table View by toggling the button from Graph to Table. In the table view, users can export the data by selecting the “Export as CSV” button.
Removed Unused Filter Options from Annotation Project Overview
Removed unused options for “Studio” and “Scale GP” to filter Annotation projects, which were artifacts of previous versions of the platform. This avoids user confusion.
Previously, a claimed task would not show up as claimed in the task dashboard without a page refresh. With this fix, the task shows as claimed immediately when the button is clicked.
The Task Dashboard previously had long loading times if more than a few hundred tasks were included in a project. After optimizing frontend queries and introducing pagination, loading times have been reduced significantly.
Users now have the ability to conduct human-expert evaluations entirely within Scale GP, eliminating the need to switch to Scale Studio as previously required. This capability is made possible by a new task queuing backend and a user interface designed for viewing and starting evaluation tasks. Features allowing users to prioritize annotation projects and assign annotators to those projects are scheduled for release in the near future.

To facilitate seamless annotation workflows, Scale GP offers role-based access control with three user roles: Admin, Manager, and Annotator. Managers and Admins can interact with the entire Scale GenAI Platform and manage evaluation runs, including assigning annotation projects and priorities, while Annotators (typically SMEs) can focus entirely on the evaluation tasks (see the sketch below).

How to Use
To get started, users can switch to the new “Annotation” tab on the sidebar and view currently active annotation projects that are ready for labeling. The view also allows users to view legacy annotation projects, which are executed using the external Scale Studio queuing system.
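The three-role model can be pictured with a hypothetical permission map; the permission names are illustrative, and the platform enforces access control server-side:

```python
# Illustrative role-to-permission mapping, not the platform's actual policy definitions.
ROLE_PERMISSIONS = {
    "admin":     {"manage_platform", "manage_evaluation_runs", "assign_projects", "annotate"},
    "manager":   {"manage_evaluation_runs", "assign_projects", "annotate"},
    "annotator": {"annotate"},
}

def can(role: str, permission: str) -> bool:
    return permission in ROLE_PERMISSIONS.get(role, set())

print(can("annotator", "assign_projects"))  # False: annotators focus on evaluation tasks
```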
Users can now create and manage their Scale GP knowledge bases via the UI. This makes it significantly easier to monitor the assets uploaded in a knowledge base and to track longer uploads as they progress. It also enables users to quickly find and use a relevant knowledge base in the AI playground or when building a custom application with our SDK.
How to Use
To get started, navigate to the “Knowledge Bases” tab on the left to see your existing resources. If there are none, you can create a new knowledge base via the API or directly in the UI by uploading assets from your local device. We will soon follow up with automated ingestion from data sources like S3 and Azure Blob Storage.
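Purely as an illustration of the API path, creating a knowledge base and uploading a local asset might look like the following; the endpoints, fields, and authentication shown are hypothetical, so consult the API documentation for the actual calls:

```python
import requests

BASE = "https://api.example.com/v1"            # hypothetical base URL, not the real SGP endpoint
headers = {"Authorization": "Bearer <API_KEY>"}

# Create a knowledge base (hypothetical route and payload).
kb = requests.post(f"{BASE}/knowledge-bases", json={"name": "support-docs"}, headers=headers).json()

# Upload a local asset into it (hypothetical route).
with open("handbook.pdf", "rb") as f:
    requests.post(
        f"{BASE}/knowledge-bases/{kb['id']}/assets",
        files={"file": f},
        headers=headers,
    )
```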
Users can now seamlessly build sample GenAI applications in Scale GP without writing a single line of code. The AI playground provides application templates for typical GenAI use cases such as Retrieval Augmented Generation (RAG), which can be configured in a no-code setup flow, selecting any of the knowledge bases and models that are available on the platform. After completing a setup, users can easily export the sample application for further customization or deployment.
How to Use
To build your first application, switch to the “Playground” tab and begin by selecting a template to configure, then follow the instructions and experiment with components until the desired outcome is achieved.
How to Use
To get started with fine-tuning your own model, you can click “Create New Fine-Tune” on the “Models” page and follow the steps, or you can use our API or SDK. Note that you will need a correctly formatted training dataset to create a new fine-tune. Please refer to the API documentation for further guidance on dataset format.
After creating a new fine-tuning job, you can monitor the progress on the “Fine Tuning Jobs” page. When a fine tune is completed, it will automatically appear as a new model on the “Models” page.
Restructured the annotation view to show generated and expected output, as well as context side by side, which helps annotators compare results to ground truth more efficiently. We will also soon be adding functionality to better navigate retrieved context chunks and a simplified flow to copy sections from context into the evaluation form.
How to Use
To try the new evaluation layout, you can go to “Annotation” and start labeling any “Scale GP” project.
Reports for Evaluation Results and Annotator Performance
Users are now able to see metrics and charts for evaluation runs, which makes it significantly easier to analyze results, track progress and identify regressions. We currently support charts and metrics for each individual evaluation run, as well as for the collection of all evaluation runs of a single application over time. Admin and manager users can also download reports for further analysis and record keeping.
How to Use
To find evaluation reports, you can either go to a single “Evaluation Run” and switch to the “Metrics” tab, or go to an “Application Detail” page and switch to the “Metrics” tab.