When datasets are uploaded, the data they contain is parsed for input type. This enables tables to handle sorting based on the type of input. For example, numerical inputs can now be sorted by numeric value instead of string value.
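As a quick illustration of why type-aware parsing matters for sorting (hypothetical values, not platform code):

```python
# Minimal sketch: string sorting vs. numeric sorting of the same column values.
values = ["9", "100", "23"]

print(sorted(values))             # ['100', '23', '9']  (lexicographic order)
print(sorted(values, key=float))  # ['9', '23', '100']  (numeric order)
```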
Annotators can now edit tasks that have been completed prior to the tasks being audited. Annotators will not be able to edit tasks that have been audited.
Click Edit on a completed task. This will take users to a page where they can edit previously completed tasks.
The platform now supports two stages of auditing. Each audit stage must be done by different users. A user will only be able to perform a first stage audit after a task is complete. A user will only be able to perform a second stage audit after another auditor has completed the first stage audit.
Managers can now assign users to be auditors for a certain task. A user cannot be assigned as both an L1 and an L2 auditor for a specific task. Just like the task queue for annotators, there is a general audit queue and a queue for each individual auditor. When a user is assigned a task for auditing, the task is added to their personal queue. However, a task is only eligible for an L1 audit once it has been labeled, and for an L2 audit once it has been audited once already. When a user selects “Start Auditing”, eligible tasks are taken from their personal queue first; if their personal queue is empty, eligible tasks are taken from the general queue.
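A minimal sketch of the queueing rules described above, assuming illustrative task statuses and queue structures rather than the platform's actual data model:

```python
# Hypothetical sketch of the audit-queue selection rules; names are illustrative.
def next_task_for_auditor(personal_queue, general_queue, auditor_level):
    """Return the next auditable task, preferring the auditor's personal queue."""
    def eligible(task):
        if auditor_level == "L1":
            return task["status"] == "labeled"        # L1 audit needs a completed label
        return task["status"] == "audited_once"       # L2 audit needs a completed L1 audit

    for queue in (personal_queue, general_queue):
        for task in queue:
            if eligible(task):
                queue.remove(task)
                return task
    return None  # nothing eligible right now
```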
The aggregate metrics view for multiple evaluation runs can now aggregate rubrics and metrics across many evaluation runs at once for a given time period. It also lets users view numerical metric aggregates from multiple evaluation runs in several ways: users can select between Mean, Median, and Aggregate to determine how the result of each metric is shown across all evaluations.
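As a hedged illustration of the aggregation options, here is how mean and median differ over hypothetical per-run metric values, using Python's statistics module as a stand-in for the platform's aggregation:

```python
import statistics

runs = {"run_a": 0.82, "run_b": 0.74, "run_c": 0.91}  # hypothetical per-run scores

print(statistics.mean(runs.values()))    # 0.823...
print(statistics.median(runs.values()))  # 0.82
```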
When configuring a question, users can set a default value for it. If the human or LLM annotator leaves the question blank, the default value is used. This default value is also used in all aggregate displays and calculations.
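A small sketch, with hypothetical answer values, of how a configured default could be substituted for blank answers before aggregation:

```python
# Illustrative only: None represents a question the annotator left blank.
DEFAULT_VALUE = 3  # e.g. a neutral rating configured for the question

answers = [5, None, 4, None, 2]
filled = [a if a is not None else DEFAULT_VALUE for a in answers]

print(sum(filled) / len(filled))  # aggregates include the default, here 3.4
```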
Users can set a character limit for text questions when configuring the question. Annotators will not be able to type more than this character limit when answering the question.
The annotation task list is now automatically refreshed after a task is completed, so the status of tasks is updated without the user having to refresh manually.
The “Inline LaTeX” toggle on the test case results page now works as expected. Additionally, test cases containing long words break in-line instead of overflowing in the test case results page and annotation UI.
The task dashboard and contributor metrics tables now export properly formatted headers. Additionally, the “Time Spent” columns are now exported in HH:MM:SS format instead of seconds.
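For reference, a minimal sketch of the seconds-to-HH:MM:SS conversion now applied to exported “Time Spent” columns (illustrative only):

```python
def format_time_spent(seconds: int) -> str:
    """Convert a duration in seconds to HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

print(format_time_spent(4271))  # "01:11:11"
```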
External variants are now filtered out from selection in the “Start from existing variant” dropdown when creating a new variant from the application page. This option is completely hidden if all variants are external.
Users can now configure and run flexible evaluations through the UI. This also means that all evaluations can be done through the UI. See below for how to run a flexible evaluation through the UI.
1. Navigate to the Evaluation Dataset page to create a dataset.
2. Use the modal to upload a flexible dataset. Be sure to name the dataset and select the type as Flexible.
3. Navigate to the Application Variant page to upload outputs for the dataset. Select Upload Outputs.
4. Choose the flexible dataset that was uploaded and upload the corresponding outputs for that dataset. The modal will automatically show the matching columns between your flexible dataset and your outputs. You need at least one matching column header to upload outputs through the UI (see the sketch below).
5. The outputs will then show up on the Application Variant page.
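The column-matching requirement in step 4 can be illustrated with a small sketch (the column names are hypothetical, not required headers):

```python
# Illustrative check: outputs can only be uploaded if at least one column header
# matches between the flexible dataset and the outputs file.
dataset_columns = {"prompt", "context", "expected_answer"}
output_columns = {"prompt", "generated_answer"}

matching = dataset_columns & output_columns
if not matching:
    raise ValueError("Outputs must share at least one column header with the dataset.")
print(f"Matching columns used to join outputs: {sorted(matching)}")
```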
Configure the Annotation Configuration and Run Evaluation
After uploading the dataset and outputs, users can now also run an evaluation through the UI for flexible datasets. Navigate to the Evaluations UI and select the application, dataset, and question set. Note that this step is mandatory for flexible evaluation dataset schemas. If you choose to run a Contributor evaluation, you can now configure the layout of what the human annotator will see when they run the evaluation.
Select Multiple Metrics Calculated with Different Columns from the Dataset
Users can now select several instances of the same evaluation metric to compare. For example, users can select multiple BLEU scores for each evaluation run and indicate which data columns are used to compute each BLEU score.
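A hedged sketch of computing the same metric from different column pairs, using sacrebleu as a stand-in for the platform's BLEU implementation and hypothetical column names:

```python
import sacrebleu

rows = [
    {"reference_a": "the cat sat on the mat",
     "reference_b": "a cat is sitting on the mat",
     "model_output": "the cat is on the mat"},
]

# Compute BLEU twice, once per reference column, against the same model outputs.
for ref_col in ("reference_a", "reference_b"):
    refs = [[row[ref_col] for row in rows]]
    hyps = [row["model_output"] for row in rows]
    score = sacrebleu.corpus_bleu(hyps, refs)
    print(f"BLEU vs {ref_col}: {score.score:.1f}")
```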
All Columns in the Dataset Schema Are Available on the Task Dashboard as Hidden Columns
Users can now search for task related data and filter the rows on the task dashboard based on the data in the test case of the relevant dataset. This simplifies the process of finding test cases based on their content in the annotation task dashboard.
Fixed an issue where Autoevaluation results were not showing up on the test case results view for Hybrid Evals if the Manual evaluation was not yet done.
Since reasoning and prompt only apply to autoevaluations, they now only show up for autoevaluations. Previously, they showed up as empty fields for Human Evaluations.
Variables now properly show up in both the auto evaluation prompt editor and metrics configuration drop down selectors for fully flexible datasets in ReBAC enabled deployments.
Auto Evaluation Support for all Flexible Evaluations
Users can now run auto evaluations for any flexible dataset. The prompt template is fully customizable and allows users to insert any variables defined in the dataset. Variables for the dataset will show up on the side of the prompt template.
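As a rough illustration (the platform's actual template syntax may differ), substituting dataset variables into a customizable prompt template might look like this:

```python
from string import Template

# Hypothetical grading prompt; $document and $summary stand in for dataset variables.
template = Template(
    "You are grading a summary.\n"
    "Source document: $document\n"
    "Candidate summary: $summary\n"
    "Rate the summary from 1 to 5 and explain your reasoning."
)

row = {"document": "…full source text…", "summary": "…model summary…"}
print(template.substitute(row))
```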
“Test case ID” is now used consistently across the annotation task dashboard and the evaluation results page, so IDs can easily be used to compare and match tasks.
Filter Task Dashboard Based On Responses to Evaluation Questions
Users now have the ability to see the responses to evaluation questions on the task dashboard. These columns are hidden by default, but can be shown and used to filter the table.
Wrong timestamps for “Updated At” in task dashboard
Fixed an issue that populated “updated at” for all tasks of an auto evaluation with the time the overall auto-evaluation run finished. Now, each task shows the correct time it was last updated during an auto-evaluation run.
Contributor metrics aggregation shows the correct amount of time for auto evaluation runs
Fixed an issue where the time taken by the LLM evaluator was aggregated incorrectly. Because the LLM evaluator executes tasks in parallel, the total time for the evaluation run now equals the actual wall-clock runtime of the model instead of the sum of the individual tasks (which is how contributor evaluations are aggregated).
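To illustrate the distinction with hypothetical timestamps:

```python
# Hypothetical (start, end) timestamps in seconds for tasks run in parallel.
tasks = [
    (0.0, 4.0),
    (0.5, 3.5),
    (1.0, 5.0),
]

sum_of_tasks = sum(end - start for start, end in tasks)                        # 11.0 s (contributor-style sum)
wall_clock = max(end for _, end in tasks) - min(start for start, _ in tasks)   # 5.0 s (parallel LLM evaluator)
print(sum_of_tasks, wall_clock)
```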
Fixed a caching issue in our custom prompt store for autoevaluations. Now all custom prompts are saved upon configuration and executed for autoevaluations.
Auto Evaluation Support for Summarization and Translation Use Cases
Auto Evaluations are now available for summarization and translation use cases on the platform! Users can kick off these auto evaluations through either the UI or the SDK.
Users can modify the prompt for autoevaluations, and then see the prompt and reasoning in the results page for the autoevaluation.
Kick off an auto evaluation.
Configure the auto evaluation. Choose model and edit prompt.
View Autoevaluation results in the test case results view
View Autoevaluation results in the annotations view
For more information on how to kick off Auto-evaluations, check out our documentation.
Users can now save their progress on a task and come back to it. The timer will only count the time they worked on the task towards the total time of the task. When the user resumes a task they have saved, they will be able to see the questions they’ve already answered.
If another user picks up the task, they will not be able to see any of the original user’s progress.
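A simplified sketch of the save-and-resume behavior, assuming illustrative session records rather than the platform's actual data model:

```python
from collections import defaultdict

# (user, seconds of active work on the task); only active time counts toward the total,
# and progress is tracked per user, so a different annotator starts fresh.
sessions = [
    ("alice", 300),  # first session, then saved
    ("alice", 120),  # resumed later
    ("bob", 200),    # another user picking up the task sees none of alice's answers
]

time_per_user = defaultdict(int)
for user, seconds in sessions:
    time_per_user[user] += seconds

print(dict(time_per_user))  # {'alice': 420, 'bob': 200}
```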
We launched a series of improvements to the annotations UI for our annotators.
Checkmark icons now only show up for completed questions. Previously, we showed green vs. gray checkmarks on all questions, with green indicating that the question was completed, which was challenging for color-blind users.
The UX for the previous and next questions has been improved. The “Previous” button will only show up if the current question is not the first question. The “Next” button will only show up if the current question is not the last question.
URLs in the annotation view are now clickable.
We have set up a healthcheck alert in SGP Azure VPC environments to enable quick turnaround for any service downtime. Upon failure of the healthcheck, an alert rule will trigger a PagerDuty action group, which can be connected to both Scale and customer PagerDuty services.
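For context, a minimal sketch of the kind of HTTP health probe such an alert rule depends on; the endpoint below is hypothetical, and the actual alerting is configured in Azure (alert rule to action group to PagerDuty) rather than in application code:

```python
import requests

def is_healthy(url: str = "https://sgp.example.internal/health") -> bool:
    """Return True if the service health endpoint responds with HTTP 200."""
    try:
        return requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        return False

print(is_healthy())
```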
The ID rendered on the annotation view (from the “Start Labeling” flow) now properly reflects the value in the “Test Case ID” column on the task dashboard. Additionally, the truncated IDs on the task dashboard are now searchable using the fully qualified ID in browser search.
Users now have the ability to compare application performance across all variants inside the SGP platform through the Aggregate Metrics View.
Users can access this feature by navigating to an application and selecting “Show Metrics” on the actions tab above.
This view allows you to see and compare the performance of all variant runs that were run against the same dataset and rubric. Users can also choose a Table View by toggling the button from Graph to Table. In the table view, users can export the data by selecting the “Export as CSV” button.
Removed Unused Filter Options from Annotation Project Overview
Removed unused options for “Studio” and “Scale GP” to filter Annotation projects, which were artifacts of previous versions of the platform. This avoids user confusion.
Previously, a claimed task would not show up as claimed in the task dashboard without a page refresh. With this fix, the task shows as claimed immediately when the button is clicked.
The Task Dashboard previously had long loading times if more than a few hundred tasks were included in a project. After optimizing frontend queries and introducing pagination, loading times have been reduced significantly.
Users now have the ability to conduct human-expert evaluations entirely within Scale GP, eliminating the need to switch to Scale Studio as previously required. This capability is made possible by a new task queuing backend and a user interface designed for viewing and starting evaluation tasks. Features allowing users to prioritize annotation projects and assign annotators to those projects are scheduled for release in the near future.

To facilitate seamless annotation workflows, Scale GP offers role-based access control with three user roles: Admin, Manager, and Annotator. Managers and Admins can interact with the entire Scale GenAI Platform and manage evaluation runs, including assigning annotation projects and priorities, while Annotators (typically SMEs) can focus entirely on the evaluation tasks (see the sketch below).

How to Use
To get started, users can switch to the new “Annotation” tab on the sidebar and view currently active annotation projects that are ready for labeling. The view also allows users to view legacy annotation projects, which are executed using the external Scale Studio queuing system.
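The three-role model can be pictured with a hypothetical permission map; the permission names are illustrative, and the platform enforces access control server-side:

```python
# Illustrative role-to-permission mapping, not the platform's actual policy definitions.
ROLE_PERMISSIONS = {
    "admin":     {"manage_platform", "manage_evaluation_runs", "assign_projects", "annotate"},
    "manager":   {"manage_evaluation_runs", "assign_projects", "annotate"},
    "annotator": {"annotate"},
}

def can(role: str, permission: str) -> bool:
    return permission in ROLE_PERMISSIONS.get(role, set())

print(can("annotator", "assign_projects"))  # False: annotators focus on evaluation tasks
```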
Users can now create and manage their Scale GP knowledge bases via the UI. This makes it significantly easier to monitor the assets uploaded in a knowledge base and to track longer uploads as they progress. It also enables users to quickly find and use a relevant knowledge base in the AI playground or when building a custom application with our SDK.
How to Use
To get started, navigate to the “Knowledge Bases” tab on the left to see your existing resources. If there are none, you can create a new knowledge base via the API or directly in the UI by uploading assets from your local device. We will soon follow up with automated ingestion from data sources like S3 and Azure Blob Storage.
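Purely as an illustration of the API path, creating a knowledge base and uploading a local asset might look like the following; the endpoints, fields, and authentication shown are hypothetical, so consult the API documentation for the actual calls:

```python
import requests

BASE = "https://api.example.com/v1"            # hypothetical base URL, not the real SGP endpoint
headers = {"Authorization": "Bearer <API_KEY>"}

# Create a knowledge base (hypothetical route and payload).
kb = requests.post(f"{BASE}/knowledge-bases", json={"name": "support-docs"}, headers=headers).json()

# Upload a local asset into it (hypothetical route).
with open("handbook.pdf", "rb") as f:
    requests.post(
        f"{BASE}/knowledge-bases/{kb['id']}/assets",
        files={"file": f},
        headers=headers,
    )
```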
Users can now seamlessly build sample GenAI applications in Scale GP without writing a single line of code. The AI playground provides application templates for typical GenAI use cases such as Retrieval Augmented Generation (RAG), which can be configured in a no-code setup flow, selecting any of the knowledge bases and models that are available on the platform. After completing a setup, users can easily export the sample application for further customization or deployment.
How to Use
To build your first application, switch to the “Playground” tab and begin by selecting a template to configure, then follow the instructions and experiment with components until the desired outcome is achieved.
How to Use
To get started with fine-tuning your own model, you can click “Create New Fine-Tune” on the “Models” page and follow the steps, or you can use our API or SDK. Note that you will need a correctly formatted training dataset to create a new fine-tune. Please refer to the API documentation for further guidance on dataset format.
After creating a new fine-tuning job, you can monitor the progress on the “Fine Tuning Jobs” page. When a fine tune is completed, it will automatically appear as a new model on the “Models” page.
Restructured the annotation view to show generated and expected output, as well as context side by side, which helps annotators compare results to ground truth more efficiently. We will also soon be adding functionality to better navigate retrieved context chunks and a simplified flow to copy sections from context into the evaluation form.
How to Use
To try the new evaluation layout, you can go to “Annotation” and start labeling any “Scale GP” project.
Reports for Evaluation Results and Annotator Performance
Users are now able to see metrics and charts for evaluation runs, which makes it significantly easier to analyze results, track progress and identify regressions. We currently support charts and metrics for each individual evaluation run, as well as for the collection of all evaluation runs of a single application over time. Admin and manager users can also download reports for further analysis and record keeping.
How to Use
To find evaluation reports, you can either go to a single “Evaluation Run” and switch to the “Metrics” tab, or go to an “Application Detail” page and switch to the “Metrics” tab.