Create Evaluation

Add Evaluation Details

Enter an evaluation name, an optional description, optional tags, and select a dataset.

Add an LLM Judge

An LLM Judge prompts an LLM to answer questions that evaluate your dataset. You can either select an existing judge or configure a new one.

Configure LLM Judge

Configure the judge with the following fields. A minimal sketch of how these pieces fit together follows the list.

  • Alias (Optional) - The name of the column under which this judge's results appear in your evaluation results.
  • Model - The model that the LLM Judge uses to evaluate the dataset.
  • System Prompt (Optional) - A set of instructions that defines the judge's role and evaluation criteria, i.e., how the model should assess, score, or make decisions about the content it is given.
  • Rubric - A structured scoring framework that defines the evaluation criteria, performance levels, and descriptors the judge uses, so that its assessments stay consistent and objective.
  • Response Options - Constraints on what the model can output, such as a fixed set of labels the judge must choose from.
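To make these fields concrete, the sketch below shows how a judge could be wired up outside the UI: a system prompt sets the judge's role, the rubric defines the criteria, and the response options constrain the verdict to a fixed label. This is an illustrative example only, not the platform's implementation; the OpenAI-compatible client, model name, prompts, and labels are all assumptions.

```python
# Illustrative sketch of an LLM Judge: a system prompt defines the judge's role,
# a rubric defines the criteria, and response options constrain the verdict.
# Assumes an OpenAI-compatible API; model name, prompts, and labels are examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an impartial judge. Score how well the response answers the "
    "question, strictly following the rubric."
)
RUBRIC = (
    "correct: fully answers the question with accurate information.\n"
    "partial: answers the question but omits details or has minor errors.\n"
    "incorrect: fails to answer the question or has major errors."
)
RESPONSE_OPTIONS = ["correct", "partial", "incorrect"]


def judge_row(question: str, response: str) -> str:
    """Return one of RESPONSE_OPTIONS for a single dataset row."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # the Model selected for the judge
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": (
                    f"Rubric:\n{RUBRIC}\n\n"
                    f"Question: {question}\nResponse: {response}\n\n"
                    f"Answer with exactly one of: {', '.join(RESPONSE_OPTIONS)}."
                ),
            },
        ],
    )
    verdict = completion.choices[0].message.content.strip().lower()
    # Fall back to the most conservative label if the model strays off-format.
    return verdict if verdict in RESPONSE_OPTIONS else "incorrect"
```

The Alias configured above corresponds to the name of the results column this judge writes, which you will see again when viewing the evaluation results.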

Create Evaluation

Select the rows of the dataset you want to run the evaluation on, and click Create Evaluation.
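As a rough local analogy (not the platform's implementation), running the judge over the selected rows produces a new results column, which is what later appears on the Data page. The dataset columns are hypothetical, and the judge_row helper comes from the sketch above.

```python
# Illustrative only: apply the judge to the selected rows and store the verdicts
# in a results column (named after the judge's Alias). Reuses the hypothetical
# judge_row helper from the previous sketch.
import pandas as pd

dataset = pd.DataFrame(
    {
        "question": ["What is the capital of France?", "Who wrote Hamlet?"],
        "response": ["Paris is the capital of France.", "Charles Dickens."],
    }
)

selected = dataset.iloc[:2]  # the rows selected for this evaluation
dataset.loc[selected.index, "llm_judge_verdict"] = [
    judge_row(q, r) for q, r in zip(selected["question"], selected["response"])
]
print(dataset[["question", "llm_judge_verdict"]])
```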

View Evaluation Results

If you navigate back to the Evaluation tab, you should be able to see the results of the evaluation.

Data

The Data page shows a column with the results of the LLM Judge.

Overview

The Overview page shows a graph that visualizes the evaluation results.