Data sources are used to create knowledge bases on the GenAI Platform. A knowledge base can be connected to a data source through either a one-time or a periodic sync configuration. When a knowledge base upload is created from a data source, the platform reads the data, extracts text from relevant files, splits the text into chunks, embeds the chunks, and stores the embeddings in a vector database for future retrieval.
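The chunking step of that pipeline can be pictured with a minimal sketch. The platform's actual splitting strategy, chunk size, and overlap are not described in this guide, so the function name chunk_text and the parameters chunk_size and overlap are hypothetical and chosen only for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap by `overlap`.

    Illustrative only: the real ingestion pipeline may split on tokens,
    sentences, or document structure instead of raw characters.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlapping windows are a common choice because a sentence cut at one chunk boundary still appears whole in the neighboring chunk.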

Connecting a Data Source

Data sources are connected inside the platform. The connection steps depend on the data source you choose. Currently, the GenAI Platform supports the following sources:
  • AWS S3
  • Google Drive
  • Azure Blob Storage
  • MS SharePoint
See below for how to connect a specific data source.

AWS S3

In order to connect an S3 bucket, you need to:
  1. [In AWS] Configure an IAM Role and Trust Relationship.
  2. [In AWS] Configure an IAM Policy attached to that role to grant S3 privileges.
  3. [In SGP] Connect your data source.

Configuring an IAM Role

  1. Navigate to the IAM Service on the AWS Console.
  2. Select Roles on the left hand navigation pane.
  3. Select the following configurations:
    1. Create role for AWS Account.
    2. Select Another AWS account. The Scale AWS Account ID is 307185671274.
    3. Check the “Require external ID” box.
      1. External ID: (located in SGP below)
External ID in SGP for AWS account
The goal of AWS’s external ID is to protect against Scale accidentally assuming the role of a different external account. For this reason, we map it to your account ID and do not allow it to be modified. See the AWS documentation on external IDs for more details.
  4. Click Next, and you should see a Permissions tab. This maps to the policies that we will create in the next section. If you already have a policy that matches, you can attach it directly and skip the next section. Otherwise, click Next again.
  5. At the last step, name the role ScaleAI-Integration. Below that pane, you should see the generated IAM Trust Policy for the role.

Sample IAM Role Trust Relationship

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Principal": {
				"AWS": "arn:aws:iam::307185671274:root"
			},
			"Action": "sts:AssumeRole",
			"Condition": {
				"StringEquals": {
					"sts:ExternalId": "<ScaleAccountID>"
				}
			}
		}
	]
}

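As a sketch of how the sample trust relationship is put together, the snippet below builds the same document in Python from an external ID. The function name build_trust_policy and the placeholder external ID are assumptions made for illustration; the real external ID is the value shown in SGP.

```python
import json

# Scale AWS account ID as given in this guide.
SCALE_AWS_ACCOUNT_ID = "307185671274"


def build_trust_policy(external_id: str) -> dict:
    """Return a trust policy that lets the Scale account assume the role
    only when it presents the given external ID."""
    if not external_id:
        raise ValueError("external_id must not be empty")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{SCALE_AWS_ACCOUNT_ID}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # "example-external-id" is a placeholder, not a real value.
    print(json.dumps(build_trust_policy("example-external-id"), indent=2))
```

Comparing this output with the trust policy that the AWS Console generates is a quick way to confirm the role was configured as intended.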
Configuring an IAM Policy for the Role

  1. Navigate back to the Roles tab and click on the ScaleAI-Integration role.
ScaleAI-Integration role
  2. Go to the block called Permissions Policies. On the right-hand side, select Add Permissions, and then select Create Inline Policy.
Create Inline Policy
  3. Under Select a Service, select S3. SGP needs two permissions to read the files in your bucket:
    1. ListBucket, under the List section
    2. GetObject, under the Read section
S3 permissions ListBucket and GetObject
  4. Next, look for the Resources section below the Permissions section. These resources determine which buckets and objects Scale’s GenAI Platform can read.
Resources section for S3 permissions
For best results with our ingestion pipelines, we recommend specifying certain buckets and then allowing all objects within those buckets to be read.
  5. To select a bucket to give access to, click Add ARNs in the bucket row.
Add ARNs for buckets
  6. To select which objects in that bucket to give access to, click Add ARNs in the object row. For best results, we recommend allowing the entire bucket to be read.
Add ARNs for objects
  7. After selecting the buckets and objects to grant access to, give the policy a name, and then select Create Policy.
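The inline policy produced by the steps above should be roughly equivalent to the JSON generated by this sketch. The function name build_s3_read_policy and the bucket name in the usage note are hypothetical. The point it illustrates is that ListBucket is granted on the bucket ARN, while GetObject is granted on the object ARNs under it.

```python
import json


def build_s3_read_policy(bucket: str) -> dict:
    """Return a policy granting ListBucket on the bucket and GetObject on
    every object inside it."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": bucket_arn,
            },
            {
                # GetObject is an object-level action; "/*" allows the whole bucket.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


if __name__ == "__main__":
    # "example-docs-bucket" is a placeholder bucket name.
    print(json.dumps(build_s3_read_policy("example-docs-bucket"), indent=2))
```

The JSON tab of the inline policy editor in the AWS Console accepts a document of this shape directly.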

Connect Your Data Source

  1. Navigate to Data Sources in the GenAI Platform.
  2. Click New Data Source and select AWS S3.
New Data Source - AWS S3
  3. Fill out the following:
AWS S3 Data Source form
    1. AWS Account ID: Your AWS account ID.
    2. S3 Bucket Name: The name of the S3 bucket where the data is stored.
      Note: If you have multiple S3 buckets, you’ll have to connect them as separate data sources.
    3. S3 Bucket AWS Region: The region your bucket is located in.
  4. Connect your data source. This can take a while, so be patient while the screen is loading.
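Before submitting the form, the three values can be sanity-checked locally. The sketch below is an assumption about what a reasonable pre-flight check looks like, not something the platform is documented to run: the function name check_s3_source is hypothetical, the bucket pattern covers only the basic S3 naming rules, and the region pattern is a loose format check rather than a list of valid regions.

```python
import re
from typing import List

ACCOUNT_ID_RE = re.compile(r"^\d{12}$")  # AWS account IDs are 12 digits
BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")  # 3 to 63 characters
REGION_RE = re.compile(r"^[a-z]{2}(-[a-z]+)+-\d+$")  # e.g. us-east-1 (format only)


def check_s3_source(account_id: str, bucket: str, region: str) -> List[str]:
    """Return a list of problems with the form values (empty if none found)."""
    problems = []
    if not ACCOUNT_ID_RE.match(account_id):
        problems.append("AWS Account ID should be exactly 12 digits")
    if not BUCKET_RE.match(bucket):
        problems.append("S3 Bucket Name does not look like a valid bucket name")
    if not REGION_RE.match(region):
        problems.append("S3 Bucket AWS Region does not look like a region code")
    return problems
```

An empty result only means the values are well formed; whether the role and policy actually grant access is confirmed when the data source connects.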

Google Drive

In order to connect a Google Drive folder, you need to:
  1. [In Google Cloud] Set up a service account.
  2. [In Google Cloud] Enable the Google Drive API.
  3. [In Google Cloud] Create a Key.
  4. [In Google Drive] Share the Drive folder with the service account.
  5. [In SGP] Connect your data source.

Set up a service account

  1. Navigate to the Google Cloud Console.
  2. Create a new project or select an existing one.
Create or Select Project
  3. Enable API access for this project.
Enable API Access
  4. Navigate to Service Account in the left-side menu. Click Create Service Account.
Create Service Account
  5. Create the service account:
    1. Fill in the service account details (up to you).
    Service Account Details
    2. Grant service account roles and access (up to you). We recommend the Owner role.
    Grant Service Account Roles

Enable the Google Drive API

  1. Navigate to the API Console.
  2. Select the project you created earlier.
  3. Open the side menu, select APIs & services, and then select Library.
APIs & Services Library
  4. Search for the Google Drive API in the Library.
Google Drive API Search
  5. Click ENABLE.
Enable Google Drive API
For more detailed instructions, see the Google documentation on enabling APIs.

Create a Key

  1. Select the created service account.
Select Service Account
  2. Navigate to the KEYS tab.
KEYS Tab
  3. Select ADD KEY, and then create a new JSON key.
Add Key - Create JSON Key
  4. The private key file is saved to your computer. Its credentials are used later when you connect your data source.
Private Key Saved

Share the Drive folder with the service account

Share the Drive folder you want to connect with the service account you just created, using the service account's email address (the client_email value in the key file).

Connect your data source

Information from the Key created

You will need the following fields from the JSON below: client_email, client_id, private_key, and token_uri.
For the private key, be sure to copy and paste the entire string from the file, including the “BEGIN PRIVATE KEY” and “END PRIVATE KEY” sections.
{
  "type": "service_account",
  "project_id": "sgp-g-drive-data-source-test",
  "private_key_id": "9ba310c026227509efd6fa5917d1b8d971791811",
  "private_key": "-----BEGIN PRIVATE KEY-----\n*****redacted*****\n-----END PRIVATE KEY-----\n",
  "client_email": "scaleai-service-account@sgp-g-drive-data-source-test.iam.gserviceaccount.com",
  "client_id": "118190969327709508303",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/scaleai-service-account%40sgp-g-drive-data-source-test.iam.gserviceaccount.com",
  "universe_domain": "googleapis.com"
}
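To avoid copy-and-paste mistakes, the four fields can be pulled out of the downloaded key file with a short script. This is a sketch under assumptions: the function names select_connection_fields and load_connection_fields are hypothetical, and the checks only confirm that the fields are present and that the private key still carries its BEGIN and END sections.

```python
import json

REQUIRED_FIELDS = ("client_email", "client_id", "private_key", "token_uri")


def select_connection_fields(key: dict) -> dict:
    """Return only the fields needed to connect the data source."""
    missing = [name for name in REQUIRED_FIELDS if not key.get(name)]
    if missing:
        raise ValueError("key is missing fields: " + ", ".join(missing))
    private_key = key["private_key"]
    if "BEGIN PRIVATE KEY" not in private_key or "END PRIVATE KEY" not in private_key:
        raise ValueError("private_key must include its BEGIN and END sections")
    return {name: key[name] for name in REQUIRED_FIELDS}


def load_connection_fields(key_path: str) -> dict:
    """Read a service account key file and return the required fields."""
    with open(key_path, encoding="utf-8") as handle:
        return select_connection_fields(json.load(handle))
```

Note that json.load turns the escaped \n sequences in private_key into real line breaks, so the value printed from Python spans several lines.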

Google Drive Folder ID

The folder ID is the last segment of the folder's URL in Google Drive, after /folders/.
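A Google Drive folder URL has the form https://drive.google.com/drive/folders/<folder_id>, so the ID can be read from the path segment that follows "folders". The sketch below shows one way to extract it; the function name drive_folder_id and the ID in the usage note are hypothetical.

```python
from urllib.parse import urlparse


def drive_folder_id(url: str) -> str:
    """Return the path segment that follows "folders" in a Drive folder URL."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if "folders" not in parts:
        raise ValueError("not a Google Drive folder URL")
    index = parts.index("folders")
    if index + 1 >= len(parts):
        raise ValueError("URL has no folder ID after /folders/")
    return parts[index + 1]
```

For example, drive_folder_id("https://drive.google.com/drive/folders/1AbCdEfGh?usp=sharing") returns "1AbCdEfGh", because query parameters such as usp=sharing are not part of the path.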

Connecting Data Source

With the information above, you can connect your data source.
ScaleAI stores only an encrypted version of your credentials, which is used to read files from the shared folders.

Azure Blob Storage

In order to connect an Azure Storage container, you need to:
  1. Identify the Container URL.
  2. Generate a shared access signature (SAS) and grant SGP restricted access to the Azure Storage container.
  3. Connect your data source.

Locate the Container URL

  1. Go to Microsoft Azure and locate your containers.
Azure Containers
  2. Select a container you have access to. If you don’t have a container yet, create one and upload data into it. Click the three dots on the right-hand side and select Container Properties.
Container Properties
  3. Locate the container URL (container_url) on this screen.
Container URL
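A container URL normally has the form https://<storage-account>.blob.core.windows.net/<container>. The sketch below splits such a URL into its two parts as a quick check that the container URL was copied, and not a blob URL or a SAS URL. The function name parse_container_url and the example names are hypothetical.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_container_url(container_url: str) -> Tuple[str, str]:
    """Return (storage_account, container) from an Azure container URL."""
    parsed = urlparse(container_url)
    host = parsed.hostname or ""
    segments = [segment for segment in parsed.path.split("/") if segment]
    if parsed.scheme != "https" or ".blob." not in host:
        raise ValueError("expected an https URL on a blob storage host")
    if len(segments) != 1:
        raise ValueError("expected exactly one path segment (the container name)")
    if parsed.query:
        raise ValueError("URL carries a query string; this looks like a SAS URL")
    return host.split(".")[0], segments[0]
```

For example, parse_container_url("https://examplestorage.blob.core.windows.net/documents") returns ("examplestorage", "documents").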

Generate SAS

  1. Navigate back to the list of containers.
List of Containers
  2. Click Generate SAS. This creates a URI that grants Scale’s GenAI Platform restricted access to the selected Azure Storage container. To access the data in this container, the platform needs Read and List privileges.
Generate SAS
  3. After generation, you will see a Blob SAS token and a Blob SAS URL. To connect a data source, you need the Blob SAS token.
Blob SAS Token and URL

Connect your data source

Connect Data Source
Enter the Blob SAS token and the container URL from the previous steps, and then connect your data source.
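A Blob SAS token is a query string whose sp parameter lists the signed permissions (r for Read, l for List) and whose se parameter holds the expiry time. The sketch below checks a token for the two privileges named above and for expiry before it is pasted into the form. The function name check_sas_token is hypothetical, and the platform is not documented to run this check itself.

```python
from datetime import datetime, timezone
from typing import List, Optional
from urllib.parse import parse_qs


def check_sas_token(token: str, now: Optional[datetime] = None) -> List[str]:
    """Return a list of problems found in a Blob SAS token (empty if none).

    `now` must be timezone-aware if given; it defaults to the current UTC time.
    """
    params = parse_qs(token.lstrip("?"))
    problems = []

    # "sp" holds the signed permissions: "r" is Read, "l" is List.
    permissions = params.get("sp", [""])[0]
    for flag, name in (("r", "Read"), ("l", "List")):
        if flag not in permissions:
            problems.append(f"missing {name} permission")

    # "se" holds the expiry time in ISO 8601 UTC form.
    expiry = params.get("se", [""])[0]
    if expiry:
        try:
            expires_at = datetime.fromisoformat(expiry.replace("Z", "+00:00"))
        except ValueError:
            problems.append("unreadable expiry time")
        else:
            if expires_at.tzinfo is None:
                expires_at = expires_at.replace(tzinfo=timezone.utc)
            if expires_at <= (now or datetime.now(timezone.utc)):
                problems.append("token expired")
    return problems
```

The sig parameter is never inspected, so the check can run locally without exposing the signature to any other service.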

MS SharePoint

In order to connect a SharePoint data source, you need to:
  1. Create a new App Registration.
  2. Grant SharePoint read access to the App Registration.
  3. Create a Client Secret for the App Registration.
  4. Obtain the SharePoint site ID.
  5. Connect the Data Source.

Create a new App Registration

  1. Navigate to Microsoft Entra ID in the Azure Portal under the organization that owns the SharePoint site you would like to use as a data source.
  2. Go to App registrations in the left-hand sidebar and click New registration.
New Registration
  3. Enter a name for the application and select Accounts in any organizational directory for Supported account types.
  4. After creating the new registration, navigate to the registration’s Overview page to find its client_id and tenant_id.
App Overview

Grant SharePoint read access to the App Registration

  1. Navigate to API Permissions in the sidebar.
API Permissions
  2. Click Add permission.
    1. Under Select an API, choose Microsoft Graph.
    2. Select Application permissions as the permission type.
    3. Check Sites.Read.All under Sites.
Add Permission
Note: You may need to reach out to an organization admin to have this permission request approved.

Create a Client Secret for the App Registration

  1. Navigate to Certificates and Secrets in the sidebar.
Certificates and Secrets
  2. Click New client secret and choose a name and expiration date.
Note: You will need to update the client_secret for the data source after the expiration date.
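One way to confirm that the tenant ID, client ID, and client secret work together is to request a Microsoft Graph token with the OAuth 2.0 client credentials flow. The sketch below only builds the request and does not send it. The function name build_token_request is hypothetical, and how the platform itself authenticates is not described in this guide.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> Request:
    """Build a client credentials token request for Microsoft Graph."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            # ".default" requests the application permissions already granted
            # to the app registration, such as Sites.Read.All.
            "scope": "https://graph.microsoft.com/.default",
        }
    ).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Passing the returned request to urllib.request.urlopen sends it; a JSON response containing an access_token field indicates that the credentials are valid.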

Obtain the SharePoint site ID

  1. Navigate to the SharePoint admin center and find the site that you want to use as a data source.
  2. Select the site and extract the site_id from the URL. It appears at the end of the URL path, after /SiteDetails.
SharePoint Site ID
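Extracting the site ID can be sketched as taking whatever follows /SiteDetails/ in the admin center URL. The function name site_id_from_admin_url is hypothetical, and the URL used in the usage note is a made-up example of the shape described above, not a real site.

```python
import re


def site_id_from_admin_url(url: str) -> str:
    """Return the segment that follows /SiteDetails/ in an admin center URL."""
    _, found, tail = url.partition("/SiteDetails/")
    if not found:
        raise ValueError("URL does not contain /SiteDetails/")
    # Keep only the first segment after the marker, dropping any trailing
    # path, query string, or fragment.
    site_id = re.split(r"[/?#]", tail, maxsplit=1)[0]
    if not site_id:
        raise ValueError("URL has no site ID after /SiteDetails/")
    return site_id
```

For example, a URL ending in /SiteDetails/0a1b2c3d-0000-1111-2222-333344445555 yields 0a1b2c3d-0000-1111-2222-333344445555. Plain string splitting is used instead of a URL parser because the marker may sit in the fragment part of the URL instead of the path.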

Connect the Data Source

Populate the form with the Client Secret, Client ID, Tenant ID, and Site ID from the previous steps.
Connect Data Source