Data sources are used to create knowledge bases on the GenAI Platform. A knowledge base can be connected to a data source through either a one-time or a periodic sync configuration.

When a knowledge base upload is created from a data source, the platform reads data from the source, extracts text from relevant files, splits the text into chunks, embeds the chunks, and stores the embeddings in a vector database for future retrieval.
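
As a rough illustration of the steps above, the following minimal Python sketch chunks text, embeds each chunk, and stores the result for retrieval. It is not the platform's implementation: `split_into_chunks`, `toy_embed`, and `InMemoryVectorStore` are hypothetical names, the chunk size and overlap are arbitrary, and the hash-based embedding only stands in for a real embedding model.

```python
import hashlib
import math
from typing import Dict, List, Tuple


def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into fixed-size character windows with a small overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    windows = (text[i:i + chunk_size] for i in range(0, len(text), step))
    return [w for w in windows if w.strip()]


def toy_embed(chunk: str, dims: int = 8) -> List[float]:
    """Hypothetical stand-in for an embedding model: bucket words by hash, then normalize."""
    vec = [0.0] * dims
    for word in chunk.lower().split():
        bucket = hashlib.sha256(word.encode("utf-8")).digest()[0] % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, List[float]]] = []

    def add(self, chunk: str, embedding: List[float]) -> None:
        self.rows.append((chunk, embedding))

    def search(self, query: str, top_k: int = 1) -> List[str]:
        """Return the stored chunks most similar to the query (dot product of unit vectors)."""
        q = toy_embed(query)
        ranked = sorted(self.rows, key=lambda row: -sum(a * b for a, b in zip(row[1], q)))
        return [chunk for chunk, _ in ranked[:top_k]]


def ingest(files: Dict[str, str], store: InMemoryVectorStore) -> int:
    """Chunk, embed, and store already-extracted text; return the number of chunks stored."""
    count = 0
    for text in files.values():
        for chunk in split_into_chunks(text):
            store.add(chunk, toy_embed(chunk))
            count += 1
    return count


store = InMemoryVectorStore()
stored = ingest({"notes.txt": "x" * 450}, store)  # made-up file content
print(stored)
```

The sketch assumes text has already been extracted from each file. In a real pipeline, the embedding function and the store would be replaced by an embedding model and a vector database.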

Connecting a Data Source

Data sources are connected inside the platform. The connection steps depend on the data source you choose. Currently, the GenAI Platform supports the following sources:

  • AWS S3
  • Google Drive
  • Azure Blob Storage
  • MS SharePoint

See below for how to connect a specific data source.

AWS S3

To connect an S3 bucket, you need to:

  1. [In AWS] Configure an IAM Role and Trust Relationship.
  2. [In AWS] Configure an IAM Policy attached to that role to grant S3 privileges.
  3. [In SGP] Connect your data source.

Configuring an IAM Role

  1. Navigate to the IAM Service on the AWS Console.
  2. Select Roles on the left-hand navigation pane.
  3. Select the following configurations:
    1. Create role for AWS Account
    2. Another AWS account. The Scale AWS Account ID is: 307185671274
    3. Check the “Require external ID” box
      1. External ID: (located in SGP below)

AWS’s external ID protects against Scale accidentally assuming a role that belongs to a different external account. For this reason, we map the external ID to your account ID and do not allow it to be modified. You can find more details here.

  4. Click Next. You should see a Permissions tab, which corresponds to the policies created in the next section. If you already have a matching policy, attach it directly and skip the next section. Otherwise, click Next again.
  5. At the last step, name the role ScaleAI-Integration. Below that pane, you should see the generated IAM Trust Policy for the role.

Sample IAM Role Trust Relationship

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Principal": {
				"AWS": "arn:aws:iam::307185671274:root"
			},
			"Action": "sts:AssumeRole",
			"Condition": {
				"StringEquals": {
					"sts:ExternalId": "<ScaleAccountID>"
				}
			}
		}
	]
}

Configuring an IAM Policy for the Role

  1. Navigate back to the Roles tab and click on the ScaleAI-Integration role.
  2. Go to the block called Permissions Policies. On the right-hand side, select Add Permissions, and then select Create Inline Policy.
  3. Under Select a Service, select S3. SGP needs two permissions to read the files in your bucket:
    1. ListBucket, under the List section
    2. GetObject, under the Read section
  4. Next, look for the Resources section below the Permission section. The resources selected here determine which buckets and objects SGP can read.

For best results with our ingestion pipelines, we recommend specifying certain buckets and then allowing all objects within that bucket to be read.

  5. To select a bucket to give access to, click Add ARNs in the bucket row.
  6. To select which objects in that bucket to give access to, click Add ARNs in the object row. For best results, we recommend allowing the entire bucket to be read.
  7. After selecting the buckets and objects, give the policy a name, and then select Create Policy.
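
For reference, the inline policy produced by the steps above would look roughly like the document built by this Python sketch. The helper name `build_read_only_policy` and the bucket name `my-example-bucket` are hypothetical. The sketch assumes the recommended setup of one bucket with all of its objects readable.

```python
import json


def build_read_only_policy(bucket_name: str) -> dict:
    """Build an inline policy granting ListBucket on the bucket and GetObject on its objects."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": bucket_arn,  # bucket-level ARN
            },
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"{bucket_arn}/*",  # every object in the bucket
            },
        ],
    }


policy = build_read_only_policy("my-example-bucket")  # hypothetical bucket name
print(json.dumps(policy, indent=2))
```

Note that ListBucket applies to the bucket ARN, while GetObject applies to object ARNs, which is why the two statements use different resources.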

Connect Your Data Source

  1. Navigate to Data Sources in the GenAI Platform.
  2. Click New Data Source and select AWS S3.
  3. Fill out the following:
    1. AWS Account ID: Your AWS Account ID
    2. S3 Bucket Name: The name of the S3 bucket where the data is stored
      1. Note: If you have multiple S3 buckets, you’ll have to connect them as separate data sources.
    3. S3 Bucket AWS Region: The region your bucket is configured in
  4. Connect your data source. This can take a while, so be patient while the screen is loading.
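
Before submitting the form, it can help to sanity-check the three values. The sketch below is a hypothetical local check, not something the platform provides. It only covers the basic shape of an AWS account ID (12 digits) and the core S3 bucket naming rules, so a value that passes may still be rejected.

```python
import re
from typing import List

ACCOUNT_ID_RE = re.compile(r"\d{12}")
# 3-63 characters: lowercase letters, digits, dots, hyphens; starts and ends with a letter or digit
BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def validate_s3_form(account_id: str, bucket_name: str, region: str) -> List[str]:
    """Return a list of problems found in the form values (empty if none were found)."""
    errors = []
    if not ACCOUNT_ID_RE.fullmatch(account_id):
        errors.append("AWS Account ID should be exactly 12 digits")
    if not BUCKET_NAME_RE.fullmatch(bucket_name):
        errors.append("S3 Bucket Name should be 3-63 lowercase letters, digits, dots, or hyphens")
    if not region.strip():
        errors.append("S3 Bucket AWS Region should not be empty")
    return errors


# Made-up example values
print(validate_s3_form("123456789012", "my-example-bucket", "us-east-1"))
```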

Google Drive

In order to connect a Google Drive folder, you need to:

  1. [In Google Cloud] Set up a service account.
  2. [In Google Cloud] Enable the Google Drive API.
  3. [In Google Cloud] Create a Key.
  4. [In Google Drive] Share the Drive folder with the service account.
  5. [In SGP] Connect your data source.

Set up a service account

  1. Navigate to the Google Cloud Console.
  2. Create a new project or select an existing one.
  3. Enable API access for this project.
  4. Navigate to Service Account on the left side menu. Click Create Service Account.
  5. Create the service account:
    1. Fill in the service account details (up to you).
    2. Grant service account roles and access (up to you). We recommend the Owner role.

Enable the Google Drive API

  1. Navigate to the API Console.
  2. Select the project you created earlier.
  3. Open the side menu and select APIs & services, and then select Library.
  4. Search for the Google Drive API in the Library.
  5. Click ENABLE.

See more detailed instructions from Google here.

Create a Key

  1. Select the created service account.
  2. Navigate to the KEYS tab.
  3. Select ADD KEY, and create a new JSON key.
  4. The private key will be saved to your computer. The credentials will be used later when connecting the data source.

Share the Drive folder with the service account

Share the Drive folder you want to connect with the service account you just created, using the service account's email address (the client_email value in the key file).

Connect your data source

Information from the Key created

We will need the following fields from the JSON key file below: client_email, client_id, private_key, and token_uri.

For the private key, be sure to copy and paste the entire string in the file, including the “BEGIN PRIVATE KEY” and “END PRIVATE KEY” sections.

{
  "type": "service_account",
  "project_id": "sgp-g-drive-data-source-test",
  "private_key_id": "9ba310c026227509efd6fa5917d1b8d971791811",
  "private_key": "-----BEGIN PRIVATE KEY-----\n*****redacted*****\n-----END PRIVATE KEY-----\n",
  "client_email": "scaleai-service-account@sgp-g-drive-data-source-test.iam.gserviceaccount.com",
  "client_id": "118190969327709508303",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/scaleai-service-account%40sgp-g-drive-data-source-test.iam.gserviceaccount.com",
  "universe_domain": "googleapis.com"
}
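
To avoid copy-and-paste mistakes, the four fields can be pulled out of the key file programmatically. The following is a minimal sketch: the helper name `extract_connection_fields` is hypothetical, and the sample key contents are made up.

```python
import json

REQUIRED_FIELDS = ("client_email", "client_id", "private_key", "token_uri")


def extract_connection_fields(key_json: str) -> dict:
    """Return only the fields needed to connect, checking the private key is complete."""
    key = json.loads(key_json)
    missing = [field for field in REQUIRED_FIELDS if field not in key]
    if missing:
        raise KeyError(f"key file is missing fields: {missing}")
    private_key = key["private_key"]
    if "BEGIN PRIVATE KEY" not in private_key or "END PRIVATE KEY" not in private_key:
        raise ValueError("private_key must include the BEGIN and END PRIVATE KEY sections")
    return {field: key[field] for field in REQUIRED_FIELDS}


# Made-up key contents for illustration; a real key would be read from the downloaded file.
sample_key = json.dumps({
    "type": "service_account",
    "private_key": "-----BEGIN PRIVATE KEY-----\nREDACTED\n-----END PRIVATE KEY-----\n",
    "client_email": "example@example-project.iam.gserviceaccount.com",
    "client_id": "000000000000000000000",
    "token_uri": "https://oauth2.googleapis.com/token",
})
fields = extract_connection_fields(sample_key)
print(sorted(fields))
```

Parsing the file with a JSON library also turns the escaped `\n` sequences in private_key into real line breaks, which is the form the key normally takes outside of JSON.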

Google Drive Folder ID
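
The folder ID is normally the final path segment of the folder's URL when the folder is open in Google Drive, for example https://drive.google.com/drive/folders/<folder_id>. As a small sketch, the hypothetical helper below pulls the ID out of such a URL. The sample URL and ID are made up.

```python
import re
from typing import Optional

# The ID follows "/folders/" and consists of letters, digits, underscores, and hyphens.
FOLDER_ID_RE = re.compile(r"/folders/([A-Za-z0-9_-]+)")


def extract_folder_id(url: str) -> Optional[str]:
    """Return the folder ID from a Google Drive folder URL, or None if none is found."""
    match = FOLDER_ID_RE.search(url)
    return match.group(1) if match else None


print(extract_folder_id("https://drive.google.com/drive/folders/1AbC_dEf-123?usp=sharing"))
```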

Connecting Data Source

With the information above, you can connect your data source.

Scale AI stores only an encrypted version of your credentials, which is used to read from the shared folders.

Azure Blob Storage

To connect an Azure Storage container, you need to:

  1. Identify the Container URL.
  2. Generate a shared access signature (SAS) and grant SGP restricted access to the Azure Storage container.
  3. Connect your data source.

Locate the Container URL

  1. Go to Microsoft Azure and locate your containers.
  2. Select the container you have access to. If you don’t have a container yet, create one and upload data into it. Click on the three dots on the right-hand side and select Container Properties.
  3. Locate the container_url on this screen.

Generate SAS

  1. Navigate back to the list of containers.
  2. Click on Generate SAS. This creates a URI that grants SGP restricted access to the selected Azure Storage container. To access the data in this container, we need Read and List privileges.
  3. After generation, you will see a Blob SAS Token and a Blob SAS URL. To connect a data source, you will need the Blob SAS Token.

Connect your data source

Add the Blob SAS Token and Container URL from the previous steps and connect your data source.
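
A SAS token is a URL query string, and its `sp` parameter lists the granted permissions (`r` for Read, `l` for List). The sketch below checks a token for the two required privileges before it is pasted into the form. The helper names and the sample token are hypothetical, and the signature value is a placeholder.

```python
from typing import List
from urllib.parse import parse_qs


def missing_sas_permissions(sas_token: str, required: str = "rl") -> List[str]:
    """Return the required permission letters that the token's sp parameter does not grant."""
    params = parse_qs(sas_token.lstrip("?"))
    granted = params.get("sp", [""])[0]
    return [letter for letter in required if letter not in granted]


def build_container_sas_url(container_url: str, sas_token: str) -> str:
    """Join the Container URL and the SAS token into one URL."""
    return f"{container_url.rstrip('/')}?{sas_token.lstrip('?')}"


# Made-up token for illustration only
token = "sp=rl&st=2024-01-01T00:00:00Z&se=2024-01-02T00:00:00Z&sv=2022-11-02&sr=c&sig=PLACEHOLDER"
print(missing_sas_permissions(token))
print(build_container_sas_url("https://example.blob.core.windows.net/my-container", token))
```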

MS SharePoint

In order to connect a SharePoint data source, you need to:

  1. Create a new App Registration.
  2. Grant SharePoint read access to the App Registration.
  3. Create a Client Secret for the App Registration.
  4. Obtain the SharePoint site ID.
  5. Connect the Data Source.

Create a new App Registration

  1. Navigate to Microsoft Entra ID in the Azure Portal under the organization that owns the SharePoint site you would like to use as a data source.
  2. Go to App registrations in the left-hand sidebar and click New registration.
  3. Enter a name for the application and select Accounts in any organizational directory for Supported account types.
  4. After creating the new registration, navigate to the registration’s Overview page to find the respective client_id and tenant_id.

Grant SharePoint read access to the App Registration

  1. Navigate to API Permissions in the sidebar.
  2. Click Add permission.
    1. Under Select an API choose Microsoft Graph.
    2. Select Application permissions for permissions type.
    3. Check Sites.Read.All under Sites.

Note: You may need to reach out to an organization admin to have this permission request approved.

Create a Client Secret for the App Registration

  1. Navigate to Certificates and Secrets in the sidebar.
  2. Click New client secret and choose a name and expiration date.

Note: You will need to update the client_secret for the data source after the expiration date.

Obtain the SharePoint site ID

  1. Navigate to the SharePoint admin control center and find the site that you want to use as a data source.
  2. Select the site and extract the site_id from the URL. It appears at the end of the URL path after /SiteDetails.
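
As a small sketch of the extraction in step 2, the hypothetical helper below returns whatever follows /SiteDetails/ in the URL. The sample admin URL and the all-zero site ID are made up, and the exact URL layout in your tenant may differ.

```python
def extract_site_id(admin_url: str) -> str:
    """Return the segment that follows /SiteDetails/ in a SharePoint admin center URL."""
    _, found, tail = admin_url.partition("/SiteDetails/")
    if not found or not tail:
        raise ValueError("URL does not contain a site ID after /SiteDetails/")
    # Drop anything after the ID, such as a further path segment or a query string
    return tail.split("/")[0].split("?")[0]


# Made-up URL for illustration only
url = "https://example-admin.sharepoint.com/admin#/siteManagement/:/SiteDetails/00000000-0000-0000-0000-000000000000"
print(extract_site_id(url))
```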

Connect the Data Source

Populate the data source fields with the Client Secret, Client ID, Tenant ID, and Site ID from the previous steps.
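
To show how these four values fit together, the sketch below builds (without sending) the OAuth 2.0 client credentials token request for Microsoft Graph and the Graph URL for the site. This is an assumption about how such credentials are typically used, not a description of what SGP does internally. The helper names and sample values are hypothetical.

```python
from typing import Tuple
from urllib.parse import urlencode


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> Tuple[str, str]:
    """Return the token endpoint URL and form-encoded body for the client credentials flow."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://graph.microsoft.com/.default",
    })
    return url, body


def build_site_url(site_id: str) -> str:
    """Return the Microsoft Graph URL for a SharePoint site."""
    return f"https://graph.microsoft.com/v1.0/sites/{site_id}"


# Made-up values for illustration only
token_url, token_body = build_token_request("my-tenant-id", "my-client-id", "my-secret")
print(token_url)
print(build_site_url("00000000-0000-0000-0000-000000000000"))
```

If a manual check with these requests fails, the usual causes are an expired client secret or a Sites.Read.All permission that has not yet been approved by an admin.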