Train is SGP’s multi-cloud training job service. It provides a unified REST API for submitting GPU training jobs to Vertex AI (GCP), SageMaker (AWS), and Azure ML, handling cloud-specific job submission, image resolution, storage mounting, and status tracking transparently.

Why Use Train?

Each cloud ML platform has its own job submission API, container registry, storage format, and authentication model. Teams that want cloud flexibility end up maintaining separate training pipelines per cloud, or locking into one provider early. Train solves this by sitting in front of all three clouds with a single API surface. You submit a job with an image and a command; Train takes care of pushing the image to the right registry, mounting your storage, and translating the job config into the cloud’s native format. Your training code doesn’t know or care which cloud it’s running on.
  • Single API for all backends: vertex-ai, sagemaker, and azure-ml are selectable per job
  • Unified image registry: Push Docker images once to the registry proxy; Train resolves them to the correct cloud registry at submission time
  • Cross-cloud storage mounts: Mount GCS, S3, or Azure Blob Storage with STORAGE_MOUNT_*_PATH env vars, using the same interface everywhere
  • Consistent job lifecycle: PENDING → IN_PROGRESS → COMPLETED / FAILED / CANCELED / EXPIRED across all backends
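To make the single-API-surface idea concrete, here is a minimal sketch of assembling a job submission payload. The field names (`backend`, `image`, `command`, `storage_mounts`) are illustrative assumptions, not the documented request schema; only the backend values and the general shape (image + command + mounts) come from this page.

```python
# Hypothetical sketch of a Train job submission payload. Field names
# are illustrative assumptions, not the documented request schema.

VALID_BACKENDS = {"vertex-ai", "sagemaker", "azure-ml"}

def build_job_payload(backend, image, command, storage_mounts=()):
    """Assemble a cross-cloud training job request.

    The same payload shape is used for every backend; only the
    "backend" field selects vertex-ai, sagemaker, or azure-ml.
    """
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unsupported backend: {backend}")
    return {
        "backend": backend,
        "image": image,
        "command": list(command),
        "storage_mounts": [{"source_uri": uri} for uri in storage_mounts],
    }

payload = build_job_payload(
    "vertex-ai",
    "my-workspace/trainer:v1",
    ["python", "train.py"],
    storage_mounts=["gs://my-bucket/datasets/train"],
)
```

Switching clouds would mean changing only the `backend` string and the mount URI scheme; the rest of the payload stays identical.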

How It Works

Train Service

The Train service has two components: a gateway that accepts job submissions and returns a job ID, and a background poller that tracks running jobs and updates their status.
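The poller's job can be sketched as a simple loop over the documented lifecycle states. The `fetch_status` callable below stands in for a status lookup against a hypothetical job-status endpoint; the state names are the ones listed on this page, but the function and endpoint shape are assumptions.

```python
# Illustrative sketch of the background poller's status-tracking loop.
# fetch_status stands in for a status lookup on a hypothetical endpoint;
# the state names match the documented lifecycle.

TERMINAL = {"COMPLETED", "FAILED", "CANCELED", "EXPIRED"}

def poll_until_terminal(fetch_status, job_id, max_polls=100):
    """Poll a job's state until it reaches a terminal lifecycle state."""
    for _ in range(max_polls):
        state = fetch_status(job_id)
        if state in TERMINAL:
            return state
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")

# Simulated backend progression: PENDING -> IN_PROGRESS -> COMPLETED.
states = iter(["PENDING", "IN_PROGRESS", "COMPLETED"])
result = poll_until_terminal(lambda _job_id: next(states), "job-123")
```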

Supported Backends

Backend      Cloud service             {backend} value
Vertex AI    GCP Custom Jobs           vertex-ai
SageMaker    AWS Training Jobs         sagemaker
Azure ML     Azure ML Command Jobs     azure-ml

Container Registry

Each workspace has a Docker-compatible registry. You push images to it; Train ensures the exact image you pushed is what runs on the cloud backend, even if the tag is updated later.
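One common way a registry can guarantee that the exact image you pushed is what runs, regardless of later tag updates, is digest pinning: resolving the tag to the content hash of the image manifest at submission time. The sketch below shows that general mechanism; the function names are illustrative, and this page does not specify that Train implements pinning exactly this way.

```python
# Sketch of tag-to-digest pinning, a common way a registry proxy can
# guarantee the exact bytes pushed under a tag are what later runs.
# Function names are illustrative, not Train's actual API.
import hashlib

def manifest_digest(manifest_bytes):
    # OCI image digests are the sha256 of the raw manifest bytes.
    return "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()

def pin_image(repository, manifest_bytes):
    """Return an immutable image reference that survives tag updates."""
    return f"{repository}@{manifest_digest(manifest_bytes)}"

ref = pin_image("my-workspace/trainer", b'{"schemaVersion": 2}')
# ref has the form "my-workspace/trainer@sha256:..." and keeps pointing
# at the same image even if a tag like "v1" is later repointed.
```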

Storage Mounts

Training containers on all clouds receive STORAGE_MOUNT_0_PATH, STORAGE_MOUNT_1_PATH, etc. as environment variables pointing to mounted directories. Your training script reads from os.environ["STORAGE_MOUNT_0_PATH"], with no cloud-specific paths or SDKs needed. Source URIs use cloud-native formats:
Cloud    URI format
GCP      gs://bucket/path
AWS      s3://bucket/path
Azure    azureml://datastores/name/paths/path
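A training script can discover its mounts by scanning the `STORAGE_MOUNT_{i}_PATH` variables in order. The environment variable names come from this page; the helper function and the sample mount paths below are illustrative.

```python
# Discovering mounted directories via the STORAGE_MOUNT_*_PATH
# environment variables described above. The helper name and the
# sample paths are illustrative, not part of Train's API.
import os

def discover_mounts():
    """Collect STORAGE_MOUNT_0_PATH, STORAGE_MOUNT_1_PATH, ... in order."""
    mounts = []
    i = 0
    while (path := os.environ.get(f"STORAGE_MOUNT_{i}_PATH")) is not None:
        mounts.append(path)
        i += 1
    return mounts

# Simulate the environment a training container would receive:
os.environ["STORAGE_MOUNT_0_PATH"] = "/mnt/data/train"
os.environ["STORAGE_MOUNT_1_PATH"] = "/mnt/data/eval"
mounts = discover_mounts()
```

Because the script only reads environment variables, the same code runs unchanged whether the underlying source is GCS, S3, or Azure Blob Storage.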

Common Use Cases

  • Improve agent accuracy on your domain: Fine-tune a foundation model on your proprietary data so your agents produce more reliable, on-brand responses and make fewer errors on the tasks that matter to your business.
  • Optimize agent performance with feedback: Use evaluations and human feedback collected in SGP to run reinforcement learning pipelines that directly improve how your agents behave over time.
  • Reduce inference cost and latency: Distill a large general-purpose model into a smaller one trained specifically for your use case, so agents respond faster and cost less to run at scale.

Authentication

Train uses the same authentication as other SGP services:
x-api-key: <your_api_key>
x-selected-account-id: <your_account_id>
The registry proxy accepts Docker Basic auth (account_id:api_key) during docker login, then issues a short-lived token for push and pull operations.
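The two header names and the `account_id:api_key` Basic-auth pairing above can be sketched as follows; the credential values are placeholders.

```python
# Building the SGP auth headers and the Docker Basic auth string
# described above. Credential values are placeholders.
import base64

def sgp_headers(api_key, account_id):
    """Headers sent with every Train API request."""
    return {
        "x-api-key": api_key,
        "x-selected-account-id": account_id,
    }

def docker_basic_auth(account_id, api_key):
    """docker login sends "account_id:api_key", base64-encoded."""
    token = base64.b64encode(f"{account_id}:{api_key}".encode()).decode()
    return f"Basic {token}"

headers = sgp_headers("sk-example", "acct-123")
auth = docker_basic_auth("acct-123", "sk-example")
```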

Getting Started

To begin using Train, you’ll need a Scale account with SGP access and credentials for at least one cloud backend.