Train is SGP’s multi-cloud training job service. It provides a unified REST API for submitting GPU training jobs to Vertex AI (GCP), SageMaker (AWS), and Azure ML, handling cloud-specific job submission, image resolution, storage mounting, and status tracking transparently.

Why Use Train?

Each cloud ML platform has its own job submission API, container registry, storage format, and authentication model. Teams that want cloud flexibility end up maintaining separate training pipelines per cloud, or locking into one provider early. Train solves this by sitting in front of all three clouds with a single API surface. You submit a job with an image and a command; Train takes care of pushing the image to the right registry, mounting your storage, and translating the job config into the cloud’s native format. Your training code doesn’t know or care which cloud it’s running on.
  • Single API for all backends: vertex-ai, sagemaker, and azure-ml are selectable per job
  • Unified image registry: Push Docker images once to the registry proxy; Train resolves them to the correct cloud registry at submission time
  • Cross-cloud storage mounts: Mount GCS, S3, or Azure Blob Storage with STORAGE_MOUNT_*_PATH env vars, using the same interface everywhere
  • Consistent job lifecycle: PENDING → IN_PROGRESS → COMPLETED / FAILED / CANCELED / EXPIRED across all backends
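To make the single-API-surface idea concrete, here is a minimal sketch of assembling a job submission payload. The field names (`backend`, `image`, `command`, `storage_mounts`) are illustrative assumptions, not the documented request schema; only the backend values and the general shape (image + command + mounts) come from this page.

```python
# Hypothetical sketch of a Train job submission payload. Field names
# are illustrative assumptions, not the documented request schema.

VALID_BACKENDS = {"vertex-ai", "sagemaker", "azure-ml"}

def build_job_payload(backend, image, command, storage_mounts=()):
    """Assemble a cross-cloud training job request.

    The same payload shape is used for every backend; only the
    "backend" field selects vertex-ai, sagemaker, or azure-ml.
    """
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unsupported backend: {backend}")
    return {
        "backend": backend,
        "image": image,
        "command": list(command),
        "storage_mounts": [{"source_uri": uri} for uri in storage_mounts],
    }

payload = build_job_payload(
    "vertex-ai",
    "my-workspace/trainer:v1",
    ["python", "train.py"],
    storage_mounts=["gs://my-bucket/datasets/train"],
)
```

Switching clouds would mean changing only the `backend` string and the mount URI scheme; the rest of the payload stays identical.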

How It Works

Train Service

The Train service has two components: a gateway that accepts job submissions and returns a job ID, and a background poller that tracks running jobs and updates their status.
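The poller's job can be sketched as a simple loop over the documented lifecycle states. The `fetch_status` callable below stands in for a status lookup against a hypothetical job-status endpoint; the state names are the ones listed on this page, but the function and endpoint shape are assumptions.

```python
# Illustrative sketch of the background poller's status-tracking loop.
# fetch_status stands in for a status lookup on a hypothetical endpoint;
# the state names match the documented lifecycle.

TERMINAL = {"COMPLETED", "FAILED", "CANCELED", "EXPIRED"}

def poll_until_terminal(fetch_status, job_id, max_polls=100):
    """Poll a job's state until it reaches a terminal lifecycle state."""
    for _ in range(max_polls):
        state = fetch_status(job_id)
        if state in TERMINAL:
            return state
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")

# Simulated backend progression: PENDING -> IN_PROGRESS -> COMPLETED.
states = iter(["PENDING", "IN_PROGRESS", "COMPLETED"])
result = poll_until_terminal(lambda _job_id: next(states), "job-123")
```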

Supported Backends

Backend      Cloud service             {backend} value
Vertex AI    GCP Custom Jobs           vertex-ai
SageMaker    AWS Training Jobs         sagemaker
Azure ML     Azure ML Command Jobs     azure-ml

Container Registry

Each workspace has a Docker-compatible registry. You push images to it; Train ensures the exact image you pushed is what runs on the cloud backend, even if the tag is updated later.
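One common way a registry can guarantee that the exact image you pushed is what runs, regardless of later tag updates, is digest pinning: resolving the tag to the content hash of the image manifest at submission time. The sketch below shows that general mechanism; the function names are illustrative, and this page does not specify that Train implements pinning exactly this way.

```python
# Sketch of tag-to-digest pinning, a common way a registry proxy can
# guarantee the exact bytes pushed under a tag are what later runs.
# Function names are illustrative, not Train's actual API.
import hashlib

def manifest_digest(manifest_bytes):
    # OCI image digests are the sha256 of the raw manifest bytes.
    return "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()

def pin_image(repository, manifest_bytes):
    """Return an immutable image reference that survives tag updates."""
    return f"{repository}@{manifest_digest(manifest_bytes)}"

ref = pin_image("my-workspace/trainer", b'{"schemaVersion": 2}')
# ref has the form "my-workspace/trainer@sha256:..." and keeps pointing
# at the same image even if a tag like "v1" is later repointed.
```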

Storage Mounts

Training containers on all clouds receive STORAGE_MOUNT_0_PATH, STORAGE_MOUNT_1_PATH, etc. as environment variables pointing to mounted directories. Your training script reads from os.environ["STORAGE_MOUNT_0_PATH"], with no cloud-specific paths or SDKs needed. Source URIs use cloud-native formats:
Cloud    URI format
GCP      gs://bucket/path
AWS      s3://bucket/path
Azure    azureml://datastores/name/paths/path
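A training script can discover its mounts by scanning the `STORAGE_MOUNT_{i}_PATH` variables in order. The environment variable names come from this page; the helper function and the sample mount paths below are illustrative.

```python
# Discovering mounted directories via the STORAGE_MOUNT_*_PATH
# environment variables described above. The helper name and the
# sample paths are illustrative, not part of Train's API.
import os

def discover_mounts():
    """Collect STORAGE_MOUNT_0_PATH, STORAGE_MOUNT_1_PATH, ... in order."""
    mounts = []
    i = 0
    while (path := os.environ.get(f"STORAGE_MOUNT_{i}_PATH")) is not None:
        mounts.append(path)
        i += 1
    return mounts

# Simulate the environment a training container would receive:
os.environ["STORAGE_MOUNT_0_PATH"] = "/mnt/data/train"
os.environ["STORAGE_MOUNT_1_PATH"] = "/mnt/data/eval"
mounts = discover_mounts()
```

Because the script only reads environment variables, the same code runs unchanged whether the underlying source is GCS, S3, or Azure Blob Storage.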

Common Use Cases

  • Improve agent accuracy on your domain: Fine-tune a foundation model on your proprietary data so your agents produce more reliable, on-brand responses and make fewer errors on the tasks that matter to your business.
  • Optimize agent performance with feedback: Use evaluations and human feedback collected in SGP to run reinforcement learning pipelines that directly improve how your agents behave over time.
  • Reduce inference cost and latency: Distill a large general-purpose model into a smaller one trained specifically for your use case, so agents respond faster and cost less to run at scale.

Authentication

Train uses the same authentication as other SGP services:
x-api-key: <your_api_key>
x-selected-account-id: <your_account_id>
The registry proxy accepts Docker Basic auth (account_id:api_key) during docker login, then issues a short-lived token for push and pull operations.
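The two header names and the `account_id:api_key` Basic-auth pairing above can be sketched as follows; the credential values are placeholders.

```python
# Building the SGP auth headers and the Docker Basic auth string
# described above. Credential values are placeholders.
import base64

def sgp_headers(api_key, account_id):
    """Headers sent with every Train API request."""
    return {
        "x-api-key": api_key,
        "x-selected-account-id": account_id,
    }

def docker_basic_auth(account_id, api_key):
    """docker login sends "account_id:api_key", base64-encoded."""
    token = base64.b64encode(f"{account_id}:{api_key}".encode()).decode()
    return f"Basic {token}"

headers = sgp_headers("sk-example", "acct-123")
auth = docker_basic_auth("acct-123", "sk-example")
```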

Getting Started

To begin using Train, you’ll need a Scale account with SGP access and credentials for at least one cloud backend.