Why Use Train?
Each cloud ML platform has its own job submission API, container registry, storage format, and authentication model. Teams that want cloud flexibility end up maintaining separate training pipelines per cloud, or locking into one provider early. Train solves this by sitting in front of all three clouds with a single API surface. You submit a job with an image and a command; Train takes care of pushing the image to the right registry, mounting your storage, and translating the job config into the cloud’s native format. Your training code doesn’t know or care which cloud it’s running on.

- Single API for all backends: `vertex-ai`, `sagemaker`, and `azure-ml` are selectable per job
- Unified image registry: Push Docker images once to the registry proxy; Train resolves them to the correct cloud registry at submission time
- Cross-cloud storage mounts: Mount GCS, S3, or Azure Blob Storage with `STORAGE_MOUNT_*_PATH` env vars, using the same interface everywhere
- Consistent job lifecycle: `PENDING → IN_PROGRESS → COMPLETED / FAILED / CANCELED / EXPIRED` across all backends
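To make the "single API" point concrete, here is a minimal sketch of what a backend-agnostic job submission could look like. The payload shape and `build_job_payload` helper are illustrative assumptions, not Train's real SDK; the only values taken from the docs are the three `backend` identifiers.

```python
# Hypothetical sketch: one job definition reused across all three
# backends, with only the "backend" field changing per submission.
# The payload shape is an assumption, not Train's actual API schema.

def build_job_payload(backend: str, image: str, command: list[str]) -> dict:
    """Assemble a backend-agnostic job config; only `backend` varies."""
    assert backend in {"vertex-ai", "sagemaker", "azure-ml"}
    return {
        "backend": backend,
        "image": image,
        "command": command,
    }

# The same image and command work on every backend unchanged.
jobs = [
    build_job_payload(b, "my-registry/trainer:v1", ["python", "train.py"])
    for b in ("vertex-ai", "sagemaker", "azure-ml")
]
```

The point of the sketch is that switching clouds is a one-field change; nothing about the image or command is cloud-specific.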
How It Works
Train Service
The Train service has two components: a gateway that accepts job submissions and returns a job ID, and a background poller that tracks running jobs and updates their status.

Supported Backends
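The poller's core task is translating each cloud's native job states into Train's unified lifecycle. The sketch below assumes some representative native state names (e.g. Vertex AI's `JOB_STATE_*` and SageMaker's `InProgress`); the mapping is illustrative, not an authoritative or exhaustive list.

```python
# Illustrative state translation, as a background poller might do it.
# Native state names here are examples; only the unified statuses
# (PENDING, IN_PROGRESS, COMPLETED, FAILED, ...) come from the docs.

NATIVE_TO_UNIFIED = {
    "vertex-ai": {
        "JOB_STATE_RUNNING": "IN_PROGRESS",
        "JOB_STATE_SUCCEEDED": "COMPLETED",
        "JOB_STATE_FAILED": "FAILED",
    },
    "sagemaker": {
        "InProgress": "IN_PROGRESS",
        "Completed": "COMPLETED",
        "Failed": "FAILED",
    },
    "azure-ml": {
        "Running": "IN_PROGRESS",
        "Completed": "COMPLETED",
        "Failed": "FAILED",
    },
}

def unified_status(backend: str, native_state: str) -> str:
    """Map a cloud-native job state to Train's unified lifecycle."""
    # Unknown or pre-start states default to PENDING.
    return NATIVE_TO_UNIFIED[backend].get(native_state, "PENDING")
```

Because clients only ever see the unified statuses, code that watches jobs never needs per-cloud branching.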
| Backend | Cloud service | `{backend}` value |
|---|---|---|
| Vertex AI | GCP Custom Jobs | `vertex-ai` |
| SageMaker | AWS Training Jobs | `sagemaker` |
| Azure ML | Azure ML Command Jobs | `azure-ml` |
Container Registry
Each workspace has a Docker-compatible registry. You push images to it; Train ensures the exact image you pushed is what runs on the cloud backend, even if the tag is updated later.

Storage Mounts
Training containers on all clouds receive `STORAGE_MOUNT_0_PATH`, `STORAGE_MOUNT_1_PATH`, etc. as environment variables pointing to mounted directories. Your training script reads from `os.environ["STORAGE_MOUNT_0_PATH"]`, with no cloud-specific paths or SDKs needed.
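A training script can enumerate these variables generically. The helper below is a small assumption of mine, not part of Train's SDK; it only relies on the `STORAGE_MOUNT_{i}_PATH` naming convention described above.

```python
import os

# Hypothetical helper that collects the STORAGE_MOUNT_{i}_PATH
# variables Train injects into the training container.

def storage_mounts() -> list[str]:
    """Return mount paths in index order, stopping at the first gap."""
    paths = []
    i = 0
    while (p := os.environ.get(f"STORAGE_MOUNT_{i}_PATH")) is not None:
        paths.append(p)
        i += 1
    return paths
```

A script might then read its dataset from `storage_mounts()[0]` regardless of which cloud mounted it.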
Source URIs use cloud-native formats:
| Cloud | URI format |
|---|---|
| GCP | `gs://bucket/path` |
| AWS | `s3://bucket/path` |
| Azure | `azureml://datastores/name/paths/path` |
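Since each cloud has a distinct URI scheme, the target cloud can be inferred from the prefix alone. This sketch (the function name and return labels are my own, not Train's) just encodes the table above:

```python
# Infer the target cloud from a storage source URI's scheme.
# URI formats follow the table above; the function is illustrative.

def cloud_for_uri(uri: str) -> str:
    if uri.startswith("gs://"):
        return "gcp"
    if uri.startswith("s3://"):
        return "aws"
    if uri.startswith("azureml://"):
        return "azure"
    raise ValueError(f"unrecognized storage URI: {uri}")
```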
Common Use Cases
- Improve agent accuracy on your domain: Fine-tune a foundation model on your proprietary data so your agents produce more reliable, on-brand responses and make fewer errors on the tasks that matter to your business.
- Optimize agent performance with feedback: Use evaluations and human feedback collected in SGP to run reinforcement learning pipelines that directly improve how your agents behave over time.
- Reduce inference cost and latency: Distill a large general-purpose model into a smaller one trained specifically for your use case, so agents respond faster and cost less to run at scale.
Authentication
Train uses the same authentication as other SGP services. The container registry accepts your SGP credentials (`account_id:api_key`) during `docker login`, then issues a short-lived token for push and pull operations.

