This guide walks you through submitting a training job using the Train API.
Prerequisites
Before you begin, ensure you have:
- A valid Scale account with SGP access
- Your API key and account ID:
export SGP_API_KEY="your_api_key"
export SGP_ACCOUNT_ID="your_account_id"
- A Docker image pushed to the workspace registry (see Custom Images)
Submit a Job
Jobs are submitted to POST /v1/{backend}/jobs. The {backend} is one of vertex-ai, sagemaker, or azure-ml. The job_config structure mirrors each cloud’s native job API.
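If you prefer to submit from code rather than curl, the same request can be sketched in Python. This is a minimal sketch using only the standard library; the base URL, header names, and backend names come from the examples in this guide, while the function name `submit_job` is illustrative:

```python
import json
import os
import urllib.request

SUPPORTED_BACKENDS = {"vertex-ai", "sagemaker", "azure-ml"}

def submit_job(base_url, backend, job_request):
    """POST a job request to /v1/{backend}/jobs and return the job object."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend}")
    req = urllib.request.Request(
        f"{base_url}/v1/{backend}/jobs",
        data=json.dumps(job_request).encode(),
        headers={
            "x-api-key": os.environ.get("SGP_API_KEY", ""),
            "x-selected-account-id": os.environ.get("SGP_ACCOUNT_ID", ""),
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The `job_request` dict takes the same shape as the JSON bodies in the curl examples below.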
Vertex AI
curl -X POST "https://train.your-sgp-deployment-url/v1/vertex-ai/jobs" \
-H "x-api-key: $SGP_API_KEY" \
-H "x-selected-account-id: $SGP_ACCOUNT_ID" \
-H "Content-Type: application/json" \
-d '{
"name": "mnist-training",
"job_config": {
"worker_pool_specs": [{
"machine_spec": {
"machine_type": "g2-standard-8",
"accelerator_type": "NVIDIA_L4",
"accelerator_count": 1
},
"replica_count": 1,
"container_spec": {
"image_uri": "docker-registry.your-sgp-deployment-url/pytorch-gpu:v1",
"command": ["python3", "train.py"],
"args": ["--epochs", "3"]
}
}],
"scheduling": {"timeout": "86400s"}
},
"storage_mounts": [
{"source_uri": "gs://my-bucket/mnist-data", "mount_path": "/mnt/data", "read_only": true},
{"source_uri": "gs://my-bucket/output/run-1", "mount_path": "/mnt/output", "read_only": false}
]
}'
Example response:
{
"id": "87cae349-cd5f-4202-b674-f1652880a8f5",
"name": "mnist-training",
"job_type": "vertex_ai",
"status": "PENDING",
"cloud_job_identifier": {
"job_name": "projects/my-project/locations/us-east1/customJobs/1234567890"
},
"storage_mounts": [...],
"account_id": "693af856dde70fd6614ab64c",
"created_at": "2026-03-11T08:12:25.542507Z",
"updated_at": "2026-03-11T08:12:25.542507Z"
}
worker_pool_specs is an array; add more worker pools for distributed training. Note that command and args are separate arrays.
SageMaker
curl -X POST "https://train.your-sgp-deployment-url/v1/sagemaker/jobs" \
-H "x-api-key: $SGP_API_KEY" \
-H "x-selected-account-id: $SGP_ACCOUNT_ID" \
-H "Content-Type: application/json" \
-d '{
"name": "mnist-training",
"job_config": {
"image": "docker-registry.your-sgp-deployment-url/pytorch-gpu:v1",
"command": ["python3", "/app/train.py"],
"resource_config": {
"instance_count": 1,
"instance_type": "ml.g4dn.xlarge",
"volume_size_in_gb": 50
},
"hyperparameters": {"learning_rate": "0.001"}
},
"storage_mounts": [
{"source_uri": "s3://my-bucket/mnist-data", "mount_path": "/mnt/data", "read_only": true},
{"source_uri": "s3://my-bucket/output/run-1", "mount_path": "/mnt/output", "read_only": false}
]
}'
Example response:
{
"id": "3f2a1b9c-88de-4a12-b901-c2e5f7d04321",
"name": "mnist-training",
"job_type": "sagemaker",
"status": "PENDING",
"cloud_job_identifier": {
"training_job_name": "mnist-training-2026-03-11-08-12-25"
},
"storage_mounts": [...],
"account_id": "693af856dde70fd6614ab64c",
"created_at": "2026-03-11T08:12:25.542507Z",
"updated_at": "2026-03-11T08:12:25.542507Z"
}
instance_type determines the GPU: ml.g4dn.xlarge = T4, ml.p4de.24xlarge = 8×A100 80GB.
Azure ML
curl -X POST "https://train.your-sgp-deployment-url/v1/azure-ml/jobs" \
-H "x-api-key: $SGP_API_KEY" \
-H "x-selected-account-id: $SGP_ACCOUNT_ID" \
-H "Content-Type: application/json" \
-d '{
"name": "mnist-training",
"job_config": {
"command": "python train.py --epochs 3",
"environment": {
"image": "docker-registry.your-sgp-deployment-url/pytorch-gpu:v1"
},
"compute": "gpu-cluster",
"resources": {
"instance_type": "Standard_NC24ads_A100_v4",
"instance_count": 1
},
"environment_variables": {"LEARNING_RATE": "0.001"}
},
"storage_mounts": [
{"source_uri": "azureml://datastores/traindata/paths/mnist", "mount_path": "/mnt/data", "read_only": true},
{"source_uri": "azureml://datastores/trainoutput/paths/run-1", "mount_path": "/mnt/output", "read_only": false}
]
}'
Example response:
{
"id": "a1b2c3d4-ef56-7890-abcd-ef1234567890",
"name": "mnist-training",
"job_type": "azure_ml",
"status": "PENDING",
"cloud_job_identifier": {
"job_name": "heroic_grass_22zggyxsxp"
},
"storage_mounts": [...],
"account_id": "693af856dde70fd6614ab64c",
"created_at": "2026-03-11T08:12:25.542507Z",
"updated_at": "2026-03-11T08:12:25.542507Z"
}
command is a single shell string (not an array). compute can be a named cluster or "serverless".
Job Response
All backends return the same job object shape:
- id: Unique job identifier. Use it to poll status or cancel the job.
- status: Current job state. Transitions through PENDING → IN_PROGRESS → COMPLETED / FAILED / CANCELED / EXPIRED.
- cloud_job_identifier: The cloud provider's native job identifier. Its shape varies by backend: job_name for Vertex AI and Azure ML, training_job_name for SageMaker.
- Failure details: populated on failure; contains failure_reason with the error message from the cloud provider.
Check Status
curl "https://train.your-sgp-deployment-url/v1/jobs/$JOB_ID" \
-H "x-api-key: $SGP_API_KEY" \
-H "x-selected-account-id: $SGP_ACCOUNT_ID"
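Polling until the job leaves the PENDING/IN_PROGRESS states is a common pattern. A minimal sketch using only the standard library; the terminal states come from the status transitions above, and `wait_for_job` and the poll interval are illustrative choices:

```python
import json
import os
import time
import urllib.request

BASE_URL = "https://train.your-sgp-deployment-url"  # replace with your deployment
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED", "EXPIRED"}

def _headers():
    return {
        "x-api-key": os.environ.get("SGP_API_KEY", ""),
        "x-selected-account-id": os.environ.get("SGP_ACCOUNT_ID", ""),
    }

def get_job(job_id):
    """GET /v1/jobs/{id} and return the parsed job object."""
    req = urllib.request.Request(f"{BASE_URL}/v1/jobs/{job_id}", headers=_headers())
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def is_terminal(status):
    return status in TERMINAL_STATES

def wait_for_job(job_id, poll_interval=30):
    """Poll the job until it reaches a terminal state, then return it."""
    while True:
        job = get_job(job_id)
        if is_terminal(job["status"]):
            return job
        time.sleep(poll_interval)
```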
Other Endpoints
List jobs:
curl "https://train.your-sgp-deployment-url/v1/jobs" \
-H "x-api-key: $SGP_API_KEY" \
-H "x-selected-account-id: $SGP_ACCOUNT_ID"
Cancel a job:
curl -X POST "https://train.your-sgp-deployment-url/v1/jobs/$JOB_ID/cancel" \
-H "x-api-key: $SGP_API_KEY" \
-H "x-selected-account-id: $SGP_ACCOUNT_ID"
Returns 409 if the job is already in a terminal state.
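When cancelling from code, the 409 case is worth handling explicitly rather than treating it as an error. A minimal sketch using only the standard library; the helper names are illustrative:

```python
import os
import urllib.error
import urllib.request

def cancel_url(base_url, job_id):
    """Build the cancel endpoint URL for a job."""
    return f"{base_url}/v1/jobs/{job_id}/cancel"

def cancel_job(base_url, job_id):
    """POST to the cancel endpoint. Returns True if cancellation was
    accepted, False if the job was already terminal (HTTP 409)."""
    req = urllib.request.Request(
        cancel_url(base_url, job_id),
        headers={
            "x-api-key": os.environ.get("SGP_API_KEY", ""),
            "x-selected-account-id": os.environ.get("SGP_ACCOUNT_ID", ""),
        },
        method="POST",
    )
    try:
        urllib.request.urlopen(req)
        return True
    except urllib.error.HTTPError as e:
        if e.code == 409:
            return False
        raise
```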
Accessing Mounts in Your Container
Inside the training container, mounted storage is available via environment variables regardless of cloud:
import os

# STORAGE_MOUNT_<N>_PATH corresponds to the Nth entry in the job
# request's storage_mounts array.
data_path = os.environ["STORAGE_MOUNT_0_PATH"]    # first mount: training data
output_path = os.environ["STORAGE_MOUNT_1_PATH"]  # second mount: writable output
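Anything your script writes under the writable output mount ends up in the backing bucket or datastore. A minimal sketch of that pattern; the local-directory fallbacks, the metrics.json filename, and the metric values are all illustrative, not part of the API:

```python
import json
import os

# Inside the container, the platform injects STORAGE_MOUNT_<N>_PATH variables.
# The local-directory fallbacks below are only so this sketch runs anywhere.
data_path = os.environ.get("STORAGE_MOUNT_0_PATH", "/tmp/data")
output_path = os.environ.get("STORAGE_MOUNT_1_PATH", "/tmp/output")

os.makedirs(output_path, exist_ok=True)

# Write artifacts under the writable output mount; they land in the
# bucket/datastore backing the second storage_mounts entry.
metrics = {"epochs": 3, "final_loss": 0.05}  # placeholder values
with open(os.path.join(output_path, "metrics.json"), "w") as f:
    json.dump(metrics, f)
```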
Next Steps