This guide walks you through submitting a training job using the Train API.

Prerequisites

Before you begin, ensure you have:
  • A valid Scale account with SGP access
  • Your API key and account ID:
export SGP_API_KEY="your_api_key"
export SGP_ACCOUNT_ID="your_account_id"
  • A Docker image pushed to the workspace registry (see Custom Images)

Submit a Job

Submit a job with POST /v1/{backend}/jobs, where {backend} is one of vertex-ai, sagemaker, or azure-ml. The job_config structure mirrors each cloud’s native job API, so its fields differ by backend. The example below targets Vertex AI:
curl -X POST "https://train.your-sgp-deployment-url/v1/vertex-ai/jobs" \
  -H "x-api-key: $SGP_API_KEY" \
  -H "x-selected-account-id: $SGP_ACCOUNT_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "mnist-training",
    "job_config": {
      "worker_pool_specs": [{
        "machine_spec": {
          "machine_type": "g2-standard-8",
          "accelerator_type": "NVIDIA_L4",
          "accelerator_count": 1
        },
        "replica_count": 1,
        "container_spec": {
          "image_uri": "docker-registry.your-sgp-deployment-url/pytorch-gpu:v1",
          "command": ["python3", "train.py"],
          "args": ["--epochs", "3"]
        }
      }],
      "scheduling": {"timeout": "86400s"}
    },
    "storage_mounts": [
      {"source_uri": "gs://my-bucket/mnist-data", "mount_path": "/mnt/data", "read_only": true},
      {"source_uri": "gs://my-bucket/output/run-1", "mount_path": "/mnt/output", "read_only": false}
    ]
  }'
Response:
{
  "id": "87cae349-cd5f-4202-b674-f1652880a8f5",
  "name": "mnist-training",
  "job_type": "vertex_ai",
  "status": "PENDING",
  "cloud_job_identifier": {
    "job_name": "projects/my-project/locations/us-east1/customJobs/1234567890"
  },
  "storage_mounts": [...],
  "account_id": "693af856dde70fd6614ab64c",
  "created_at": "2026-03-11T08:12:25.542507Z",
  "updated_at": "2026-03-11T08:12:25.542507Z"
}
worker_pool_specs is an array: add further pool entries for distributed training. Within container_spec, command and args are separate arrays of strings, so pass each token as its own element.
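As a rough illustration of adding a second pool, the sketch below assembles a Vertex AI request body in Python and posts it with the standard library. The helper names (make_pool, build_request, submit_job) are hypothetical and not part of the Train API. The layout of a single-replica first pool followed by a worker pool follows Vertex AI's own convention for custom jobs; whether the Train API enforces it is an assumption. storage_mounts is left out for brevity.

```python
import json
import urllib.request


def make_pool(replica_count, image_uri, command, args):
    """Build one worker_pool_specs entry in the shape used by the request above."""
    return {
        "machine_spec": {
            "machine_type": "g2-standard-8",
            "accelerator_type": "NVIDIA_L4",
            "accelerator_count": 1,
        },
        "replica_count": replica_count,
        "container_spec": {
            "image_uri": image_uri,
            "command": list(command),  # executable and script
            "args": list(args),  # arguments stay in their own array
        },
    }


def build_request(name, image_uri, worker_replicas):
    """Assemble a body with a single-replica first pool plus an optional worker pool."""
    command, args = ["python3", "train.py"], ["--epochs", "3"]
    pools = [make_pool(1, image_uri, command, args)]
    if worker_replicas > 0:
        # Assumption: extra replicas go in a second pool, as in Vertex AI custom jobs.
        pools.append(make_pool(worker_replicas, image_uri, command, args))
    return {
        "name": name,
        "job_config": {
            "worker_pool_specs": pools,
            "scheduling": {"timeout": "86400s"},
        },
    }


def submit_job(base_url, backend, body, api_key, account_id,
               opener=urllib.request.urlopen):
    """POST the body to /v1/{backend}/jobs and return the decoded job object."""
    request = urllib.request.Request(
        f"{base_url}/v1/{backend}/jobs",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "x-api-key": api_key,
            "x-selected-account-id": account_id,
            "Content-Type": "application/json",
        },
    )
    with opener(request) as response:
        return json.load(response)
```

Calling submit_job("https://train.your-sgp-deployment-url", "vertex-ai", build_request("mnist-distributed", image_uri, 3), api_key, account_id) would send one primary replica and three workers. The opener parameter exists only so the HTTP call can be stubbed out in tests.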

Job Response

All backends return the same job object shape:
  • id (string, required): Unique job identifier. Use this to poll status or cancel the job.
  • status (string, required): Current job state. Transitions through PENDING → IN_PROGRESS → COMPLETED / FAILED / CANCELED / EXPIRED.
  • cloud_job_identifier (object): The cloud provider’s native job identifier. Shape varies by backend: job_name for Vertex AI and Azure ML, training_job_name for SageMaker.
  • result (object): Populated on failure. Contains failure_reason with the error message from the cloud provider.
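The fields above can be read uniformly across backends with a few small helpers. This is a minimal sketch: the function names are hypothetical, and treating COMPLETED, FAILED, CANCELED, and EXPIRED as the terminal states is an inference from the status description rather than a documented constant.

```python
# Assumption: these four statuses are the terminal ones listed for `status`.
TERMINAL_STATUSES = frozenset({"COMPLETED", "FAILED", "CANCELED", "EXPIRED"})


def is_terminal(job):
    """Return True once the job can no longer change state."""
    return job.get("status") in TERMINAL_STATUSES


def native_job_name(job):
    """Return the cloud-side job name, whichever key the backend uses."""
    identifier = job.get("cloud_job_identifier") or {}
    # job_name for Vertex AI and Azure ML, training_job_name for SageMaker.
    return identifier.get("job_name") or identifier.get("training_job_name")


def failure_reason(job):
    """Return the provider's error message, or None if the job has not failed."""
    result = job.get("result") or {}
    return result.get("failure_reason")
```

Because cloud_job_identifier and result are not marked as required, the helpers tolerate both a missing key and a null value.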

Check Status

curl "https://train.your-sgp-deployment-url/v1/jobs/$JOB_ID" \
  -H "x-api-key: $SGP_API_KEY" \
  -H "x-selected-account-id: $SGP_ACCOUNT_ID"
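To wait for a job to finish, the status request above can be wrapped in a polling loop. The sketch below uses only the Python standard library. fetch_job and wait_for_job are hypothetical helper names, the poll interval and timeout are arbitrary defaults, and the set of terminal statuses is inferred from the status field description.

```python
import json
import time
import urllib.request

# Assumption: these are the states after which status no longer changes.
TERMINAL_STATUSES = frozenset({"COMPLETED", "FAILED", "CANCELED", "EXPIRED"})


def fetch_job(base_url, job_id, api_key, account_id):
    """GET /v1/jobs/{job_id} and return the decoded job object."""
    request = urllib.request.Request(
        f"{base_url}/v1/jobs/{job_id}",
        headers={"x-api-key": api_key, "x-selected-account-id": account_id},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(fetch, poll_seconds=30.0, timeout_seconds=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until the job is terminal; raise TimeoutError past the deadline."""
    deadline = clock() + timeout_seconds
    while True:
        job = fetch()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still {job['status']} after {timeout_seconds}s")
        sleep(poll_seconds)
```

A typical call passes a closure over the credentials, for example wait_for_job(lambda: fetch_job(base_url, job_id, api_key, account_id)). The sleep and clock parameters are injectable so the loop can be exercised without real delays.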

Other Endpoints

List jobs:
curl "https://train.your-sgp-deployment-url/v1/jobs" \
  -H "x-api-key: $SGP_API_KEY" \
  -H "x-selected-account-id: $SGP_ACCOUNT_ID"
Cancel a job:
curl -X POST "https://train.your-sgp-deployment-url/v1/jobs/$JOB_ID/cancel" \
  -H "x-api-key: $SGP_API_KEY" \
  -H "x-selected-account-id: $SGP_ACCOUNT_ID"
The cancel request returns HTTP 409 if the job is already in a terminal state.
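A client will often want to treat the 409 as "nothing left to cancel" rather than as an error. The sketch below shows one way to do that with the standard library; cancel_job is a hypothetical helper, and its True/False return convention is a design choice of this example, not something the API prescribes.

```python
import urllib.error
import urllib.request


def cancel_job(base_url, job_id, api_key, account_id,
               opener=urllib.request.urlopen):
    """POST /v1/jobs/{job_id}/cancel.

    Returns True if the cancel request was accepted and False if the server
    answered 409 (job already in a terminal state). Other HTTP errors propagate.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/jobs/{job_id}/cancel",
        method="POST",
        headers={"x-api-key": api_key, "x-selected-account-id": account_id},
    )
    try:
        with opener(request):
            return True
    except urllib.error.HTTPError as error:
        if error.code == 409:
            return False
        raise
```

urllib raises HTTPError for any 4xx or 5xx response, which is why the 409 case is handled in the except branch. The opener parameter is there only to allow stubbing the network call.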

Accessing Mounts in Your Container

Inside the training container, the path of each storage mount is exposed through an environment variable named STORAGE_MOUNT_<index>_PATH, regardless of cloud:
import os

data_path = os.environ["STORAGE_MOUNT_0_PATH"]
output_path = os.environ["STORAGE_MOUNT_1_PATH"]
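When the number of mounts varies between jobs, the paths can be collected in a loop instead of being read one by one. This is a minimal sketch that assumes the variables are numbered consecutively from 0, which matches the two variables shown above but is not stated explicitly. storage_mount_paths is a hypothetical helper name.

```python
import os
from pathlib import Path


def storage_mount_paths(environ=os.environ):
    """Collect STORAGE_MOUNT_<n>_PATH values for n = 0, 1, ... until one is missing."""
    paths = []
    index = 0
    # Assumption: indices are consecutive, so the first gap marks the end.
    while (value := environ.get(f"STORAGE_MOUNT_{index}_PATH")) is not None:
        paths.append(Path(value))
        index += 1
    return paths
```

In a training script, storage_mount_paths() with no arguments reads the real process environment; passing a plain dict makes the function easy to test outside the container.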

Next Steps