This guide walks you through submitting a training job using the Train API.
Prerequisites
Before you begin, ensure you have:
- A valid Scale account with SGP access
- Your API key and account ID:
export SGP_API_KEY="your_api_key"
export SGP_ACCOUNT_ID="your_account_id"
- A Docker image pushed to the workspace registry (see Custom Images)
Submit a Job
Jobs are submitted to POST /v1/{backend}/jobs. The {backend} is one of vertex-ai, sagemaker, or azure-ml. The job_config structure mirrors each cloud’s native job API.
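If you prefer to submit from code rather than curl, the same request can be sketched in Python. This is a minimal sketch using only the standard library; the base URL, header names, and backend names come from the examples in this guide, while the function name `submit_job` is illustrative:

```python
import json
import os
import urllib.request

SUPPORTED_BACKENDS = {"vertex-ai", "sagemaker", "azure-ml"}

def submit_job(base_url, backend, job_request):
    """POST a job request to /v1/{backend}/jobs and return the job object."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend}")
    req = urllib.request.Request(
        f"{base_url}/v1/{backend}/jobs",
        data=json.dumps(job_request).encode(),
        headers={
            "x-api-key": os.environ.get("SGP_API_KEY", ""),
            "x-selected-account-id": os.environ.get("SGP_ACCOUNT_ID", ""),
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The `job_request` dict takes the same shape as the JSON bodies in the curl examples below.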
Vertex AI
curl -X POST "https://train.your-sgp-deployment-url/v1/vertex-ai/jobs" \
-H "x-api-key: $SGP_API_KEY" \
-H "x-selected-account-id: $SGP_ACCOUNT_ID" \
-H "Content-Type: application/json" \
-d '{
"name": "mnist-training",
"job_config": {
"worker_pool_specs": [{
"machine_spec": {
"machine_type": "g2-standard-8",
"accelerator_type": "NVIDIA_L4",
"accelerator_count": 1
},
"replica_count": 1,
"container_spec": {
"image_uri": "docker-registry.your-sgp-deployment-url/pytorch-gpu:v1",
"command": ["python3", "train.py"],
"args": ["--epochs", "3"]
}
}],
"scheduling": {"timeout": "86400s"}
},
"storage_mounts": [
{"source_uri": "gs://my-bucket/mnist-data", "mount_path": "/mnt/data", "read_only": true},
{"source_uri": "gs://my-bucket/output/run-1", "mount_path": "/mnt/output", "read_only": false}
]
}'
Example response:
{
"id": "87cae349-cd5f-4202-b674-f1652880a8f5",
"name": "mnist-training",
"job_type": "vertex_ai",
"status": "PENDING",
"cloud_job_identifier": {
"job_name": "projects/my-project/locations/us-east1/customJobs/1234567890"
},
"storage_mounts": [...],
"account_id": "693af856dde70fd6614ab64c",
"created_at": "2026-03-11T08:12:25.542507Z",
"updated_at": "2026-03-11T08:12:25.542507Z"
}
worker_pool_specs is an array; add more worker pools for distributed training. Note that command and args are separate arrays.
SageMaker
curl -X POST "https://train.your-sgp-deployment-url/v1/sagemaker/jobs" \
-H "x-api-key: $SGP_API_KEY" \
-H "x-selected-account-id: $SGP_ACCOUNT_ID" \
-H "Content-Type: application/json" \
-d '{
"name": "mnist-training",
"job_config": {
"image": "docker-registry.your-sgp-deployment-url/pytorch-gpu:v1",
"command": ["python3", "/app/train.py"],
"resource_config": {
"instance_count": 1,
"instance_type": "ml.g4dn.xlarge",
"volume_size_in_gb": 50
},
"hyperparameters": {"learning_rate": "0.001"}
},
"storage_mounts": [
{"source_uri": "s3://my-bucket/mnist-data", "mount_path": "/mnt/data", "read_only": true},
{"source_uri": "s3://my-bucket/output/run-1", "mount_path": "/mnt/output", "read_only": false}
]
}'
Example response:
{
"id": "3f2a1b9c-88de-4a12-b901-c2e5f7d04321",
"name": "mnist-training",
"job_type": "sagemaker",
"status": "PENDING",
"cloud_job_identifier": {
"training_job_name": "mnist-training-2026-03-11-08-12-25"
},
"storage_mounts": [...],
"account_id": "693af856dde70fd6614ab64c",
"created_at": "2026-03-11T08:12:25.542507Z",
"updated_at": "2026-03-11T08:12:25.542507Z"
}
instance_type determines the GPU: ml.g4dn.xlarge = T4, ml.p4de.24xlarge = 8×A100 80GB.
Azure ML
curl -X POST "https://train.your-sgp-deployment-url/v1/azure-ml/jobs" \
-H "x-api-key: $SGP_API_KEY" \
-H "x-selected-account-id: $SGP_ACCOUNT_ID" \
-H "Content-Type: application/json" \
-d '{
"name": "mnist-training",
"job_config": {
"command": "python train.py --epochs 3",
"environment": {
"image": "docker-registry.your-sgp-deployment-url/pytorch-gpu:v1"
},
"compute": "gpu-cluster",
"resources": {
"instance_type": "Standard_NC24ads_A100_v4",
"instance_count": 1
},
"environment_variables": {"LEARNING_RATE": "0.001"}
},
"storage_mounts": [
{"source_uri": "azureml://datastores/traindata/paths/mnist", "mount_path": "/mnt/data", "read_only": true},
{"source_uri": "azureml://datastores/trainoutput/paths/run-1", "mount_path": "/mnt/output", "read_only": false}
]
}'
Example response:
{
"id": "a1b2c3d4-ef56-7890-abcd-ef1234567890",
"name": "mnist-training",
"job_type": "azure_ml",
"status": "PENDING",
"cloud_job_identifier": {
"job_name": "heroic_grass_22zggyxsxp"
},
"storage_mounts": [...],
"account_id": "693af856dde70fd6614ab64c",
"created_at": "2026-03-11T08:12:25.542507Z",
"updated_at": "2026-03-11T08:12:25.542507Z"
}
command is a single shell string (not an array). compute can be a named cluster or "serverless".
Job Response
All backends return the same job object shape:
- id: Unique job identifier. Use it to poll status or cancel the job.
- status: Current job state. Transitions through PENDING → IN_PROGRESS → COMPLETED / FAILED / CANCELED / EXPIRED.
- cloud_job_identifier: The cloud provider's native job identifier. Its shape varies by backend: job_name for Vertex AI and Azure ML, training_job_name for SageMaker.
- Failure details: populated on failure; contains failure_reason with the error message from the cloud provider.
Check Status
curl "https://train.your-sgp-deployment-url/v1/jobs/$JOB_ID" \
-H "x-api-key: $SGP_API_KEY" \
-H "x-selected-account-id: $SGP_ACCOUNT_ID"
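Polling until the job leaves the PENDING/IN_PROGRESS states is a common pattern. A minimal sketch using only the standard library; the terminal states come from the status transitions above, and `wait_for_job` and the poll interval are illustrative choices:

```python
import json
import os
import time
import urllib.request

BASE_URL = "https://train.your-sgp-deployment-url"  # replace with your deployment
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED", "EXPIRED"}

def _headers():
    return {
        "x-api-key": os.environ.get("SGP_API_KEY", ""),
        "x-selected-account-id": os.environ.get("SGP_ACCOUNT_ID", ""),
    }

def get_job(job_id):
    """GET /v1/jobs/{id} and return the parsed job object."""
    req = urllib.request.Request(f"{BASE_URL}/v1/jobs/{job_id}", headers=_headers())
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def is_terminal(status):
    return status in TERMINAL_STATES

def wait_for_job(job_id, poll_interval=30):
    """Poll the job until it reaches a terminal state, then return it."""
    while True:
        job = get_job(job_id)
        if is_terminal(job["status"]):
            return job
        time.sleep(poll_interval)
```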
Other Endpoints
List jobs:
curl "https://train.your-sgp-deployment-url/v1/jobs" \
-H "x-api-key: $SGP_API_KEY" \
-H "x-selected-account-id: $SGP_ACCOUNT_ID"
Cancel a job:
curl -X POST "https://train.your-sgp-deployment-url/v1/jobs/$JOB_ID/cancel" \
-H "x-api-key: $SGP_API_KEY" \
-H "x-selected-account-id: $SGP_ACCOUNT_ID"
Returns 409 if the job is already in a terminal state.
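When cancelling from code, the 409 case is worth handling explicitly rather than treating it as an error. A minimal sketch using only the standard library; the helper names are illustrative:

```python
import os
import urllib.error
import urllib.request

def cancel_url(base_url, job_id):
    """Build the cancel endpoint URL for a job."""
    return f"{base_url}/v1/jobs/{job_id}/cancel"

def cancel_job(base_url, job_id):
    """POST to the cancel endpoint. Returns True if cancellation was
    accepted, False if the job was already terminal (HTTP 409)."""
    req = urllib.request.Request(
        cancel_url(base_url, job_id),
        headers={
            "x-api-key": os.environ.get("SGP_API_KEY", ""),
            "x-selected-account-id": os.environ.get("SGP_ACCOUNT_ID", ""),
        },
        method="POST",
    )
    try:
        urllib.request.urlopen(req)
        return True
    except urllib.error.HTTPError as e:
        if e.code == 409:
            return False
        raise
```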
Accessing Mounts in Your Container
Inside the training container, mounted storage is available via environment variables regardless of cloud:
import os

# STORAGE_MOUNT_<N>_PATH corresponds to the Nth entry in the job
# request's storage_mounts array.
data_path = os.environ["STORAGE_MOUNT_0_PATH"]    # first mount: training data
output_path = os.environ["STORAGE_MOUNT_1_PATH"]  # second mount: writable output
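Anything your script writes under the writable output mount ends up in the backing bucket or datastore. A minimal sketch of that pattern; the local-directory fallbacks, the metrics.json filename, and the metric values are all illustrative, not part of the API:

```python
import json
import os

# Inside the container, the platform injects STORAGE_MOUNT_<N>_PATH variables.
# The local-directory fallbacks below are only so this sketch runs anywhere.
data_path = os.environ.get("STORAGE_MOUNT_0_PATH", "/tmp/data")
output_path = os.environ.get("STORAGE_MOUNT_1_PATH", "/tmp/output")

os.makedirs(output_path, exist_ok=True)

# Write artifacts under the writable output mount; they land in the
# bucket/datastore backing the second storage_mounts entry.
metrics = {"epochs": 3, "final_loss": 0.05}  # placeholder values
with open(os.path.join(output_path, "metrics.json"), "w") as f:
    json.dump(metrics, f)
```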
Next Steps