# Getting Started with Train

> Submit your first GPU training job on Vertex AI, SageMaker, or Azure ML.

This guide walks you through submitting a training job with the Train API, checking its status, canceling it, and reading mounted storage from inside your container.

<h2>Prerequisites</h2>

Before you begin, ensure you have:

* A valid Scale account with SGP access
* Your API key and account ID:

```bash theme={null}
export SGP_API_KEY="your_api_key"
export SGP_ACCOUNT_ID="your_account_id"
```

* A Docker image pushed to the workspace registry (see [Custom Images](/docs/capabilities/training/custom-images))

<h2>Submit a Job</h2>

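Submit jobs with `POST /v1/{backend}/jobs`, where `{backend}` is one of `vertex-ai`, `sagemaker`, or `azure-ml`. The `job_config` structure mirrors each cloud's native job API, so its fields differ between backends. In the examples below, `name` and `storage_mounts` have the same shape for all three.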

<Tabs>
  <Tab title="Vertex AI">
    ```bash theme={null}
    curl -X POST "https://train.your-sgp-deployment-url/v1/vertex-ai/jobs" \
      -H "x-api-key: $SGP_API_KEY" \
      -H "x-selected-account-id: $SGP_ACCOUNT_ID" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "mnist-training",
        "job_config": {
          "worker_pool_specs": [{
            "machine_spec": {
              "machine_type": "g2-standard-8",
              "accelerator_type": "NVIDIA_L4",
              "accelerator_count": 1
            },
            "replica_count": 1,
            "container_spec": {
              "image_uri": "docker-registry.your-sgp-deployment-url/pytorch-gpu:v1",
              "command": ["python3", "train.py"],
              "args": ["--epochs", "3"]
            }
          }],
          "scheduling": {"timeout": "86400s"}
        },
        "storage_mounts": [
          {"source_uri": "gs://my-bucket/mnist-data", "mount_path": "/mnt/data", "read_only": true},
          {"source_uri": "gs://my-bucket/output/run-1", "mount_path": "/mnt/output", "read_only": false}
        ]
      }'
    ```

    ```json Response highlight={4} theme={null}
    {
      "id": "87cae349-cd5f-4202-b674-f1652880a8f5",
      "name": "mnist-training",
      "job_type": "vertex_ai",
      "status": "PENDING",
      "cloud_job_identifier": {
        "job_name": "projects/my-project/locations/us-east1/customJobs/1234567890"
      },
      "storage_mounts": [...],
      "account_id": "693af856dde70fd6614ab64c",
      "created_at": "2026-03-11T08:12:25.542507Z",
      "updated_at": "2026-03-11T08:12:25.542507Z"
    }
    ```

    `worker_pool_specs` is an array; add more pools for distributed training. `command` and `args` are separate arrays.
  </Tab>

  <Tab title="SageMaker">
    ```bash theme={null}
    curl -X POST "https://train.your-sgp-deployment-url/v1/sagemaker/jobs" \
      -H "x-api-key: $SGP_API_KEY" \
      -H "x-selected-account-id: $SGP_ACCOUNT_ID" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "mnist-training",
        "job_config": {
          "image": "docker-registry.your-sgp-deployment-url/pytorch-gpu:v1",
          "command": ["python3", "/app/train.py"],
          "resource_config": {
            "instance_count": 1,
            "instance_type": "ml.g4dn.xlarge",
            "volume_size_in_gb": 50
          },
          "hyperparameters": {"learning_rate": "0.001"}
        },
        "storage_mounts": [
          {"source_uri": "s3://my-bucket/mnist-data", "mount_path": "/mnt/data", "read_only": true},
          {"source_uri": "s3://my-bucket/output/run-1", "mount_path": "/mnt/output", "read_only": false}
        ]
      }'
    ```

    ```json Response highlight={4} theme={null}
    {
      "id": "3f2a1b9c-88de-4a12-b901-c2e5f7d04321",
      "name": "mnist-training",
      "job_type": "sagemaker",
      "status": "PENDING",
      "cloud_job_identifier": {
        "training_job_name": "mnist-training-2026-03-11-08-12-25"
      },
      "storage_mounts": [...],
      "account_id": "693af856dde70fd6614ab64c",
      "created_at": "2026-03-11T08:12:25.542507Z",
      "updated_at": "2026-03-11T08:12:25.542507Z"
    }
    ```

    `instance_type` determines the GPU: `ml.g4dn.xlarge` provides one T4, and `ml.p4de.24xlarge` provides 8×A100 80GB.
  </Tab>

  <Tab title="Azure ML">
    ```bash theme={null}
    curl -X POST "https://train.your-sgp-deployment-url/v1/azure-ml/jobs" \
      -H "x-api-key: $SGP_API_KEY" \
      -H "x-selected-account-id: $SGP_ACCOUNT_ID" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "mnist-training",
        "job_config": {
          "command": "python train.py --epochs 3",
          "environment": {
            "image": "docker-registry.your-sgp-deployment-url/pytorch-gpu:v1"
          },
          "compute": "gpu-cluster",
          "resources": {
            "instance_type": "Standard_NC24ads_A100_v4",
            "instance_count": 1
          },
          "environment_variables": {"LEARNING_RATE": "0.001"}
        },
        "storage_mounts": [
          {"source_uri": "azureml://datastores/traindata/paths/mnist", "mount_path": "/mnt/data", "read_only": true},
          {"source_uri": "azureml://datastores/trainoutput/paths/run-1", "mount_path": "/mnt/output", "read_only": false}
        ]
      }'
    ```

    ```json Response highlight={4} theme={null}
    {
      "id": "a1b2c3d4-ef56-7890-abcd-ef1234567890",
      "name": "mnist-training",
      "job_type": "azure_ml",
      "status": "PENDING",
      "cloud_job_identifier": {
        "job_name": "heroic_grass_22zggyxsxp"
      },
      "storage_mounts": [...],
      "account_id": "693af856dde70fd6614ab64c",
      "created_at": "2026-03-11T08:12:25.542507Z",
      "updated_at": "2026-03-11T08:12:25.542507Z"
    }
    ```

    `command` is a single shell string (not an array). `compute` can be a named cluster or `"serverless"`.
  </Tab>
</Tabs>
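
The same submission can be scripted. The following is a minimal Python sketch rather than an official client: it uses only the standard library, and `BASE_URL`, `build_submit_request`, and `submit_job` are illustrative names, not part of the Train API. It builds a request with the same path, headers, and body fields as the curl examples above.

```python
import json
import urllib.request

# Assumption: replace with your deployment's Train host, as in the curl examples.
BASE_URL = "https://train.your-sgp-deployment-url"
BACKENDS = {"vertex-ai", "sagemaker", "azure-ml"}


def build_submit_request(backend, name, job_config, storage_mounts, api_key, account_id):
    """Build (but do not send) a POST /v1/{backend}/jobs request."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    body = {
        "name": name,
        "job_config": job_config,  # backend-specific, see the tabs above
        "storage_mounts": storage_mounts,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/{backend}/jobs",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "x-api-key": api_key,
            "x-selected-account-id": account_id,
            "Content-Type": "application/json",
        },
    )


def submit_job(request):
    """Send the request and return the parsed job object."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Separating request construction from sending makes it possible to inspect or log the payload before anything reaches the cloud backend. In a script you would typically read `SGP_API_KEY` and `SGP_ACCOUNT_ID` from `os.environ` and pass them as `api_key` and `account_id`.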

<h2>Job Response</h2>

All backends return the same job object shape:

<ResponseField name="id" type="string" required>
  Unique job identifier. Use this to poll status or cancel the job.
</ResponseField>

<ResponseField name="status" type="string" required>
  Current job state. A job starts as `PENDING`, moves to `IN_PROGRESS`, and ends in one of `COMPLETED`, `FAILED`, `CANCELED`, or `EXPIRED`.
</ResponseField>

<ResponseField name="cloud_job_identifier" type="object">
  The cloud provider's native job identifier. Shape varies by backend: `job_name` for Vertex AI and Azure ML, `training_job_name` for SageMaker.
</ResponseField>

<ResponseField name="result" type="object">
  Populated on failure. Contains `failure_reason` with the error message from the cloud provider.
</ResponseField>
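
Because the shape is shared, one set of helpers can read a job object from any backend. This is a small sketch under that assumption. The helper names are hypothetical, and treating `COMPLETED`, `FAILED`, `CANCELED`, and `EXPIRED` as the terminal states is inferred from the status transition described above.

```python
# Assumed terminal states, taken from the status transition listed above.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELED", "EXPIRED"}


def is_terminal(job):
    """True once the job can no longer change state."""
    return job["status"] in TERMINAL_STATUSES


def cloud_job_name(job):
    """Return the provider's native job name, whichever key the backend uses."""
    identifier = job.get("cloud_job_identifier") or {}
    # Vertex AI and Azure ML use job_name; SageMaker uses training_job_name.
    return identifier.get("job_name") or identifier.get("training_job_name")


def failure_reason(job):
    """Return result.failure_reason if the job carries one, else None."""
    return (job.get("result") or {}).get("failure_reason")
```

For the SageMaker response shown earlier, `cloud_job_name` would return `"mnist-training-2026-03-11-08-12-25"` and `is_terminal` would return `False`.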

<h2>Check Status</h2>

```bash theme={null}
# JOB_ID is the "id" field returned when the job was submitted.
curl "https://train.your-sgp-deployment-url/v1/jobs/$JOB_ID" \
  -H "x-api-key: $SGP_API_KEY" \
  -H "x-selected-account-id: $SGP_ACCOUNT_ID"
```
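
To wait for a job to finish, you can call this endpoint in a loop. The sketch below is one possible way to do that, not a prescribed pattern: `fetch_job` and `wait_for_job` are hypothetical helpers, the 30-second interval and one-hour timeout are arbitrary defaults, and the set of terminal states is assumed from the `status` field description.

```python
import json
import time
import urllib.request

# Assumption: replace with your deployment's Train host.
BASE_URL = "https://train.your-sgp-deployment-url"
# Assumed terminal states, based on the status field description.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELED", "EXPIRED"}


def fetch_job(job_id, api_key, account_id):
    """GET /v1/jobs/{job_id} and return the parsed job object."""
    request = urllib.request.Request(
        url=f"{BASE_URL}/v1/jobs/{job_id}",
        headers={"x-api-key": api_key, "x-selected-account-id": account_id},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(job_id, fetch, interval_s=30.0, timeout_s=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch(job_id) until the job reaches a terminal status.

    Raises TimeoutError if the deadline passes first. The sleep and clock
    parameters exist so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        job = fetch(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout_s}s")
        sleep(interval_s)
```

A typical call would be `wait_for_job(job_id, lambda j: fetch_job(j, api_key, account_id))`. Passing the fetch function in keeps the loop independent of how the HTTP call is made.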

<h2>Other Endpoints</h2>

**List jobs:**

```bash theme={null}
curl "https://train.your-sgp-deployment-url/v1/jobs" \
  -H "x-api-key: $SGP_API_KEY" \
  -H "x-selected-account-id: $SGP_ACCOUNT_ID"
```

**Cancel a job:**

```bash theme={null}
curl -X POST "https://train.your-sgp-deployment-url/v1/jobs/$JOB_ID/cancel" \
  -H "x-api-key: $SGP_API_KEY" \
  -H "x-selected-account-id: $SGP_ACCOUNT_ID"
```

The endpoint returns `409` if the job is already in a terminal state (`COMPLETED`, `FAILED`, `CANCELED`, or `EXPIRED`).
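
A script that cancels jobs may want to treat `409` as "nothing left to cancel" rather than as an error. The following sketch shows one way to do that with the standard library. `cancel_job` and its `opener` parameter are hypothetical, and mapping `409` to a `False` return value is a design choice of this example, not behavior defined by the API.

```python
import urllib.error
import urllib.request

# Assumption: replace with your deployment's Train host.
BASE_URL = "https://train.your-sgp-deployment-url"


def cancel_job(job_id, api_key, account_id, opener=urllib.request.urlopen):
    """POST /v1/jobs/{job_id}/cancel.

    Returns True if the request succeeded and False if the server answered
    409 (job already in a terminal state). Other HTTP errors are re-raised.
    """
    request = urllib.request.Request(
        url=f"{BASE_URL}/v1/jobs/{job_id}/cancel",
        method="POST",
        headers={"x-api-key": api_key, "x-selected-account-id": account_id},
    )
    try:
        with opener(request):
            return True
    except urllib.error.HTTPError as err:
        if err.code == 409:
            return False
        raise
```

The `opener` argument defaults to `urllib.request.urlopen` and only exists so the function can be tested without network access.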

<h2>Accessing Mounts in Your Container</h2>

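Inside the training container, each mount's path is exposed through an environment variable, regardless of backend. In the example below, `STORAGE_MOUNT_0_PATH` and `STORAGE_MOUNT_1_PATH` correspond to the first and second entries of `storage_mounts`: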

```python theme={null}
import os

data_path = os.environ["STORAGE_MOUNT_0_PATH"]
output_path = os.environ["STORAGE_MOUNT_1_PATH"]
```
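
If a job has more than two mounts, it can be convenient to collect the paths in a loop. The sketch below assumes the variables follow the `STORAGE_MOUNT_<index>_PATH` pattern with consecutive indexes starting at 0. Only indexes 0 and 1 appear in this guide, so treat the general pattern as an assumption and check the Storage Mounts page. `storage_mount_paths` is a hypothetical helper.

```python
import os


def storage_mount_paths(environ=os.environ):
    """Collect STORAGE_MOUNT_<i>_PATH values in index order.

    Assumption: indexes are consecutive from 0, so the scan stops at the
    first missing variable.
    """
    paths = []
    index = 0
    while True:
        value = environ.get(f"STORAGE_MOUNT_{index}_PATH")
        if value is None:
            return paths
        paths.append(value)
        index += 1
```

With the mounts from the submit examples, this would return a list whose first element is the data mount path and whose second is the output mount path.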

<h2>Next Steps</h2>

* **[Custom Images](/docs/capabilities/training/custom-images)**: Build and push your own training containers
* **[Storage Mounts](/docs/capabilities/training/storage-mounts)**: Mount multiple datasets and configure output paths
