# Introduction to Train

> Run GPU training jobs across GCP, AWS, and Azure through a unified API.

**Train is SGP's multi-cloud training job service.** It provides a unified REST API for submitting GPU training jobs to Vertex AI (GCP), SageMaker (AWS), and Azure ML, and it handles cloud-specific job submission, image resolution, storage mounting, and status tracking for you.

***

## Why Use Train?

Each cloud ML platform has its own job submission API, container registry, storage format, and authentication model. Teams that want cloud flexibility end up maintaining separate training pipelines per cloud, or locking into one provider early.

Train solves this by sitting in front of all three clouds with a single API surface. You submit a job with an image and a command; Train takes care of pushing the image to the right registry, mounting your storage, and translating the job config into the cloud's native format. Your training code doesn't know or care which cloud it's running on.

* **Single API for all backends:** `vertex-ai`, `sagemaker`, and `azure-ml` are selectable per job
* **Unified image registry:** Push Docker images once to the registry proxy; Train resolves them to the correct cloud registry at submission time
* **Cross-cloud storage mounts:** Mount GCS, S3, or Azure Blob Storage with `STORAGE_MOUNT_*_PATH` env vars, using the same interface everywhere
* **Consistent job lifecycle:** `PENDING → IN_PROGRESS → COMPLETED / FAILED / CANCELED / EXPIRED` across all backends
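
The lifecycle in the last bullet can be modeled on the client side as a small enum. This is a minimal sketch, not part of any official SDK: the `JobStatus` class and `is_terminal` helper are hypothetical names, and it assumes the four states to the right of the arrow are terminal.

```python
from enum import Enum


class JobStatus(Enum):
    """Job states listed in the Train lifecycle."""

    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"
    EXPIRED = "EXPIRED"


# Assumption: every state to the right of the arrow is final,
# so a job in one of these states will not change again.
TERMINAL_STATUSES = frozenset(
    {
        JobStatus.COMPLETED,
        JobStatus.FAILED,
        JobStatus.CANCELED,
        JobStatus.EXPIRED,
    }
)


def is_terminal(status: str) -> bool:
    """Return True if the status string names a final state.

    Raises ValueError for a string that is not a known status.
    """
    return JobStatus(status) in TERMINAL_STATUSES
```

A client that polls for job status could stop as soon as `is_terminal` returns `True`, regardless of which backend ran the job.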

***

## How It Works

### Train Service

The Train service has two components: a gateway that accepts job submissions and returns a job ID, and a background poller that tracks running jobs and updates their status.

### Supported Backends

| Backend   | Cloud service         | `{backend}` value |
| --------- | --------------------- | ----------------- |
| Vertex AI | GCP Custom Jobs       | `vertex-ai`       |
| SageMaker | AWS Training Jobs     | `sagemaker`       |
| Azure ML  | Azure ML Command Jobs | `azure-ml`        |
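
If you build job requests programmatically, it can help to validate the `{backend}` value before submitting. The sketch below mirrors the table above; the `BACKENDS` mapping and `cloud_service_for` helper are hypothetical names used for illustration only.

```python
# Backend identifiers and the cloud service each one maps to,
# copied from the Supported Backends table.
BACKENDS = {
    "vertex-ai": "GCP Custom Jobs",
    "sagemaker": "AWS Training Jobs",
    "azure-ml": "Azure ML Command Jobs",
}


def cloud_service_for(backend: str) -> str:
    """Return the cloud service behind a backend value, or raise ValueError."""
    try:
        return BACKENDS[backend]
    except KeyError:
        valid = ", ".join(sorted(BACKENDS))
        raise ValueError(
            f"Unknown backend {backend!r}; expected one of: {valid}"
        ) from None
```

Failing early on a misspelled value such as `vertex_ai` gives a clearer error than a rejected request.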

### Container Registry

Each workspace has a Docker-compatible registry. You push images to it, and Train ensures that the exact image you pushed is the one that runs on the cloud backend, even if the tag is later updated to point to a different image.

***

## Storage Mounts

Training containers on all clouds receive `STORAGE_MOUNT_0_PATH`, `STORAGE_MOUNT_1_PATH`, and so on as environment variables, each pointing to a mounted directory. Your training script reads the path from `os.environ["STORAGE_MOUNT_0_PATH"]`; no cloud-specific paths or SDKs are needed.
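
As an illustration, a training script could collect all of its mounts with a small helper like the one below. This is a sketch under one assumption that the documentation does not state explicitly: mount indices start at 0 and are contiguous. The `storage_mounts` function name is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping


def storage_mounts(env: Mapping[str, str] = os.environ) -> list[Path]:
    """Collect STORAGE_MOUNT_<n>_PATH values in index order.

    Assumption: indices start at 0 and have no gaps, so the scan
    stops at the first index whose variable is missing.
    """
    mounts: list[Path] = []
    index = 0
    while (value := env.get(f"STORAGE_MOUNT_{index}_PATH")) is not None:
        mounts.append(Path(value))
        index += 1
    return mounts
```

Inside a training container you would call `storage_mounts()` with no arguments. Passing a plain dictionary instead makes the helper easy to test locally.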

Source URIs use cloud-native formats:

| Cloud | URI format                             |
| ----- | -------------------------------------- |
| GCP   | `gs://bucket/path`                     |
| AWS   | `s3://bucket/path`                     |
| Azure | `azureml://datastores/name/paths/path` |
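
Because each cloud uses a distinct URI scheme, you can infer the cloud from a source URI before submitting a job. The following sketch uses only the schemes in the table above; `SCHEME_TO_CLOUD` and `cloud_for_uri` are hypothetical names, not part of the Train API.

```python
from urllib.parse import urlparse

# URI schemes from the table above, mapped to their cloud.
SCHEME_TO_CLOUD = {
    "gs": "GCP",
    "s3": "AWS",
    "azureml": "Azure",
}


def cloud_for_uri(uri: str) -> str:
    """Return the cloud that a storage source URI belongs to."""
    scheme = urlparse(uri).scheme
    try:
        return SCHEME_TO_CLOUD[scheme]
    except KeyError:
        raise ValueError(f"Unsupported storage URI scheme in {uri!r}") from None
```

A check like this can catch a mismatch early, for example an `s3://` source attached to a job that targets `vertex-ai`. Whether Train itself rejects such combinations is not stated here.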

***

## Common Use Cases

* **Improve agent accuracy on your domain:** Fine-tune a foundation model on your proprietary data so your agents produce more reliable, on-brand responses and make fewer errors on the tasks that matter to your business.
* **Optimize agent performance with feedback:** Use evaluations and human feedback collected in SGP to run reinforcement learning pipelines that directly improve how your agents behave over time.
* **Reduce inference cost and latency:** Distill a large general-purpose model into a smaller one trained specifically for your use case, so agents respond faster and cost less to run at scale.

***

## Authentication

Train uses the same authentication as other SGP services:

```
x-api-key: <your_api_key>
x-selected-account-id: <your_account_id>
```
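
In a Python client, these two headers can be built once and reused for every request. This is a minimal sketch: the `auth_headers` helper is a hypothetical name, and only the header names come from the block above.

```python
def auth_headers(api_key: str, account_id: str) -> dict[str, str]:
    """Build the two SGP authentication headers for an HTTP request."""
    if not api_key or not account_id:
        raise ValueError("Both api_key and account_id are required")
    return {
        "x-api-key": api_key,
        "x-selected-account-id": account_id,
    }
```

Pass the returned dictionary as the `headers` argument of whichever HTTP client you use, and load the credentials from a secret store or environment variables instead of hard-coding them.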

The registry proxy accepts Docker Basic auth (`account_id:api_key`) during `docker login`, then issues a short-lived token for push and pull operations.
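
The Docker CLI encodes these credentials for you, so you do not normally build the header yourself. For reference, standard HTTP Basic authentication base64-encodes the `username:password` pair. The sketch below shows that encoding with the account ID as the username and the API key as the password. The `basic_auth_header` helper is hypothetical, and it assumes the registry proxy follows the standard Basic scheme.

```python
import base64


def basic_auth_header(account_id: str, api_key: str) -> str:
    """Return an HTTP Basic Authorization header value.

    Assumption: the account ID is the username and the API key is
    the password, matching the account_id:api_key order above.
    """
    credentials = f"{account_id}:{api_key}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"
```

This can be useful when debugging registry authentication with a plain HTTP client instead of `docker login`.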

***

## Getting Started

To begin using Train, you'll need a Scale account with SGP access and credentials for at least one cloud backend.

<CardGroup cols={3}>
  <Card title="Getting Started" icon="rocket" href="/docs/capabilities/training/getting-started-with-train">
    Submit your first training job
  </Card>

  <Card title="Custom Images" icon="docker" href="/docs/capabilities/training/custom-images">
    Push and use your own Docker images
  </Card>

  <Card title="Storage Mounts" icon="hard-drive" href="/docs/capabilities/training/storage-mounts">
    Mount training data and output directories across clouds
  </Card>
</CardGroup>
