Train mounts cloud storage into your container before the job starts and sets STORAGE_MOUNT_0_PATH, STORAGE_MOUNT_1_PATH, etc. as environment variables. Your training script reads from these paths, with no cloud-specific SDKs or storage code needed.

Mount Configuration

Each entry in storage_mounts maps a cloud URI to a container path:
"storage_mounts": [
  {
    "source_uri": "gs://my-bucket/training-data",
    "mount_path": "/mnt/data",
    "read_only": true
  },
  {
    "source_uri": "gs://my-bucket/output/run-1",
    "mount_path": "/mnt/output",
    "read_only": false
  }
]
Inside the container:
STORAGE_MOUNT_0_PATH=/mnt/data
STORAGE_MOUNT_1_PATH=/mnt/output
Mounts are indexed in order, starting at zero.
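Because mounts are numbered sequentially from zero, a script can discover all of them without hard-coding a count. A minimal sketch (the helper name `discover_mounts` is ours, not part of the platform):

```python
import os

def discover_mounts():
    """Collect mount paths from STORAGE_MOUNT_<i>_PATH env vars, starting at index 0."""
    mounts = []
    i = 0
    while True:
        path = os.environ.get(f"STORAGE_MOUNT_{i}_PATH")
        if path is None:
            # Indices are contiguous, so the first gap marks the end.
            break
        mounts.append(path)
        i += 1
    return mounts
```

This keeps the script agnostic to how many mounts a given job configures.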

URI Formats

Cloud   Format                                  Example
GCP     gs://bucket/path                        gs://my-bucket/datasets/imagenet
AWS     s3://bucket/path                        s3://my-bucket/datasets/imagenet
Azure   azureml://datastores/name/paths/path    azureml://datastores/training/paths/imagenet
Azure storage mounts reference registered Azure ML datastores, not raw Blob Storage URLs. Datastores are configured in your Azure ML workspace.
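For example, a mount entry referencing a datastore named `training` (the datastore name and path here are illustrative) looks the same as any other mount, only the URI scheme differs:

```json
{
  "source_uri": "azureml://datastores/training/paths/imagenet",
  "mount_path": "/mnt/data",
  "read_only": true
}
```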

Accessing Mounts in Your Container

import os
import torch
from datasets import load_dataset

data_path = os.environ["STORAGE_MOUNT_0_PATH"]
output_path = os.environ["STORAGE_MOUNT_1_PATH"]

dataset = load_dataset(data_path)

# ... training loop ...
torch.save(model.state_dict(), f"{output_path}/checkpoint-epoch-{epoch}.pt")
This works identically on GCP, AWS, and Azure.

Read vs. Write

Set read_only based on intent. Use true for input data and pretrained weights; false for outputs, checkpoints, and logs. Marking input mounts read-only prevents accidental writes and may improve performance on some backends.
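Since a misconfigured read_only flag typically only surfaces at the first checkpoint write, it can be worth failing fast at startup. A minimal sketch of such a guard (the helper name `assert_mount_writable` is ours, and the exact error a read-only mount raises may vary by backend):

```python
import os

def assert_mount_writable(path):
    """Fail fast if an output mount cannot be written to."""
    probe = os.path.join(path, ".write_probe")
    try:
        # Attempt a real write; os.access() can be unreliable on FUSE mounts.
        with open(probe, "w") as f:
            f.write("ok")
        os.remove(probe)
    except OSError as e:
        raise RuntimeError(f"Mount {path} is not writable: {e}")
```

Call it on the output mount before the training loop starts, so a bad configuration fails in seconds rather than after the first epoch.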

SageMaker: Input Data Config

SageMaker also supports input_data_config for S3 input channels. Unlike FUSE mounts, channels are downloaded to the instance before training starts, which can be faster for large datasets with random access patterns:
"input_data_config": [{
  "channel_name": "train",
  "data_source": {
    "s3_data_source": {
      "s3_data_type": "S3Prefix",
      "s3_uri": "s3://my-bucket/datasets/train/",
      "s3_data_distribution_type": "FullyReplicated"
    }
  },
  "input_mode": "File"
}]
SageMaker downloads the channel to /opt/ml/input/data/train/ (the directory name comes from channel_name). Channels can be used alongside storage_mounts.
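Inside the container, a File-mode channel is just a local directory under /opt/ml/input/data/<channel_name>/. A small sketch for enumerating its contents (the helper name `list_channel_files` is ours; the base path follows the SageMaker convention above):

```python
import os

def list_channel_files(channel_name, base="/opt/ml/input/data"):
    """Return all files downloaded into a SageMaker File-mode input channel."""
    root = os.path.join(base, channel_name)
    files = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            files.append(os.path.join(dirpath, name))
    return sorted(files)
```

Usage inside training code would be `list_channel_files("train")`, matching the channel_name in input_data_config.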

Next Steps