Overview

System Manager is a Kubernetes operator used to deploy and manage the SGP platform. It is responsible for deploying the SGP platform services and agents to an existing Kubernetes cluster.

Installation

System Manager is installed as a Helm chart into the Kubernetes cluster during the deployment of the SGP platform. See your cloud provider’s corresponding deployment guide for more information.

Configuration

System Manager is configured via system-manager-config.json, which is stored in the cloud provider’s secret manager. To modify the configuration, either update the secret directly or use the System Manager GUI. Then restart the System Manager deployment to apply the changes:
kubectl rollout restart deployment sgp-system-manager -n sgp-system-manager
An example configuration file is shown below. The aws block is only present for AWS deployments; GCP and Azure deployments use equivalent gcp and azure blocks instead. The baseRepository format also varies by cloud provider (e.g., oci://<region>-docker.pkg.dev/<project-id>/sgp-<workspace_id>-helm-repository for GCP, oci://<account_id>.dkr.ecr.<region>.amazonaws.com/sgp-<workspace_id>-helm-repository for AWS).
{
  "cloudProvider": "<aws|azure|gcp>",
  "baseRepository": "<helm-repository-oci-url>",
  "samlSetupEnabled": true,
  "oidcSetupEnabled": true,
  "deploymentURL": "https://<workspace_id>.workspace.egp.scale.com",
  "workspaceId": "<workspace_id>",
  "authType": "<saml|oidc>",
  "baseDomain": "<base_domain>",
  "scaletrain_tenant_prefix": "<scaletrain_tenant_prefix>",
  "train_tenant_prefix": "<train_tenant_prefix>",
  "deployAgentex": true,
  "deploySae": true,
  "aws": {
    "accountId": "<aws_account_id>",
    "region": "<aws_region>",
    "prefix": "<prefix>",
    "modelEngineS3Bucket": "scale-egp-<workspace_id>-ml",
    "sqsQueuePolicyTemplate": "",
    "sqsQueueTagTemplate": "",
    "clusterName": "<cluster_name>",
    "karpenterIrsaArn": "<karpenter_irsa_arn>",
    "targetGroupArn": "<target_group_arn>",
    "nodeSubnets": "<node_subnets>",
    "nodeSecurityGroup": "<node_security_group>",
    "postgresHostTemporal": "<postgres_host_temporal>",
    "compassBucketName": "<compass_bucket_name>",
    "compassMongoHost": "<compass_mongo_host>",
    "compassRedisHost": "<compass_redis_host>",
    "reductoBucketName": "<reducto_bucket_name>",
    "reductoDatabaseUrl": "<reducto_database_url>",
    "reductoIrsaRoleArn": "<reducto_irsa_role_arn>",
    "reductoAzureVisionEndpoint": "<reducto_azure_vision_endpoint>",
    "reductoAzureVisionKey": "<reducto_azure_vision_key>",
    "codeBuildProjectName": "<codebuild_project_name>",
    "codeBuildS3Bucket": "<codebuild_s3_bucket>",
    "codeBuildRegistryUrl": "<codebuild_ecr_registry_url>",
    "codeBuildServiceRoleArn": "<codebuild_service_role_arn>",
    "cloudDeployEnabled": true,
    "dex": {
      "irsaRoleArn": "<duc_api_backend_irsa_role_arn>",
      "prefix": "duc-<workspace_id>"
    },
    "identities": {
      "sgpModels": {
        "irsaArn": "<sgp_models_irsa_arn>",
        "secretArns": {
          "backend": "<sgp_models_backend_secret_arn>",
          "model-providers": "<sgp_models_provider_secret_arn>"
        }
      }
    },
    "train": {
      "irsaRoleArn": "<train_irsa_role_arn>",
      "sagemakerExecutionRoleArn": "<train_sagemaker_execution_role_arn>",
      "sagemakerSecurityGroupId": "<train_sagemaker_security_group_id>",
      "databaseHost": "<train_database_host>",
      "dataBucket": "<train_s3_data_bucket_name>",
      "checkpointsBucket": "<train_s3_checkpoints_bucket_name>",
      "outputBucket": "<train_s3_output_bucket_name>",
      "stagingBucket": "<train_s3_staging_bucket_name>"
    },
    "registryProxy": {
      "irsaRoleArn": "<registry_proxy_irsa_role_arn>",
      "ecrRepositoryUrl": "<registry_proxy_ecr_repository_url>",
      "jwtSecret": "<registry_proxy_jwt_secret>",
      "workspaceConfig": "<workspace_config_json>"
    }
  },
  "frontDoorSSLCertB64": "<istio_certificate_b64_resolved>",
  "frontDoorSSLKeyB64": "<istio_key_b64_resolved>",
  "initialDesiredState": "<desired_state_json>",
  "datadog": {
    "enabled": true,
    "env": "<datadog_context>",
    "clusterName": "<cluster_name>",
    "secretName": "<datadog_secret_name>",
    "irsaRoleArn": "<datadog_irsa_role_arn>"
  }
}
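The provider-specific parts of this file lend themselves to a quick sanity check before the secret is updated. The following is a minimal sketch and not part of System Manager: the function names base_repository_url and check_config are hypothetical, and the only rules encoded are those stated above (the two documented baseRepository formats, the allowed cloudProvider and authType values, and the presence of a matching provider block). The Azure repository format is not given here, so the sketch refuses to guess it.

```python
import json

PROVIDERS = {"aws", "azure", "gcp"}
AUTH_TYPES = {"saml", "oidc"}


def base_repository_url(provider, workspace_id, region, owner_id):
    """Build the Helm repository OCI URL in the documented AWS or GCP format.

    owner_id is the AWS account ID or the GCP project ID (hypothetical
    parameter name chosen for this example).
    """
    repo = f"sgp-{workspace_id}-helm-repository"
    if provider == "aws":
        return f"oci://{owner_id}.dkr.ecr.{region}.amazonaws.com/{repo}"
    if provider == "gcp":
        return f"oci://{region}-docker.pkg.dev/{owner_id}/{repo}"
    # The Azure format is not documented above, so do not invent one.
    raise ValueError(f"no documented baseRepository format for {provider!r}")


def check_config(config):
    """Return a list of problems found in a parsed system-manager-config.json."""
    problems = []
    provider = config.get("cloudProvider")
    if provider not in PROVIDERS:
        problems.append(f"cloudProvider must be one of {sorted(PROVIDERS)}")
    elif provider not in config:
        # Each deployment carries a block named after its cloud provider.
        problems.append(f"missing '{provider}' block")
    if config.get("authType") not in AUTH_TYPES:
        problems.append(f"authType must be one of {sorted(AUTH_TYPES)}")
    return problems


if __name__ == "__main__":
    sample = json.loads('{"cloudProvider": "aws", "authType": "saml", "aws": {}}')
    print(check_config(sample))
    print(base_repository_url("aws", "ws1", "us-west-2", "123456789012"))
```

An empty list from check_config means none of these basic checks failed; it does not confirm that the remaining fields are valid.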

Architecture

System Manager runs as a deployment in the sgp-system-manager namespace. Its GUI is accessible on port 8000.
kubectl port-forward deployment/sgp-system-manager -n sgp-system-manager 8000:8000
Then navigate to http://localhost:8000 in your browser to access the GUI.

Packs

System Manager organizes services into “packs”. Each pack is a collection of resources that are deployed together. A pack generally consists of FluxCD HelmRelease resources plus any other resources needed to support a particular service. When a pack is installed, System Manager renders its resource templates and writes the resulting Kubernetes resources (namespaces, secrets, and FluxCD HelmRelease custom resources) to the cluster. FluxCD then picks up the HelmRelease resources and handles pulling and deploying the Helm charts.
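To make the rendering step concrete, here is a minimal sketch of turning a pack into the kinds of resources described above. It is illustrative only: the render_pack function, the one-namespace-per-pack layout, the chart name, the sgp-helm-repository source name, and the 10-minute interval are all assumptions, since the real pack templates are not shown in this document. The HelmRelease fields follow the FluxCD helm.toolkit.fluxcd.io/v2 API; older FluxCD releases use a v2beta API version instead.

```python
import json


def render_pack(name, chart, source_name="sgp-helm-repository", interval="10m"):
    """Render a hypothetical pack into Kubernetes manifests (as dicts)."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name},
    }
    helm_release = {
        "apiVersion": "helm.toolkit.fluxcd.io/v2",
        "kind": "HelmRelease",
        "metadata": {"name": name, "namespace": name},
        "spec": {
            # How often FluxCD re-checks this release.
            "interval": interval,
            "chart": {
                "spec": {
                    "chart": chart,
                    # Points at a HelmRepository resource (assumed name).
                    "sourceRef": {"kind": "HelmRepository", "name": source_name},
                }
            },
        },
    }
    return [namespace, helm_release]


if __name__ == "__main__":
    for manifest in render_pack("istio", "istio-chart"):
        print(json.dumps(manifest, indent=2))
```

In this model System Manager only writes the manifests; fetching the chart and installing it is left to FluxCD.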

FluxCD

System Manager offloads resource reconciliation to FluxCD. FluxCD runs controllers in the cluster that manage the lifecycle of Kubernetes resources and continuously ensure that their actual state matches the desired state. For more information, see the FluxCD documentation.
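Reconciliation here means repeatedly comparing what should exist with what does exist and correcting the difference. The toy function below illustrates only that idea. It is not FluxCD's implementation, and the name plan_reconcile and the set-based model of cluster state are assumptions made for the example.

```python
def plan_reconcile(desired, actual):
    """Compare desired and actual release names and return the actions needed."""
    desired_set, actual_set = set(desired), set(actual)
    return {
        # Declared but not yet present in the cluster.
        "install": sorted(desired_set - actual_set),
        # Present in the cluster but no longer declared.
        "remove": sorted(actual_set - desired_set),
        # Declared and present; a real controller would still check for drift.
        "keep": sorted(desired_set & actual_set),
    }


if __name__ == "__main__":
    print(plan_reconcile(["istio", "spicedb"], ["spicedb", "legacy"]))
```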

desired-state.json

The collection of packs that System Manager will deploy is defined in the desired-state.json file. This file is stored in the cloud provider’s secret manager. To modify the desired state, you can either update the secret directly or use the System Manager GUI, then trigger reconciliation via the System Manager GUI. A sample desired state file is shown below:
{
  "version": "0.1",
  "packs": [
    { "name": "flux" },
    { "name": "sgp-helm-repository" },
    { "name": "istio" },
    { "name": "spicedb" },
    { "name": "identity-service" },
    { "name": "temporal" },
    { "name": "sgp-apps" }
  ]
}
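Because the desired state is plain JSON, it can be checked before the secret is updated. The following sketch assumes only the structure shown in the sample (a version string and a packs list of objects that each have a name). The function name pack_names and the duplicate check are illustrative assumptions, not documented System Manager behavior.

```python
import json


def pack_names(desired_state_text):
    """Parse desired-state JSON and return the pack names in listed order."""
    state = json.loads(desired_state_text)
    names = []
    for pack in state.get("packs", []):
        name = pack.get("name")
        if not name:
            raise ValueError(f"pack entry without a name: {pack!r}")
        if name in names:
            raise ValueError(f"duplicate pack: {name}")
        names.append(name)
    return names


if __name__ == "__main__":
    sample = '{"version": "0.1", "packs": [{"name": "flux"}, {"name": "istio"}]}'
    print(pack_names(sample))
```

Running a check like this locally catches malformed JSON or a repeated pack entry before reconciliation is triggered from the GUI.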