Published Date : 15/09/2025
Today, Amazon SageMaker HyperPod introduces a new capability that helps improve the training efficiency and reduce the network latency of your AI workloads. This feature, known as topology-aware scheduling, streamlines resource allocation and enhances compute resource utilization across teams and projects on Amazon Elastic Kubernetes Service (Amazon EKS) clusters. Administrators can now govern accelerated compute allocation and enforce task priority policies, improving resource utilization and enabling organizations to accelerate generative AI innovation and reduce time to market.
Generative AI workloads often require extensive network communication across Amazon Elastic Compute Cloud (Amazon EC2) instances. The network bandwidth between these instances significantly impacts both workload runtime and processing latency. The physical placement of instances within a data center's hierarchical infrastructure plays a crucial role in this. Data centers are organized into nested organizational units such as network nodes and node sets, with multiple instances per network node and multiple network nodes per node set. Instances within the same organizational unit communicate over fewer network hops than instances in different units, and fewer network hops between instances result in lower communication latency and faster processing.
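To make the relationship concrete, the following shell sketch compares two instances' network node IDs layer by layer: the deeper the layer they share, the fewer hops separate them. The layer IDs and hop counts here are illustrative assumptions for the sketch, not actual EC2 values.

```
# Illustrative sketch only: layer IDs and hop counts are made up, not EC2 values.
# Usage: hops A_layer1 A_layer2 A_layer3 B_layer1 B_layer2 B_layer3
hops() {
  if   [ "$3" = "$6" ]; then echo 1   # same layer-3 network node: fewest hops
  elif [ "$2" = "$5" ]; then echo 3   # same layer-2 node set
  elif [ "$1" = "$4" ]; then echo 5   # same layer-1 unit
  else                       echo 7   # no shared unit: most hops
  fi
}

hops nn-1 nn-11 nn-111  nn-1 nn-11 nn-111   # co-located at layer 3
hops nn-1 nn-11 nn-111  nn-1 nn-11 nn-222   # shared only through layer 2
```

Topology-aware scheduling uses this kind of layer information to prefer placements in the first category over the last.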
To optimize the placement of your generative AI workloads in your SageMaker HyperPod clusters, you can use EC2 network topology information during your job submissions. EC2 instance topology is described by a set of nodes, with one node in each layer of the network. Refer to the official AWS documentation for details on how EC2 topology is arranged. Network topology labels offer several key benefits:
- Reduced latency by minimizing network hops and routing traffic to nearby instances.
- Improved training efficiency by optimizing workload placement across network resources.
With topology-aware scheduling for SageMaker HyperPod task governance, you can use topology network labels to schedule your jobs with optimized network communication, thereby improving task efficiency and resource utilization for your AI workloads.
In this post, we introduce topology-aware scheduling with SageMaker HyperPod task governance by submitting jobs that represent hierarchical network information. We provide details on how to use SageMaker HyperPod task governance to optimize your job efficiency.
Solution Overview
Data scientists interact with SageMaker HyperPod clusters to train, fine-tune, and deploy models on accelerated compute instances. It’s essential to ensure that data scientists have the necessary capacity and permissions when interacting with clusters of GPUs. To implement topology-aware scheduling, you first confirm the topology information for all nodes in your cluster, run a script to identify instances on the same network nodes, and finally schedule a topology-aware training task on your cluster. This workflow provides higher visibility and control over the placement of your training instances.
Prerequisites
To get started with topology-aware scheduling, you must have the following prerequisites:
- An EKS cluster.
- A SageMaker HyperPod cluster with instances enabled for topology information.
- The SageMaker HyperPod task governance add-on installed (version 1.2.2 or later).
- kubectl installed.
- (Optional) The SageMaker HyperPod CLI installed.
Get Node Topology Information
Run the following commands to show node labels in your cluster. These labels provide network topology information for each instance.
```
kubectl get nodes -L topology.k8s.aws/network-node-layer-1
kubectl get nodes -L topology.k8s.aws/network-node-layer-2
kubectl get nodes -L topology.k8s.aws/network-node-layer-3
```
Instances that share the same layer 3 network node are as close together as possible in the EC2 topology hierarchy. You should see node labels that look like this: `topology.k8s.aws/network-node-layer-3: nn-33333example`. Run the following script to show which nodes in your cluster share the same layer 1, 2, and 3 network nodes:
```
git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/1.architectures/7.sagemaker-hyperpod-eks/task-governance
chmod +x visualize_topology.sh
bash visualize_topology.sh
```
This script prints a flow chart definition that you can render in a flow diagram editor, such as the Mermaid live editor (mermaid.js.org), to visualize the node topology of your cluster. The following figure is an example of the cluster topology for a seven-instance cluster.
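For orientation, a rendered topology chart has roughly the following shape. This is a hand-written Mermaid sketch with hypothetical node and instance IDs, not the script's literal output:

```
graph TD
  nn-1example --> nn-22222example
  nn-22222example --> nn-33333example
  nn-33333example --> hyperpod-i-0aaa
  nn-33333example --> hyperpod-i-0bbb
```

Each path from the root to a leaf traces one instance's placement through the layer 1, 2, and 3 network nodes.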
Submit Tasks
SageMaker HyperPod task governance offers two ways to submit topology-aware tasks. In this section, we discuss these two options, as well as a third alternative that doesn't use task governance.
Modify Your Kubernetes Manifest File
First, you can modify your existing Kubernetes manifest file to include one of two annotation options:
- `kueue.x-k8s.io/podset-required-topology` – Use this option if all pods must be scheduled within the same network node at the specified layer for the job to start.
- `kueue.x-k8s.io/podset-preferred-topology` – Use this option if you would ideally like all pods scheduled within the same network node at the specified layer, but can tolerate a more spread-out placement.
The following code is an example of a sample job that uses the `kueue.x-k8s.io/podset-required-topology` setting to schedule pods that share the same layer 3 network node:
```
apiVersion: batch/v1
kind: Job
metadata:
  name: test-tas-job
  namespace: hyperpod-ns-team-a
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-a-localqueue
    kueue.x-k8s.io/priority-class: inference-priority
spec:
  parallelism: 10
  completions: 10
  suspend: true
  template:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team-a-localqueue
      annotations:
        kueue.x-k8s.io/podset-required-topology: "topology.k8s.aws/network-node-layer-3"
    spec:
      containers:
        - name: main
          # Placeholder container; substitute your training image and command.
          image: public.ecr.aws/docker/library/busybox:1.36
          command: ["sleep", "60"]
      restartPolicy: Never
```
Frequently Asked Questions
Q: What is topology-aware scheduling in Amazon SageMaker HyperPod?
A: Topology-aware scheduling is a feature in Amazon SageMaker HyperPod that optimizes the placement of AI workloads by considering the physical and logical arrangement of resources, reducing network latency and improving training efficiency.
Q: How does topology-aware scheduling improve network latency?
A: By minimizing network hops and routing traffic to nearby instances, topology-aware scheduling reduces the latency of network communications, which is crucial for efficient AI workload processing.
Q: What are the prerequisites for using topology-aware scheduling with SageMaker HyperPod?
A: To use topology-aware scheduling, you need an EKS cluster, a SageMaker HyperPod cluster with instances enabled for topology information, the SageMaker HyperPod task governance add-on installed, Kubectl, and optionally, the SageMaker HyperPod CLI.
Q: How do I get node topology information?
A: You can run kubectl commands to show node labels in your cluster, which provide network topology information. Additionally, you can use a script to visualize the node topology of your cluster.
Q: Can I submit topology-aware tasks using the SageMaker HyperPod CLI?
A: Yes, you can submit topology-aware tasks using the SageMaker HyperPod CLI by including either the `--preferred-topology` or `--required-topology` parameter in your `create job` command.
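Based on those parameters, a CLI submission might look like the following sketch. The exact command name and the other flags shown here are assumptions; only `--preferred-topology` and `--required-topology` are confirmed above, so check the SageMaker HyperPod CLI reference for the current syntax:

```
# Hypothetical invocation; verify command and flag names against your CLI version.
hyperpod create job \
  --job-name test-tas-job \
  --namespace hyperpod-ns-team-a \
  --required-topology topology.k8s.aws/network-node-layer-3
```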