Publication

AIMS: An Adaptive Intelligent Multi-Objective Scheduler Powered by Digital Twins

Adimora, Kyrian
Sun, Hongyang
Abstract
High-performance computing (HPC) systems face challenges in jointly optimizing energy efficiency, reliability, and job throughput across heterogeneous architectures. Traditional schedulers rely on fixed heuristics that struggle under dynamic conditions and conflicting objectives. We present AIMS (Adaptive Intelligent Multi-Objective Scheduler), a framework that combines uncertainty-aware deep reinforcement learning with digital twin technology for autonomous HPC scheduling. AIMS features a five-layer architecture with four predictive digital twins: fault prediction via LSTM-attention, energy forecasting using a CNN-LSTM hybrid, ensemble performance modeling, and physics-informed thermal analysis. These models supply state information to a Dueling DQN with epistemic uncertainty quantification, enabling adaptive decision-making. Policy gradient-based weight evolution drives multi-objective optimization toward Pareto-efficient scheduling. Tested on 389,620 production job records from the Aurora, Polaris, Mira, and Cooley systems, AIMS outperforms ten baselines, including Slurm and PBS Pro, achieving 12.1% higher energy efficiency, 8.7% improved reliability, and 15.3% greater throughput. Under fault conditions, its uncertainty-aware design yields a performance gain of 12.8% versus 5.2% in stable settings. Scalability tests confirm real-time operation at exascale (up to 262K nodes) with sub-50 ms decision latency. AIMS offers a robust and scalable solution for next-generation HPC scheduling where traditional methods fall short.
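To illustrate what "Pareto-efficient scheduling" means in the abstract, the following is a minimal sketch of filtering candidate schedules down to the non-dominated set. It is not the AIMS implementation: the function names (`dominates`, `pareto_front`), the three-tuple layout (energy efficiency, reliability, throughput), and the sample values are all assumptions made for illustration, and every objective is assumed to be "higher is better".

```python
from typing import List, Tuple

# Hypothetical objective vector: (energy_efficiency, reliability, throughput).
# Assumption: all three objectives are to be maximized.
Objectives = Tuple[float, float, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """Return True if a is at least as good as b everywhere and strictly better somewhere."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(candidates: List[Objectives]) -> List[Objectives]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Illustrative, made-up candidate schedules (not data from the paper).
candidates = [
    (0.80, 0.90, 100.0),
    (0.85, 0.88, 110.0),
    (0.70, 0.85, 95.0),
    (0.85, 0.90, 100.0),
]

front = pareto_front(candidates)
print(front)
```

In this toy example two candidates survive, because each is better than the other on at least one objective; a scheduler that targets Pareto efficiency would choose among such trade-offs rather than among dominated options.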
Description
These are the slides from a presentation given at the IEEE High Performance Extreme Computing Conference held in Boston/virtually on 09/17/2025.
Date
2025-09-17
Publisher
University of Kansas
Keywords
HPC scheduling, Multi-objective optimization, Deep reinforcement learning, Digital twin, Energy efficiency