Amazon’s AWS believes it has finally created a cloud service that will break through with HPC and supercomputing customers. The cloud provider announced the commercial availability of Parallel Computing Service (PCS), which it hopes will draw skeptical high-performance computing and supercomputing customers into the cloud.
Parallel Computing Service is a managed offering that lets customers set up and manage high-performance computing (HPC) clusters.
“We can now feel that we’ve increased the level of ease of use to ensure that customers can more easily migrate their HPC workloads onto AWS,” Ian Colle, general manager for advanced computing and solutions at AWS, told HPCwire.
Before PCS, orchestrating high-performance computing on AWS wasn’t easy, requiring a lot of do-it-yourself work with tools like the open-source AWS ParallelCluster.
PCS takes away that friction, and customers can manage AWS HPC clusters the same way they manage on-premises environments. Colle said that makes it easier for them to migrate workloads to the cloud.
The key addition here is the Slurm (Simple Linux Utility for Resource Management) scheduler, which manages the workloads. PCS also allows customers to orchestrate their own storage, networking, and other components in an HPC cluster.
Colle described Slurm as “the most popular scheduler out there” globally for HPC workloads.
“A number of customers said, ‘If we just had a fully managed Slurm offering on AWS, that would make our lives so much easier.’ We took the learnings from ParallelCluster, and the customer feedback, and that’s what we’ve created,” Colle said.
HPC customers can replicate Slurm scripts from on-premises environments, which should make the transition to cloud-based HPC easier.
“We’re going to help you set up a compute queue and scheduling queue, and you’re going to go off and be able to start running HPC jobs within minutes with AWS,” Colle said.
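A minimal Slurm batch script of the kind customers could carry over largely unchanged might look like this (the partition name, resource counts, and solver binary are illustrative, not taken from AWS documentation):

```shell
#!/bin/bash
#SBATCH --job-name=cfd-run        # job name shown in squeue output
#SBATCH --partition=compute       # queue/partition (site-specific name)
#SBATCH --nodes=2                 # number of compute nodes
#SBATCH --ntasks-per-node=32      # MPI ranks per node
#SBATCH --time=01:00:00           # wall-clock limit (HH:MM:SS)

# srun launches the MPI application across the allocated nodes
srun ./my_solver input.dat
```

Submitted with `sbatch job.sh`, the same script should behave the same whether the Slurm controller runs on premises or in a PCS-managed cluster.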
Breakthrough For Cloud
Colle joined AWS in 2017, and his goal was to make the cloud more accessible to HPC.
“This service … takes away so much of the friction that a customer would have trying to instantiate AWS resources to run their HPC workloads,” Colle said.
HPC customers have been slow to move to the cloud for many reasons. However, Colle said that the managed service would simplify the use of AWS’s hardware and software resources.
Amazon started with a series of HPC offerings, including the Elastic Fabric Adapter, a low-latency interconnect, and FSx for Lustre, a managed Lustre file system.
It introduced its own CPU, called Graviton, now in its fourth generation. A chip called Nitro facilitates data movement and security in the AWS infrastructure.
HPC Forced to Cloud?
At ISC, keynote speaker Kathy Yelick said that most accelerators are locked down by hyperscalers, which are not making their chips commercially available.
She also recommended working closely with hyperscalers “to make sure we’re building systems that are of interest to their market as well as our market.”
Colle said PCS could be one such offering as it allows data scientists and researchers to run applications in minutes.
Some HPC customers are determined not to move to the cloud for reasons that include security, bandwidth concerns, and hardware optimizations.
Amazon isn’t selling its ARM-based Graviton CPUs and is gobbling up limited stocks of Nvidia GPUs. AWS also offers its own homegrown inferencing and training chips through its cloud.
Fewer supercomputers are being built, and computing speeds are flattening out. The biggest supercomputers are being built by cloud providers.
That may ultimately force HPC customers to move applications to the cloud.
Rule of the ARM
AWS is thinking big picture regarding power efficiency, and ARM is central to its plans.
“We’re now on our fourth-generation ARM chip, and now with, especially Virtual Fugaku, … we’re enabling customers to run the exact same ARM-based workloads that they were running on an actual supercomputer in Japan,” Colle said.
HPC customers typically optimize their acceleration for Nvidia and AMD GPUs, not for Graviton and its companion accelerators. However, AWS is investing heavily in ARM, much like Nvidia, whose upcoming Grace-Blackwell superchip pairs an ARM-based Grace CPU with Blackwell GPUs.
“In a similar manner, we could craft, similar to the Grace Blackwell, a Graviton Blackwell because of that similarity in the ARM ecosystem,” Colle said.
For now, Colle said, “I’m sure you’re going to hear more from Nvidia about the things that we’re doing with their Blackwell, with their B100s … and B200s.”
However, the ultimate goal is to eliminate the hardware complications so HPC customers can focus on the results.
“We’re excited about SageMaker because it helps customers use resources efficiently without needing to worry about the underlying hardware details,” Colle said.
Storage Trends
Colle described himself as a Lustre guy and said AWS offers customers a family of file systems to ensure they can find a performant POSIX-compliant file system on AWS.
However, Colle has noticed a significant shift in customer behavior: many customers are moving some of their workloads to object storage.
This trend is driven by the realization that many workloads don’t require the full POSIX-compliant stack. Customers are finding that the overhead of a traditional file system can significantly slow down workloads that don’t need its semantics.
Additionally, there’s a cost benefit, as customers can save a lot of money by moving to object storage. Many customers are adopting a hybrid approach by “doing a combination of smart tiering,” Colle said.
This strategy involves using Lustre “when the portion of their workload requires the full POSIX semantics” while “aging it off to S3 for more long-term storage at a lower cost.”
Colle said this approach allows customers to optimize performance and cost-efficiency in their storage solutions.
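One way to sketch that tiering as a script (an illustration, not AWS’s mechanism; FSx for Lustre also offers built-in S3 integration) is to sweep files that haven’t been read recently off the file system into an S3 bucket. The mount point, age threshold, and bucket name below are hypothetical:

```shell
#!/bin/bash
# Copy files not accessed in 30+ days from the Lustre mount to S3,
# preserving each file's relative path as its object key.
MOUNT=/fsx/project
BUCKET=s3://example-archive-bucket

# -atime +30 matches files whose last access is more than 30 days ago
find "$MOUNT" -type f -atime +30 -print0 |
while IFS= read -r -d '' file; do
    aws s3 cp "$file" "$BUCKET/${file#"$MOUNT"/}"
done
```

A production version would also remove or stub the archived files locally; this sketch only copies them.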
SLURMing It
Slurm environments can be deployed in PCS via the AWS Management Console, the AWS CLI, and standard API calls.
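As a rough sketch, creating a PCS cluster from the CLI looks something like the following; the command and flag syntax here are assumptions based on the service’s launch-era API, so consult the current AWS CLI reference before relying on them. The cluster name, Slurm version, and network IDs are placeholders:

```shell
# Create a small PCS cluster with a managed Slurm controller
# (flags shown are illustrative; see `aws pcs help` for the real syntax)
aws pcs create-cluster \
    --cluster-name demo-hpc \
    --scheduler type=SLURM,version=23.11 \
    --size SMALL \
    --networking subnetIds=subnet-0123456789abcdef0,securityGroupIds=sg-0123456789abcdef0
```

Compute node groups and job queues would then be added with companion commands before jobs can be submitted through Slurm.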
AWS parallel computing has become easy enough that data scientists can use it directly. Cost controls and budgeting are available through the standard AWS tools.
“Customers can actually put that decision-making down to the individual scientists and engineer level and allow them to make that decision on how to best migrate their workloads and which workloads can best benefit from HPC,” Colle said.
RONIN, an AWS partner, previously built out the infrastructure on the AWS ParallelCluster open-source toolkit.
“Now they can use more of that standard AWS API-driven development, and they’re going to retool their offerings to use this AWS PCS because of the ease of use and the simplification of being integrated across AWS services,” Colle said.