Amazon’s AWS believes it has finally created a cloud service that will break through with HPC and supercomputing customers. The cloud provider announced the commercial availability of Parallel Computing Service (PCS), which it hopes will draw skeptical high-performance computing and supercomputing customers into the cloud.
Parallel Computing Service is a managed offering that lets customers set up and manage high-performance computing (HPC) clusters.
“We can now feel that we’ve increased the level of ease of use to ensure that customers can more easily migrate their HPC workloads onto AWS,” Ian Colle, general manager for advanced computing and solutions at AWS, told HPCwire.
Before PCS, orchestrating high-performance computing on AWS wasn’t easy, requiring a lot of do-it-yourself work with tools like the open-source AWS ParallelCluster.
PCS takes away that friction, and customers can manage AWS HPC clusters the same way they manage on-premises environments. Colle said that makes it easier for them to migrate workloads to the cloud.
The key addition here is the Slurm (Simple Linux Utility for Resource Management) scheduler, which manages the workloads. PCS also allows customers to orchestrate their own storage, networking, and other components in an HPC cluster.
Colle described Slurm as “the most popular scheduler out there” globally for HPC workloads.
“A number of customers said, ‘If we just had a fully managed Slurm offering on AWS, that would make our lives so much easier.’ We took the learnings from ParallelCluster, and the customer feedback, and that’s what we’ve created,” Colle said.
HPC customers can replicate Slurm scripts from on-premises environments, which should make the transition to cloud-based HPC easier.
“We’re going to help you set up a compute queue and scheduling queue, and you’re going to go off and be able to start running HPC jobs within minutes with AWS,” Colle said.
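A minimal Slurm batch script of the kind customers could carry over largely unchanged might look like this (the partition name, resource counts, and solver binary are illustrative, not taken from AWS documentation):

```shell
#!/bin/bash
#SBATCH --job-name=cfd-run        # job name shown in squeue output
#SBATCH --partition=compute       # queue/partition (site-specific name)
#SBATCH --nodes=2                 # number of compute nodes
#SBATCH --ntasks-per-node=32      # MPI ranks per node
#SBATCH --time=01:00:00           # wall-clock limit (HH:MM:SS)

# srun launches the MPI application across the allocated nodes
srun ./my_solver input.dat
```

Submitted with `sbatch job.sh`, the same script should behave the same whether the Slurm controller runs on premises or in a PCS-managed cluster.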
Breakthrough For Cloud
Colle joined AWS in 2017, and his goal was to make the cloud more accessible to HPC.
“This service … takes away so much of the friction that a customer would have trying to instantiate AWS resources to run their HPC workloads,” Colle said.
HPC customers have been slow to move to the cloud for many reasons. However, Colle said that the managed service would simplify the use of AWS’s hardware and software resources.
Amazon started with a series of HPC offerings, including the Elastic Fabric Adapter, a low-latency interconnect, and FSx for Lustre, a managed Lustre file system.
It introduced its own CPU, called Graviton, now in its fourth generation. A chip called Nitro facilitates data movement and security in the AWS infrastructure.
HPC Forced to Cloud?
At ISC, keynote speaker Kathy Yelick said that most accelerators are locked down by hyperscalers, which are not making their chips commercially available.
She also recommended working closely with hyperscalers “to make sure we’re building systems that are of interest to their market as well as our market.”
Colle said PCS could be one such offering as it allows data scientists and researchers to run applications in minutes.
Some HPC customers are determined not to move to the cloud for reasons that include security, bandwidth concerns, and hardware optimizations.
Amazon isn’t selling its ARM-based Graviton CPUs and is gobbling up limited stocks of Nvidia GPUs. AWS also offers its own homegrown inferencing and training chips through its cloud.
Fewer supercomputers are being built, and computing speeds are flattening out. The biggest supercomputers are being built by cloud providers.
That may ultimately force HPC customers to move applications to the cloud.
Rule of the ARM
AWS is thinking big picture regarding power efficiency, and ARM is central to its plans.
“We’re now on our fourth-generation ARM chip, and now with, especially Virtual Fugaku, … we’re enabling customers to run the exact same ARM-based workloads that they were running on an actual supercomputer in Japan,” Colle said.
HPC customers typically optimize their acceleration for Nvidia and AMD GPUs, not for Graviton and its companion accelerators. However, AWS is investing heavily in ARM, much like Nvidia, whose upcoming Grace-Blackwell superchip pairs an ARM-based Grace CPU with Blackwell GPUs.
“In a similar manner, we could craft, similar to the Grace Blackwell, a Graviton Blackwell because of that similarity in the ARM ecosystem,” Colle said.
For now, Colle said, “I’m sure you’re going to hear more from Nvidia about the things that we’re doing with their Blackwell, with their B100s … and B200s.”
However, the ultimate goal is to eliminate the hardware complications so HPC customers can focus on the results.
“We’re excited about SageMaker because it helps customers use resources efficiently without needing to worry about the underlying hardware details,” Colle said.
Storage Trends
Colle described himself as a Lustre guy and said AWS offers customers a family of file systems to ensure they can find a performant POSIX-compliant file system on AWS.
However, Colle has noticed a significant shift in customer behavior: many customers are moving some of their workloads to object storage.
This trend is driven by the realization that many workloads don’t require the full POSIX-compliant stack. Customers are finding that the overhead of a traditional file system can significantly slow down workloads that don’t need its semantics.
Additionally, there’s a cost benefit, as customers can save a lot of money by moving to object storage. Many customers are adopting a hybrid approach by “doing a combination of smart tiering,” Colle said.
This strategy involves using Lustre “when the portion of their workload requires the full POSIX semantics” while “aging it off to S3 for more long-term storage at a lower cost.”
Colle said this approach allows customers to optimize performance and cost-efficiency in their storage solutions.
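One way to sketch that tiering as a script (an illustration, not AWS’s mechanism; FSx for Lustre also offers built-in S3 integration) is to sweep files that haven’t been read recently off the file system into an S3 bucket. The mount point, age threshold, and bucket name below are hypothetical:

```shell
#!/bin/bash
# Copy files not accessed in 30+ days from the Lustre mount to S3,
# preserving each file's relative path as its object key.
MOUNT=/fsx/project
BUCKET=s3://example-archive-bucket

# -atime +30 matches files whose last access is more than 30 days ago
find "$MOUNT" -type f -atime +30 -print0 |
while IFS= read -r -d '' file; do
    aws s3 cp "$file" "$BUCKET/${file#"$MOUNT"/}"
done
```

A production version would also remove or stub the archived files locally; this sketch only copies them.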
SLURMing It
Slurm environments can be deployed in PCS via the AWS Management Console, the AWS CLI, and standard API calls.
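As a rough sketch, creating a PCS cluster from the CLI looks something like the following; the command and flag syntax here are assumptions based on the service’s launch-era API, so consult the current AWS CLI reference before relying on them. The cluster name, Slurm version, and network IDs are placeholders:

```shell
# Create a small PCS cluster with a managed Slurm controller
# (flags shown are illustrative; see `aws pcs help` for the real syntax)
aws pcs create-cluster \
    --cluster-name demo-hpc \
    --scheduler type=SLURM,version=23.11 \
    --size SMALL \
    --networking subnetIds=subnet-0123456789abcdef0,securityGroupIds=sg-0123456789abcdef0
```

Compute node groups and job queues would then be added with companion commands before jobs can be submitted through Slurm.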
AWS parallel computing has become easy enough that data scientists can use it directly. Cost controls and budgeting are available through the standard AWS tools.
“Customers can actually put that decision-making down to the individual scientists and engineer level and allow them to make that decision on how to best migrate their workloads and which workloads can best benefit from HPC,” Colle said.
RONIN, an AWS partner, previously built out the infrastructure on the AWS ParallelCluster open-source toolkit.
“Now they can use more of that standard AWS API-driven development, and they’re going to retool their offerings to use this AWS PCS because of the ease of use and the simplification of being integrated across AWS services,” Colle said.