Advanced Research Computing’s new CPU system will help Virginia Tech get more science done each day | Virginia Tech News

Research at Virginia Tech is about to get a boost through a new high-performance computing system available through Advanced Research Computing, a unit of the Division of Information Technology.

The university’s central resource for high-performance computing (HPC), Advanced Research Computing provides computing systems known as “clusters,” along with storage and visualization resources. Its team of computational scientists, systems engineers, and developers offers consulting services to help researchers use the unit’s systems and software.

“Virginia Tech cultivates a broad research profile, and we see it as our responsibility to host capable, scalable computational resources that enable researchers across the disciplinary spectrum to tackle cutting-edge discovery,” said Matthew Brown, computational scientist for Advanced Research Computing (ARC). “ARC’s clusters are constantly running close to their top capacity, reflecting the ongoing and expanding computational work being conducted at Virginia Tech. Our latest computing architecture makes it possible to create simulations and analyses from traditional HPC workloads in far greater detail than we ever could before.”

Meet Owl

Owl is Advanced Research Computing’s newest CPU cluster — CPU stands for central processing unit, often dubbed the “brains” behind a computer. CPU clusters are optimal for researchers who need to perform a series of calculations on their data because they excel at completing a task and moving on to the next one very quickly.

Owl contains 84 nodes, the individual computers within the cluster, with a total of 8,064 processing cores and 768 gigabytes of DDR5 memory per node. It also includes three huge-memory nodes: two with 4 terabytes of memory each and one with 8 terabytes.
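The node figures above can be checked with a little arithmetic. The sketch below assumes the 768 gigabytes are per standard node and that cores are split evenly across the 84 standard nodes; both assumptions are consistent with the 8 gigabytes per core quoted in the comparison with TinkerCliffs. The variable names are illustrative only.

```python
# Figures from the article. Assumption: the 768 GB of DDR5 memory is per
# standard node, which is what the quoted 8 GB per core implies.
TOTAL_CORES = 8_064
STANDARD_NODES = 84
MEMORY_PER_NODE_GB = 768

# Assumption: cores are split evenly across the standard nodes.
cores_per_node = TOTAL_CORES // STANDARD_NODES                   # 96
memory_per_core_gb = MEMORY_PER_NODE_GB / cores_per_node         # 8.0
standard_memory_total_gb = STANDARD_NODES * MEMORY_PER_NODE_GB   # 64,512

# The three huge-memory nodes are listed separately from the 84 above.
huge_memory_nodes_tb = [4, 4, 8]

print(f"Cores per standard node: {cores_per_node}")
print(f"Memory per core: {memory_per_core_gb:.1f} GB")
print(f"Memory across standard nodes: {standard_memory_total_gb:,} GB")
print(f"Huge-memory nodes: {len(huge_memory_nodes_tb)} nodes, "
      f"{sum(huge_memory_nodes_tb)} TB total")
```

Because 8,064 divides evenly by 84, each standard node would hold 96 cores, and 768 gigabytes shared across 96 cores works out to exactly 8 gigabytes per core.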

With high memory per core, computations can fly

For a computer to work fast, it needs a lot of processing power, which comes from the number and speed of its computing cores. But it also needs excellent memory, in terms of speed, quantity, and connectivity, to handle the workload at hand.

Think of it like a highway, where the cores set your speed limit and memory provides lanes for your data to travel in. With powerful cores, calculations can be done extremely quickly, but if there’s only one lane, only so much data can move through in any given time period. Increasing memory is like increasing the number of lanes on the highway.

Owl is an eight-lane highway, so to speak. Compared to Advanced Research Computing’s other large CPU cluster, TinkerCliffs, which has 2 gigabytes of memory per core, Owl has 8 gigabytes. This allows researchers using Owl to

  • Conduct more types of calculations simultaneously
  • Increase the level of detail in simulations for higher-resolution results
  • Run jobs quickly and make any needed adjustments earlier in the research process
  • Turn around the results of research more quickly
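To make the memory-per-core comparison concrete, here is a minimal sketch that assumes a job receives memory in proportion to the cores it requests. That is a simplification: actual scheduler policies on either cluster may differ. The 16-core and 256-gigabyte job sizes are hypothetical, and only the 2 GB and 8 GB per-core figures come from the article.

```python
# Memory-per-core figures stated in the article.
GB_PER_CORE = {"TinkerCliffs": 2, "Owl": 8}


def memory_for_cores(cluster: str, cores: int) -> int:
    """Memory (GB) a job would get, assuming memory scales linearly with cores."""
    return GB_PER_CORE[cluster] * cores


def cores_needed(cluster: str, memory_gb: int) -> int:
    """Smallest core count whose proportional memory share covers memory_gb."""
    per_core = GB_PER_CORE[cluster]
    return -(-memory_gb // per_core)  # ceiling division on integers


# Hypothetical job sizes, chosen only for illustration.
for cluster in GB_PER_CORE:
    print(f"{cluster}: 16 cores -> {memory_for_cores(cluster, 16)} GB; "
          f"a 256 GB job needs {cores_needed(cluster, 256)} cores")
```

Under this assumption, a memory-hungry job ties up a quarter as many cores on Owl as on TinkerCliffs, leaving the rest free for other calculations.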

Direct cooling improves performance

Owl is the first cluster on Virginia Tech’s campus to use direct-to-node cooling. With this setup, a network of small ducts carrying liquid coolant runs through each node alongside the components that generate the most heat, providing near-instant cooling via conduction. This eliminates the need for bulky, loud fans while providing highly efficient cooling for Owl’s hard-working cores. It also prevents thermal throttling, which happens when processors reduce their computing speed to avoid overheating.

“The effect of power usage effectiveness of a data center utilizing direct-to-node cooling is significant,” said Jeremy Johnson, Advanced Research Computing’s IT operations manager.

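Power usage effectiveness measures how efficiently a data center uses energy. It is expressed as the ratio of the total energy required to run the facility to the energy used by the computing equipment itself. The lower the power usage effectiveness, the more energy efficient the data center housing a high-performance computing cluster is; a value of 1.0 would mean every unit of energy goes to computing, with none spent on cooling or other overhead.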
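The ratio described above can be written as a small function. Power usage effectiveness (PUE) is total facility energy divided by the energy delivered to computing equipment. The energy figures in this sketch are hypothetical and are not measurements from Owl or any Virginia Tech facility.

```python
def power_usage_effectiveness(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """PUE = total facility energy / energy used by computing (IT) equipment."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    if total_facility_kwh < it_equipment_kwh:
        raise ValueError("Total facility energy cannot be less than IT energy")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical figures for two facilities doing the same 1,000 kWh of computing.
less_efficient = power_usage_effectiveness(total_facility_kwh=1_500, it_equipment_kwh=1_000)
more_efficient = power_usage_effectiveness(total_facility_kwh=1_150, it_equipment_kwh=1_000)

print(f"Less efficient facility PUE: {less_efficient:.2f}")  # 1.50
print(f"More efficient facility PUE: {more_efficient:.2f}")  # 1.15
```

In this example, both facilities do the same computing work, but the second spends far less energy on cooling and other overhead, so its PUE sits closer to the ideal of 1.0.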