Argonne-Led Team Wins Gordon Bell Special Prize for COVID-19 Research

Using the ALCF’s Polaris and Cerebras CS-2 systems, researchers developed the first genome-scale language model to study the evolutionary dynamics of SARS-CoV-2.

Members of the GenSLMs team gather in front of the U.S. Department of Energy's booth at the SC22 conference.

A multi-institutional team led by researchers from Argonne National Laboratory received the ACM Gordon Bell Special Prize for HPC-Based COVID-19 Research at the SC22 conference for their innovative use of large language models (LLMs) to quickly identify potential SARS-CoV-2 variants of concern.

The team, which also included researchers from the University of Chicago, NVIDIA, Cerebras Systems, University of Illinois Chicago, Northern Illinois University, Caltech, Harvard University, Arizona State University, and Technical University of Munich, was recognized for their efforts to create the first genome-scale language models (GenSLMs) for understanding the evolution of SARS-CoV-2. Their research has the potential to transform how scientists identify and classify new and emergent variants of SARS-CoV-2 and other pandemic-causing viruses.

Breaking New Ground with LLMs

The researchers used the ALCF’s Polaris supercomputer, the Cerebras CS-2 system in the ALCF AI Testbed, and NVIDIA’s Selene supercomputer to develop and train large language models that track genetic mutations in SARS-CoV-2 and predict variants of concern.

The project involved creating some of the largest biological language models to date, with 2.5 billion and 25 billion trainable parameters, trained on a diverse set of more than 100 million prokaryotic gene sequences. The work is one of the first examples of foundation models trained on raw nucleotide sequences demonstrating substantial improvement in predictive performance for identifying variants.

As part of the effort, the team also showcased training and scaling foundation models on both conventional GPU-based supercomputers (Polaris and Selene) and emerging AI accelerators (Cerebras CS-2), and set high-water marks for time-to-solution, with model performance measured by perplexity or accuracy. Underscoring how computationally intensive GenSLM training can be, the training runs totaled nearly 1.5 zettaflops (roughly 1.5 × 10^21 floating-point operations).
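
For readers unfamiliar with perplexity, the short Python sketch below shows how the metric is conventionally computed for a language model: as the exponential of the average negative log-probability the model assigns to each observed token. It is an illustration of the standard definition only, not the GenSLMs team's evaluation code. The function name and the probability values are hypothetical.

```python
import math


def perplexity(token_probs):
    """Return exp(mean negative log-probability) of the observed tokens.

    token_probs: probabilities (each in (0, 1]) that a model assigned to the
    tokens that actually occurred. Lower perplexity means better predictions.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical values, for illustration only.
# A model guessing uniformly among four symbols scores approximately 4.0.
uniform_guess = [0.25] * 8
# A model that is usually confident and correct scores close to 1.
confident_model = [0.9, 0.8, 0.95, 0.85]

print(perplexity(uniform_guess))
print(perplexity(confident_model))
```

A perplexity of 1 would mean the model predicted every token with certainty, so a model that reaches a given perplexity in less training time has a shorter time-to-solution.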

The team has made their models and results available to the scientific community for further research, noting that the full potential of the approach on large biological data has yet to be realized. Their research campaign offers a glimpse of a future in which the integration of HPC and AI resources continues to enable new scientific opportunities and outcomes.