ECpE’s Zheng receives NSF grant for proposing reliable HPC file system checkers to prevent data corruption

Most likely, all of us have data that we don’t want to lose. But how do we know that data will be reliably kept? Where is it stored, and how does it stay secure?

In national labs, large companies or universities, a high-performance computing (HPC) center is often needed to store large quantities of data. At Iowa State University, there is an HPC center located in Durham Hall.

HPC systems, which operate at a far higher level of computation than regular computers, tend to use larger-scale file systems, putting them at greater risk of data corruption.

Mai Zheng, assistant professor in Electrical and Computer Engineering at Iowa State, just received a $299,999 NSF award to design a system that ensures the safety and reliability of the high-performance computers that may hold this kind of data. The project is a collaboration with the University of North Carolina at Charlotte, with combined funding of $599,681.

“We all have some form of data,” Zheng said. “Much of that data is processed in HPC centers by thousands of machines. Those machines are very complicated, and our project will help to make those large-scale systems more reliable so that we will never lose our data and experiments we are running.”

The team’s project aims to allow HPC systems to be checked and repaired efficiently and consistently to avoid error and corruption.

When an HPC platform experiences a failure, such as a hardware fault or software bug, steps are taken immediately to restore the platform. When all else fails, a checking and repairing program called a file system checker is run as a last resort. Currently, file system checkers are both time-consuming and prone to error.

“We are trying to design a more efficient and more robust checker to fix the problems that can come within large-scale HPC centers,” Zheng said. “The focus of this project is to design a new paradigm for building file system checkers for reliable high performance computing. Checkers are a critical component to bring a system back to the correct state.”

In the past, major power outages at large-scale HPC centers have destroyed large amounts of data. That persistent risk of data loss inspired Zheng to pursue this project.

The success of this project could transform how parallel file system checkers are used, making HPC systems more dependable and secure when holding data.