As parallel computers approach exascale (10^18 floating-point operations per second), processor failures and data corruption are of increasing concern. Numerical linear algebra solvers are at the heart of many scientific and engineering applications, and as failure rates increase, they may fail to compute a solution or may produce an incorrect one. It is therefore crucial to develop novel parallel linear algebra solvers capable of providing correct solutions on unreliable computing systems. The common way to mitigate failures in high-performance computing systems is to periodically save data to a reliable storage device such as a remote disk. However, given the increasing failure rate and the ever-growing volume of data involved in numerical simulations, these state-of-the-art fault-tolerant strategies are becoming too time-consuming and therefore unsuitable for large-scale simulations. In this talk, we will present a novel class of fault-tolerant algorithms that do not require any additional resources. The key idea is to leverage knowledge of the numerical properties of the solvers involved in a simulation to regenerate data lost to system failures. We will also share the lessons learned and report on the numerical properties and performance of the new resilience algorithms.
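The abstract does not say which algorithms the talk covers, so the following is only a minimal sketch of one way the key idea could look in practice, under stated assumptions. It assumes a linear system A x = b in which the entries of the iterate x indexed by a set I are lost, while A, b, and the remaining entries x_J survive. The lost entries are then rebuilt by solving the small local system A_II x_I = b_I - A_IJ x_J, with no checkpoint involved. The function names (`solve_dense`, `interpolate_lost`, `jacobi_step`), the Jacobi solver, and the test matrix are all hypothetical choices made for illustration.

```python
def solve_dense(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        # Bring the largest remaining entry of this column onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - s) / aug[i][i]
    return y


def interpolate_lost(A, b, x, lost):
    """Rebuild x[i] for i in `lost` from A_II x_I = b_I - A_IJ x_J.

    Only the surviving entries of x are read; the lost ones are ignored.
    """
    lost_set = set(lost)
    kept = [j for j in range(len(b)) if j not in lost_set]
    A_II = [[A[i][k] for k in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    x_new = x[:]
    for i, value in zip(lost, solve_dense(A_II, rhs)):
        x_new[i] = value
    return x_new


def jacobi_step(A, b, x):
    """One Jacobi sweep (stands in for whatever solver the simulation uses)."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


# Assumed test problem: a strictly diagonally dominant tridiagonal matrix,
# with b chosen so that the exact solution is all ones.
n = 6
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
b = [sum(row) for row in A]

x = [0.0] * n
for _ in range(10):
    x = jacobi_step(A, b, x)

# Simulated failure: entries 2 and 3 of the iterate are lost.
lost = [2, 3]
damaged = [float("nan") if i in lost else v for i, v in enumerate(x)]

# Recovery uses only A, b, and the surviving entries.
recovered = interpolate_lost(A, b, damaged, lost)
print("recovered entries:", [recovered[i] for i in lost])
```

In this sketch the recovered iterate satisfies the equations of the lost rows exactly (up to rounding), so the solver can continue from it rather than restarting from scratch. A direct local solve is used only because the lost block is tiny here.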
- Computational Mathematics and Applications Seminar