Date: Thu, 06 Jun 2019
Time: 14:00 - 15:00
Location: Rutherford Appleton Laboratory, nr Didcot
Speaker: Dr Mawussi Zounon
Organisation: Numerical Algorithms Group & University of Manchester

As parallel computers approach exascale (10^18 floating-point operations per second), processor failures and data corruption are of increasing concern. Numerical linear algebra solvers lie at the heart of many scientific and engineering applications, and as failure rates rise they may fail to compute a solution or may return an incorrect one. It is therefore crucial to develop novel parallel linear algebra solvers capable of providing correct solutions on unreliable computing systems. The common way to mitigate failures in high-performance computing systems is to periodically save data to a reliable storage device such as a remote disk. However, given the increasing failure rate and the ever-growing volume of data involved in numerical simulations, these state-of-the-art fault-tolerance strategies are becoming too time-consuming and are therefore unsuitable for large-scale simulations. In this talk, we will present a novel class of fault-tolerant algorithms that do not require any additional resources. The key idea is to exploit the numerical properties of the solvers involved in a simulation to regenerate data lost to system failures. We will also share lessons learned and report on the numerical properties and performance of the new resilience algorithms.
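
The abstract does not say which recovery scheme the talk uses, so the following is only an illustrative sketch of the general idea of regenerating lost data from the numerical problem itself rather than from a checkpoint. It assumes a linear system Ax = b in which some entries of the current iterate x are lost, and rebuilds them by solving the small local system A_II x_I = b_I - A_IJ x_J from the surviving entries. The function names `solve_dense` and `interpolate_lost_entries`, the toy matrix, and the choice of lost indices are all hypothetical.

```python
def solve_dense(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        # Pick the largest pivot in this column for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    y = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (aug[r][n] - s) / aug[r][r]
    return y


def interpolate_lost_entries(A, b, x, lost):
    """Rebuild x[i] for i in `lost` by solving A_II x_I = b_I - A_IJ x_J.

    Hypothetical helper: assumes A and b survived the failure and that the
    diagonal block A_II is nonsingular.
    """
    lost = list(lost)
    lost_set = set(lost)
    kept = [j for j in range(len(x)) if j not in lost_set]
    A_II = [[A[i][k] for k in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    x_new = x[:]
    for i, value in zip(lost, solve_dense(A_II, rhs)):
        x_new[i] = value
    return x_new


# Toy problem (assumed): tridiagonal matrix with a known solution.
n = 8
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
x_true = [float(i + 1) for i in range(n)]
b = [sum(A[i][j] * x_true[j] for j in range(n)) for i in range(n)]

# Simulate a failure that wipes entries 2..4 of the iterate.
lost = [2, 3, 4]
x_damaged = x_true[:]
for i in lost:
    x_damaged[i] = 0.0

x_recovered = interpolate_lost_entries(A, b, x_damaged, lost)
max_error = max(abs(u - v) for u, v in zip(x_recovered, x_true))
```

In this toy case the surviving entries are exact, so the regenerated entries match the true solution up to rounding. When the surviving entries come from an unconverged iterate, the rebuilt entries are only an approximation from which the solver would continue iterating.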

Last updated on 03 Apr 2022 01:32.