Where Can Advanced Optimization Methods Help in Deep Learning?
Abstract
Modern neural network models are trained using fairly standard stochastic gradient optimizers, sometimes employing mild preconditioners.
A natural question is whether significant improvements in training speed can be obtained by developing better optimizers.
In this talk I will argue that such improvements are impossible in the vast majority of cases, which explains why this area of research has stagnated. I will go on to identify several situations where improved preconditioners can still deliver significant speedups, including exotic architectures and loss functions, as well as large-batch training.
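To make the terminology concrete, the short sketch below (not taken from the talk) contrasts a plain stochastic gradient step with one that applies a mild diagonal preconditioner in the RMSProp style; the function names, hyperparameter values, and NumPy usage are illustrative assumptions rather than the speaker's method.

    import numpy as np

    def sgd_step(w, g, lr=1e-2):
        # Plain stochastic gradient step: move against the raw gradient.
        return w - lr * g

    def preconditioned_step(w, g, v, lr=1e-2, beta=0.99, eps=1e-8):
        # "Mild" diagonal preconditioning in the RMSProp style: keep a running
        # average of squared gradients and rescale each coordinate by its
        # inverse root, so poorly scaled directions take comparably sized steps.
        v = beta * v + (1.0 - beta) * g * g
        return w - lr * g / (np.sqrt(v) + eps), v

More advanced optimizers replace this coordinate-wise rescaling with richer (for example curvature-based) preconditioners; whether doing so buys a meaningful speedup in practice is the question the talk addresses.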