Heavy-Tailed Large Deviations and Sharp Characterization of Global Dynamics of SGDs in Deep Learning
Abstract
While the typical behaviors of stochastic systems are often deceptively oblivious to the tail distributions of the underlying uncertainties, the ways rare events arise are vastly different depending on whether the underlying tail distributions are light-tailed or heavy-tailed. Roughly speaking, in light-tailed settings, a system-wide rare event arises because everything goes wrong a little bit, as if the entire system had conspired to provoke the rare event (conspiracy principle), whereas, in heavy-tailed settings, a system-wide rare event arises because a small number of components fail catastrophically (catastrophe principle). In the first part of this talk, I will introduce recent developments in the theory of large deviations for heavy-tailed stochastic processes at the sample-path level and rigorously characterize the catastrophe principle for such processes.
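As a rough illustration (my own sketch, not part of the talk, with arbitrary parameters), the two principles can be seen in a small Monte Carlo experiment in Python/NumPy: conditioned on an n-step random walk reaching an atypically large value, the largest single step carries only a small share of a light-tailed walk's total, but most of a heavy-tailed one's.

import numpy as np

# Sketch of the conspiracy vs. catastrophe principles: condition an n-step walk
# on an atypically large sum and record the share contributed by its largest step.
rng = np.random.default_rng(0)
n, trials = 50, 100_000

def largest_step_share(steps, factor=1.4):
    # Keep only the paths whose sum exceeds `factor` times the typical sum.
    sums = steps.sum(axis=1)
    rare = steps[sums > factor * sums.mean()]
    return (rare.max(axis=1) / rare.sum(axis=1)).mean(), len(rare)

light = rng.exponential(scale=1.0, size=(trials, n))   # light-tailed steps, mean 1
heavy = rng.pareto(1.5, size=(trials, n)) + 1.0        # heavy-tailed Pareto steps, infinite variance

for name, steps in [("exponential", light), ("Pareto(1.5)", heavy)]:
    share, count = largest_step_share(steps)
    print(f"{name}: largest step carries about {share:.0%} of a rare large sum ({count} rare paths)")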
The empirical success of deep learning is often attributed to the mysterious ability of stochastic gradient descents (SGDs) to avoid sharp local minima in the loss landscape, as sharp minima are believed to lead to poor generalization. To unravel this mystery and potentially further enhance this capability of SGDs, it is imperative to go beyond the traditional local convergence analysis and obtain a comprehensive understanding of SGDs' global dynamics within complex non-convex loss landscapes. In the second part of this talk, I will characterize the global dynamics of SGDs, building on the heavy-tailed large deviations and local stability framework developed in the first part. This leads to heavy-tailed counterparts of the classical Freidlin-Wentzell and Eyring-Kramers theories. Moreover, we reveal a fascinating phenomenon in deep learning: by injecting and then truncating heavy-tailed noise during the training phase, SGD can almost completely avoid sharp minima and hence achieve better generalization performance on test data.
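To make the noise-injection-and-truncation idea concrete, here is a schematic Python sketch (my own, not the authors' implementation; the Student-t noise and the parameters inject_scale and clip_level are illustrative choices): each stochastic gradient is perturbed by heavy-tailed noise and the perturbed gradient is truncated in norm before the step is taken.

import numpy as np

rng = np.random.default_rng(1)

def perturbed_clipped_sgd_step(w, grad_fn, lr=1e-2, inject_scale=1e-2, clip_level=1.0):
    # Perturb the stochastic gradient with heavy-tailed (Student-t, df=1.5) noise ...
    g = grad_fn(w) + inject_scale * rng.standard_t(df=1.5, size=w.shape)
    # ... and truncate (clip) the perturbed gradient so that single steps stay bounded.
    norm = np.linalg.norm(g)
    if norm > clip_level:
        g *= clip_level / norm
    return w - lr * g

# Toy usage on the gradient of the double-well loss w^4 - w^2.
w = np.array([1.5])
for _ in range(2000):
    w = perturbed_clipped_sgd_step(w, lambda v: 4 * v**3 - 2 * v)
print(w)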
This talk is based on joint work with Mihail Bazhba, Jose Blanchet, Bohan Chen, Sewoong Oh, Zhe Su, Xingyu Wang, and Bert Zwart.
Equivariantly O2-stable actions: classification and range of the invariant
Abstract
One possible version of the Kirchberg-Phillips theorem states that simple, separable, nuclear, purely infinite C*-algebras are classified by KK-theory. In order to generalize this result to non-simple C*-algebras, Kirchberg first restricted his attention to those that absorb the Cuntz algebra O2 tensorially. C*-algebras in this class carry no KK-theoretical information in a strong sense, and they are classified by their ideal structure alone. It should be mentioned that, although this result is in Kirchberg's work, its full proof was first published by Gabe. In joint work with Gábor Szabó, we showed a generalization of Kirchberg's O2-stable theorem that classifies G-C*-algebras up to cocycle conjugacy, where G is any second-countable, locally compact group. In our main result, we assume that actions are amenable, sufficiently outer, and absorb the trivial action on O2 up to cocycle conjugacy. In very recent work, I moreover show that the range of the classification invariant, consisting of a topological dynamical system over primitive ideals, is exhausted for any second-countable, locally compact group.
In this talk, I will recall the classification of O2-stable C*-algebras and describe their classification invariant. Subsequently, I will give a short introduction to the C*-dynamical framework we work in and present the classification result for equivariant O2-stable actions. Time permitting, I will give an idea of how one can build a C*-dynamical system within the scope of our classification with a prescribed invariant.
Transportation Cost Spaces and their embeddings in L_1 spaces
Abstract
Transportation cost spaces are of high theoretical interest, and they are also fundamental in applications across many areas of applied mathematics, engineering, physics, computer science, finance, and the social sciences. Obtaining low-distortion embeddings of transportation cost spaces into L_1 has become important for the problem of finding nearest points, an important research subject in theoretical computer science. After introducing these spaces, we will present some results on upper and lower estimates of the distortion of embeddings of transportation cost spaces into L_1.
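For orientation (this formula is my addition, not part of the abstract): over a metric space (M, d), the transportation cost space consists of the finitely supported signed measures of total mass zero, normed by the optimal cost of transporting the positive part onto the negative part, and the distortion of a bi-Lipschitz embedding T into L_1 is the usual product of Lipschitz constants:

\[
\|\mu\|_{\mathrm{TC}} = \inf\Big\{ \sum_i |a_i|\, d(x_i, y_i) \;:\; \mu = \sum_i a_i\,(\delta_{x_i} - \delta_{y_i}) \Big\},
\qquad
\mathrm{dist}(T) = \mathrm{Lip}(T)\cdot \mathrm{Lip}(T^{-1}).
\]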