Forthcoming events in this series


Mon, 26 Feb 2024

14:00 - 15:00
Lecture Room 3

Fantastic Sparse Neural Networks and Where to Find Them

Dr Shiwei Liu
(Mathematical Institute, University of Oxford)
Abstract

Sparse neural networks, in which a substantial fraction of the components are eliminated, have proven versatile for model compression, robustness improvement, and overfitting mitigation. However, traditional methods for obtaining such sparse networks usually start from a fully pre-trained dense model. As foundation models become prevalent, the cost of this pre-training step can be prohibitive. On the other hand, training intrinsically sparse neural networks from scratch usually yields inferior performance compared to their dense counterparts.

In this talk, I will present a series of approaches for obtaining such fantastic sparse neural networks by training from scratch, without any dense pre-training steps: dynamic sparse training, static sparse training with random pruning, and mask-only training. First, I will introduce the concept of in-time over-parameterization (ITOP) (ICML 2021), which enables training sparse neural networks from scratch (commonly known as sparse training) to attain the full accuracy of dense models. By dynamically exploring new sparse topologies during training, we avoid the costly necessity of pre-training and re-training, requiring only a single training run to obtain strong sparse neural networks. Second, because ITOP incurs additional overhead from its frequent changes of sparse topology, our follow-up work (ICLR 2022) demonstrates that even a naïve static sparse network produced by random pruning can be trained to match dense-model performance, provided the model is sufficiently large. Finally, I will discuss how we can push training efficiency further by learning only masks at initialization, without any weight updates, which addresses the over-smoothing challenge in building deep graph neural networks (LoG 2022).
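To make the first ingredient concrete, here is a minimal, hedged sketch of the prune-and-regrow loop that underlies dynamic sparse training, written for a toy least-squares layer rather than the ITOP implementation; the layer sizes, density, update interval, and drop fraction are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out, density = 64, 32, 0.1
X = rng.normal(size=(n_in, 256))            # toy inputs
Y = rng.normal(size=(n_out, 256))           # toy targets
W = rng.normal(size=(n_out, n_in)) * 0.1
mask = rng.random(W.shape) < density        # random initial sparse topology

def prune_and_regrow(W, mask, frac=0.3):
    """Drop the smallest-magnitude active weights, regrow at random inactive
    positions: the topology exploration at the heart of dynamic sparse training."""
    active = np.flatnonzero(mask)
    k = int(frac * active.size)
    drop = active[np.argsort(np.abs(W.flat[active]))[:k]]
    mask.flat[drop] = False
    W.flat[drop] = 0.0
    grow = rng.choice(np.flatnonzero(~mask), size=k, replace=False)
    W.flat[grow] = 0.0                      # regrown weights restart at zero
    mask.flat[grow] = True
    return W, mask

for step in range(2000):
    G = (W * mask @ X - Y) @ X.T / X.shape[1]   # least-squares gradient
    W -= 0.01 * G * mask                        # only active weights updated
    if step % 100 == 99:
        W, mask = prune_and_regrow(W, mask)     # explore a new topology
```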

Mon, 19 Feb 2024

14:00 - 15:00
Lecture Room 3

This seminar has been cancelled

Mihai Badiu
(Department of Engineering Science, University of Oxford)
Abstract

Data with an intrinsic network structure can be found in various contexts, including social networks, biological systems (e.g., protein-protein interactions, neuronal networks), information networks (computer networks, wireless sensor networks), economic networks, etc. As the amount of graph data being generated grows ever larger, compressing such data for storage, transmission, or efficient processing has become a topic of interest.

In this talk, I will give an information-theoretic perspective on graph compression. The focus will be on compression limits and their scaling with the size of the graph. For lossless compression, the Shannon entropy gives the fundamental lower limit on the expected length of any compressed representation. I will discuss the entropy of some common random graph models, with a particular emphasis on our results on the random geometric graph model. Then, I will talk about the problem of compressing a graph with side information, i.e., when an additional correlated graph is available at the decoder. Turning to lossy compression, where one accepts a certain amount of distortion between the original and reconstructed graphs, I will present theoretical limits to lossy compression that we obtained for the Erdős–Rényi and stochastic block models using rate-distortion theory.
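As a small illustration of the lossless baseline mentioned above, here is a hedged back-of-the-envelope script for the Erdős–Rényi model G(n, p), whose edges are independent Bernoulli(p) variables; this is the textbook calculation, not the speaker's random-geometric-graph results.

```python
import math

def erdos_renyi_entropy_bits(n, p):
    """Shannon entropy of G(n, p): each of the C(n,2) possible edges is an
    independent Bernoulli(p), so H = C(n,2) * h(p) bits -- the lower limit
    on the expected length of any lossless code for such graphs."""
    if p in (0.0, 1.0):
        return 0.0
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # binary entropy h(p)
    return math.comb(n, 2) * h

# e.g. a 1000-node graph with edge probability 0.01 needs about 5 kB
# on average under any lossless representation
print(erdos_renyi_entropy_bits(1000, 0.01) / 8 / 1000, "kB")
```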

Mon, 12 Feb 2024

14:00 - 15:00
Lecture Room 3

Do Stochastic, Feel Noiseless: Stable Optimization via a Double Momentum Mechanism

Kfir Levy
(Technion – Israel Institute of Technology)
Abstract

The tremendous success of the Machine Learning paradigm heavily relies on the development of powerful optimization methods, and the canonical algorithm for training learning models is SGD (Stochastic Gradient Descent). Nevertheless, the latter is quite different from Gradient Descent (GD), its noiseless counterpart. Concretely, SGD requires a careful choice of the learning rate, which depends on the properties of the noise as well as the quality of initialization.

It further requires the use of a test set to estimate the generalization error throughout its run. In this talk, we will present a new SGD variant that obtains the same optimal rates as SGD, while using noiseless machinery as in GD. Concretely, it enables using the same fixed learning rate as GD and does not require a test/validation set. Curiously, our results rely on a novel gradient estimate that combines two recent mechanisms related to the notion of momentum.

Finally, as time permits, I will discuss several applications where our method can be extended.
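The abstract does not spell out the estimator, so the following is only a generic sketch of a momentum-corrected gradient estimate in the spirit described; it uses a STORM-type recursive correction (one of several momentum mechanisms in the recent literature) on a toy quadratic, and should not be read as the speaker's exact method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stochastic quadratic: f(x) = 0.5 ||x||^2, observed gradient x + noise.
# STORM-style recursive momentum: the estimate d reuses the previous estimate,
# corrected by the gradient difference at the SAME fresh sample, so much of
# the noise cancels and a fixed, GD-like step size can be used.
dim, lr, alpha = 10, 0.1, 0.1
x = np.ones(dim)
d = x + 0.1 * rng.normal(size=dim)           # initial stochastic gradient
for t in range(500):
    x_new = x - lr * d
    z = rng.normal(size=dim)                 # one fresh noise sample
    g_new = x_new + 0.1 * z                  # stochastic gradient at new iterate
    g_old = x + 0.1 * z                      # same sample, old iterate
    d = g_new + (1 - alpha) * (d - g_old)    # corrected momentum estimate
    x = x_new
print(np.linalg.norm(x))                     # near the noiseless optimum at 0
```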

Mon, 05 Feb 2024

14:00 - 15:00
Lecture Room 3

Exploiting Symmetries for Learning in Deep Weight Spaces

Haggai Maron
(NVIDIA)
Abstract

Learning to process and analyze the raw weight matrices of neural networks is an emerging research area with intriguing potential applications like editing and analyzing Implicit Neural Representations (INRs), weight pruning/quantization, and function editing. However, weight spaces have inherent permutation symmetries – permutations can be applied to the weights of an architecture, yielding new weights that represent the same function. As with other data types like graphs and point clouds, these symmetries make learning in weight spaces challenging.

This talk will overview recent advances in designing architectures that can effectively operate on weight spaces while respecting their underlying symmetries. First, we will discuss our ICML 2023 paper, which introduces novel equivariant architectures for learning on multilayer perceptron weight spaces. We first characterize all linear layers that are equivariant to these symmetries and then construct networks composed of such layers. We then turn to our ICLR 2024 work, which generalizes the approach to diverse network architectures using what we term Graph Metanetworks (GMN). This is done by representing input networks as graphs and processing them with graph neural networks. We show that the resulting metanetworks are expressive and equivariant to the weight-space symmetries of the architecture being processed. Our graph metanetworks are applicable to CNNs, attention layers, normalization layers, and more. Together, these works make promising steps toward versatile and principled architectures for weight-space learning.
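The permutation symmetry at the heart of this line of work is easy to verify numerically; a minimal sketch with arbitrary toy layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# A two-layer MLP: f(x) = W2 @ relu(W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)
relu = lambda z: np.maximum(z, 0)
f = lambda x, W1, b1, W2, b2: W2 @ relu(W1 @ x + b1) + b2

# Permuting the hidden units (rows of W1 and b1, columns of W2) gives
# different weights that realize exactly the same function -- the symmetry
# that weight-space architectures must respect.
P = np.eye(16)[rng.permutation(16)]
x = rng.normal(size=8)
out1 = f(x, W1, b1, W2, b2)
out2 = f(x, P @ W1, P @ b1, W2 @ P.T, b2)
print(np.allclose(out1, out2))   # True
```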

Mon, 29 Jan 2024

14:00 - 15:00
Lecture Room 3

Infectious diseases and their control - a modelling perspective

Samir Bhatt
(University of Copenhagen & Imperial College London)
Abstract

The COVID-19 pandemic has brought a spotlight to the field of infectious disease modelling, prompting widespread public awareness and understanding of its intricacies. As a result, many individuals now possess a basic familiarity with the principles and methodologies involved in studying the spread of diseases. In this presentation, I aim to deliver a somewhat comprehensive (and hopefully engaging) overview of the methods employed in infectious disease modelling, placing them within the broader context of their significance for government and public health policy.

I will navigate through applications of Spatial Statistics, Branching Processes, and Binary Trees in modelling infectious diseases, with a particular emphasis on integrating machine learning methods into these areas. The goal of this presentation is to take you on a broad tour of methods and their applications, offering a personal perspective by highlighting examples from my recent work.
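As a taste of one of the ingredients mentioned above, here is a hedged sketch of a Galton-Watson branching process for the early phase of an epidemic; the offspring distribution and parameters are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def outbreak_size(R0, max_gen=50, cap=100_000):
    """Galton-Watson branching process: each case infects Poisson(R0) others.
    A crude but standard model for the early phase of an epidemic."""
    cases, total = 1, 1
    for _ in range(max_gen):
        if cases == 0 or total > cap:   # extinct, or clearly a large outbreak
            break
        cases = rng.poisson(R0 * cases) # next generation of infections
        total += cases
    return total

# With R0 < 1 outbreaks die out; with R0 > 1 a fraction of them take off.
for R0 in (0.8, 1.5):
    sizes = [outbreak_size(R0) for _ in range(1000)]
    print(R0, np.mean([s > 1000 for s in sizes]))   # share of large outbreaks
```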

Mon, 22 Jan 2024

14:00 - 15:00
Lecture Room 3

Kernel Limit of Recurrent Neural Networks Trained on Ergodic Data Sequences

Prof. Justin Sirignano
(Mathematical Institute, University of Oxford)
Abstract

Mathematical methods are developed to characterize the asymptotics of recurrent neural networks (RNN) as the number of hidden units, data samples in the sequence, hidden state updates, and training steps simultaneously grow to infinity. In the case of an RNN with a simplified weight matrix, we prove the convergence of the RNN to the solution of an infinite-dimensional ODE coupled with the fixed point of a random algebraic equation. The analysis requires addressing several challenges which are unique to RNNs. In typical mean-field applications (e.g., feedforward neural networks), discrete updates are of magnitude O(1/N) and the number of updates is O(N). Therefore, the system can be represented as an Euler approximation of an appropriate ODE/PDE, which it will converge to as N → ∞. However, the RNN hidden layer updates are O(1). Therefore, RNNs cannot be represented as a discretization of an ODE/PDE and standard mean-field techniques cannot be applied. Instead, we develop a fixed point analysis for the evolution of the RNN memory state, with convergence estimates in terms of the number of update steps and the number of hidden units. The RNN hidden layer is studied as a function in a Sobolev space, whose evolution is governed by the data sequence (a Markov chain), the parameter updates, and its dependence on the RNN hidden layer at the previous time step. Due to the strong correlation between updates, a Poisson equation must be used to bound the fluctuations of the RNN around its limit equation. These mathematical methods allow us to prove a neural tangent kernel (NTK) limit for RNNs trained on data sequences as the number of data samples and size of the neural network grow to infinity.

Mon, 15 Jan 2024

14:00 - 15:00
Lecture Room 3

On sketches and corruptions: devising adaptive randomized iterative methods for large linear systems

Elizaveta Rebrova
(Princeton University, NJ)
Abstract

When the data is large, or comes in a streaming way, randomized iterative methods provide an efficient way to solve a variety of problems, including solving linear systems, finding least-squares solutions, and solving feasibility problems, among others. The randomized Kaczmarz algorithm for solving over-determined linear systems is one of the popular choices due to its efficiency and its simple, geometrically intuitive iterative steps. In challenging cases, for example when the system is ill-conditioned or some of the equations contain large corruptions, the geometry can also be helpful in augmenting the solver in the right way. I will discuss our recent work with Michal Derezinski and Jackie Lok on Kaczmarz-based algorithms that use external knowledge about the linear system to (a) accelerate the convergence of iterative solvers, and (b) enable convergence in the highly corrupted regime.
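For reference, the basic randomized Kaczmarz iteration is short enough to state in full; this sketch solves a clean (uncorrupted) over-determined system, with rows sampled proportionally to their squared norms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomized Kaczmarz: project the iterate onto the hyperplane of one
# equation a_i . x = b_i, sampling rows with probability ~ ||a_i||^2.
A = rng.normal(size=(500, 50))
x_true = rng.normal(size=50)
b = A @ x_true

probs = (A**2).sum(axis=1)
probs /= probs.sum()
x = np.zeros(50)
for _ in range(5000):
    i = rng.choice(A.shape[0], p=probs)
    a = A[i]
    x += (b[i] - a @ x) / (a @ a) * a   # orthogonal projection onto row i
print(np.linalg.norm(x - x_true))        # small: converges to the solution
```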


Mon, 27 Nov 2023

14:00 - 15:00
Lecture Room 6

Towards Reliable Solutions of Inverse Problems with Deep Learning

Prof. Matthias Ehrhardt
(University of Bath)
Abstract

Deep learning has revolutionised many scientific fields and so it is no surprise that state-of-the-art solutions to several inverse problems also include this technology. However, for many inverse problems (e.g. in medical imaging) stability and reliability are particularly important.

Furthermore, unlike other image analysis tasks, usually only a fairly small amount of training data is available to train image reconstruction algorithms.

Thus, we require tailored solutions which maximise the potential of all ingredients: data, domain knowledge, and mathematical analysis. In this talk we discuss a range of such hybrid approaches and, along the way, encounter connections to various topics such as generative models, convex optimization, differential equations, and equivariance.

Mon, 20 Nov 2023

14:00 - 15:00
Lecture Room 6

Meta Optimization

Prof. Elad Hazan
(Princeton University and Google DeepMind)
Abstract

How can we find and apply the best optimization algorithm for a given problem? This question is as old as mathematical optimization itself, and is notoriously hard: even special cases, such as finding the optimal learning rate for gradient descent, are nonconvex in general.

In this talk we will discuss a dynamical systems approach to this question. We start by discussing an emerging paradigm in differentiable reinforcement learning called “online nonstochastic control”. The new approach applies techniques from online convex optimization and convex relaxations to obtain new methods with provable guarantees for classical settings in optimal and robust control. We then show how this methodology can yield global guarantees for learning the best algorithm in certain cases of stochastic and online optimization. 

No background is required for this talk, but relevant material can be found in a recent text on online control and a paper on meta optimization.



Mon, 13 Nov 2023

14:00 - 15:00
Lecture Room 6

No Seminar

TBA
Abstract

TBA

Mon, 23 Oct 2023

14:00 - 15:00
Lecture Room 6

Tractable Riemannian Optimization via Randomized Preconditioning and Manifold Learning

Boris Shustin
(Mathematical Institute, University of Oxford)
Abstract

Optimization problems constrained to manifolds are prevalent across science and engineering. For example, they arise in (generalized) eigenvalue problems, principal component analysis, and low-rank matrix completion, to name a few. Riemannian optimization is a principled framework for solving optimization problems where the desired optimum is constrained to a (Riemannian) manifold. Algorithms designed in this framework usually require some geometrical description of the manifold, i.e., tangent spaces, retractions, Riemannian gradients, and Riemannian Hessians of the cost function. However, in some cases, some of these geometric components cannot be accessed due to intractability or lack of information.

In this talk, we present methods that allow for overcoming cases of intractability and lack of information. We demonstrate the case of intractability on canonical correlation analysis (CCA) and on Fisher linear discriminant analysis (FDA). Using Riemannian optimization to solve CCA or FDA with the standard geometric components is as expensive as solving them via a direct solver. We address this shortcoming using a technique called Riemannian preconditioning, which amounts to changing the Riemannian metric on the constraining manifold. We use randomized numerical linear algebra to form efficient preconditioners that balance the computational costs of the geometric components and the asymptotic convergence of the iterative methods. If time permits, we also show the case of lack of information, e.g., when the constraining manifold can be accessed only through samples from it. We propose a novel approach that allows approximate Riemannian optimization using a manifold learning technique.
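A minimal sketch of the geometric components listed above (Riemannian gradient via tangent-space projection, retraction via normalization), for the textbook case of the Rayleigh quotient on the unit sphere; this is a standard example, not the preconditioned CCA/FDA solvers of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Riemannian gradient descent on the unit sphere for f(x) = x' A x.
# Minimizing f over the sphere recovers the smallest eigenvector of A.
n = 100
M = rng.normal(size=(n, n))
A = M + M.T                                  # symmetric test matrix

x = rng.normal(size=n)
x /= np.linalg.norm(x)                       # start on the manifold
for _ in range(2000):
    egrad = 2 * A @ x                        # Euclidean gradient
    rgrad = egrad - (x @ egrad) * x          # project onto the tangent space
    x = x - 0.01 * rgrad
    x /= np.linalg.norm(x)                   # retract back to the sphere
# Rayleigh quotient approaches the smallest eigenvalue
print(x @ A @ x, np.linalg.eigvalsh(A)[0])
```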


Mon, 09 Oct 2023

14:00 - 15:00
Lecture Room 6

Mathematics of transfer learning and transfer risk: from medical to financial data analysis

Prof. Xin Guo
(University of California Berkeley)
Abstract

Transfer learning is an emerging and popular paradigm for utilizing existing knowledge from previous learning tasks to improve the performance of new ones. In this talk, we will first present transfer learning in the early diagnosis of eye diseases: diabetic retinopathy and retinopathy of prematurity.

We will discuss how this empirical study leads to the mathematical analysis of feasibility and transferability issues in transfer learning. We show how a mathematical framework for the general procedure of transfer learning helps establish the feasibility of transfer learning as well as the analysis of the associated transfer risk, with applications to financial time series data.
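As a caricature of the transfer procedure being analysed (not the speakers' framework): freeze features from a "source" model and fit only a lightweight head on scarce target data. The random feature extractor below is a stand-in for a pretrained network; all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

W_pretrained = rng.normal(size=(256, 32))          # stand-in for "pretrained" features
features = lambda X: np.tanh(X @ W_pretrained.T)   # frozen feature extractor

# Small labeled target task
X_target = rng.normal(size=(40, 32))
y_target = rng.normal(size=40)

# Fit only the head by least squares: cheap, and reuses source knowledge.
Phi = features(X_target)
head, *_ = np.linalg.lstsq(Phi, y_target, rcond=None)
predict = lambda X: features(X) @ head
print(predict(X_target[:3]))
```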

Mon, 19 Jun 2023

14:00 - 15:00
Lecture Room 6

ScreeNOT: Optimal Singular Value Thresholding and Principal Component Selection in Correlated Noise

Elad Romanov
Abstract

Principal Component Analysis (PCA) is a fundamental and ubiquitous tool in statistics and data analysis.
The bare-bones idea is this. Given a data set of n points y_1, ..., y_n, form their sample covariance S. Eigenvectors corresponding to large eigenvalues--namely directions along which the variation within the data set is large--are usually thought of as "important" or "signal-bearing"; in contrast, weak directions are often interpreted as "noise", and discarded in subsequent steps of the data analysis pipeline. Principal component (PC) selection is an important methodological question: how large should an eigenvalue be so as to be considered "informative"?
Our main deliverable is ScreeNOT: a novel, mathematically grounded procedure for PC selection. It is intended as a fully algorithmic replacement for the heuristic and somewhat vaguely defined procedures that practitioners often use--for example the popular "scree test".
Towards tackling PC selection systematically, we model the data matrix as a low-rank signal plus noise matrix Y = X + Z; accordingly, PC selection is cast as an estimation problem for the unknown low-rank signal matrix X, with the class of permissible estimators being singular value thresholding rules. We consider a formulation of the problem under the spiked model. This asymptotic setting captures some important qualitative features observed across numerous real-world data sets: most of the singular values of Y are arranged neatly in a "bulk", with very few large outlying singular values exceeding the bulk edge. We propose an adaptive algorithm that, given a data matrix, finds the optimal truncation threshold in a data-driven manner under essentially arbitrary noise conditions: we only require that Z has a compactly supported limiting spectral distribution--which may be a priori unknown. Under the spiked model, our algorithm is shown to have rather strong oracle optimality properties: not only does it attain the best error asymptotically, but it also achieves (w.h.p.) the best error--compared to all alternative thresholds--at finite n.

This is joint work with Matan Gavish (Hebrew University of Jerusalem) and David Donoho (Stanford). 
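A hedged sketch of the singular value thresholding estimator family within which ScreeNOT operates; ScreeNOT's contribution is the data-driven choice of the threshold, which here is simply passed in by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

def svt(Y, tau):
    """Hard singular value thresholding: keep components with s_i > tau."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    k = int((s > tau).sum())
    return U[:, :k] * s[:k] @ Vt[:k]

# Low-rank signal plus noise, as in the Y = X + Z model above
n = 200
X = 10 * np.outer(rng.normal(size=n), rng.normal(size=n)) / n   # rank-1 signal
Z = rng.normal(size=(n, n)) / np.sqrt(n)                        # bulk noise
Y = X + Z
# The bulk edge for this noise is ~2; threshold slightly above it.
Xhat = svt(Y, tau=2.2)
print(np.linalg.norm(Xhat - X) / np.linalg.norm(X))             # small error
```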

Mon, 12 Jun 2023

14:00 - 15:00
Lecture Room 6

Group-invariant tensor train networks for supervised learning

Nick Vannieuwenhoven
Abstract

Invariance under selected transformations has recently proven to be a powerful inductive bias in several machine learning models. One class of such models are tensor train networks. In this talk, we impose invariance relations on tensor train networks. We introduce a new numerical algorithm to construct a basis of tensors that are invariant under the action of normal matrix representations of an arbitrary discrete group. This method can be up to several orders of magnitude faster than previous approaches. The group-invariant tensors are then combined into a group-invariant tensor train network, which can be used as a supervised machine learning model. We applied this model to a protein binding classification problem, taking into account problem-specific invariances, and obtained prediction accuracy in line with state-of-the-art invariant deep learning approaches. This is joint work with Brent Sprangers.
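A hedged sketch of the underlying notion of a group-invariant tensor, obtained here by brute-force averaging over the group action; the paper's contribution is precisely a much faster construction than this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Group-invariant tensors by Reynolds averaging: average a tensor over the
# group action, here the cyclic group C_4 acting on each mode by a
# permutation matrix. Mode size and order are illustrative.
n, k = 4, 3                                   # mode size, number of modes
P = np.roll(np.eye(n), 1, axis=0)             # generator: cyclic shift
group = [np.linalg.matrix_power(P, j) for j in range(n)]

def act(g, T):
    """Apply the matrix g to every mode of a k-way tensor T."""
    for mode in range(T.ndim):
        T = np.moveaxis(np.tensordot(g, np.moveaxis(T, mode, 0), axes=1), 0, mode)
    return T

T = rng.normal(size=(n,) * k)
T_inv = sum(act(g, T) for g in group) / len(group)   # Reynolds averaging

# Invariance check: acting with any group element leaves T_inv unchanged
print(np.allclose(act(group[1], T_inv), T_inv))      # True
```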

Mon, 05 Jun 2023

14:00 - 15:00
Lecture Room 6

Embedded Deep Learning for Prediction and Control of Complex Turbulent Flows

Professor Jonathan F. MacArt
Abstract

Accurately predicting turbulent fluid mechanics remains a significant challenge in engineering and applied science. Reynolds-Averaged Navier–Stokes (RANS) simulations and Large-Eddy Simulation (LES) are generally accurate, though non-Boussinesq turbulence and/or unresolved multiphysical phenomena can preclude predictive accuracy in certain regimes. In turbulent combustion, flame–turbulence interactions lead to inverse-cascade energy transfer, which violates the assumptions of many RANS and LES closures. We survey the regime dependence of these effects using a series of high-resolution Direct Numerical Simulations (DNS) of turbulent jet flames, from which an intermediate regime of heat-release effects, associated with the hypothesis of an “active cascade,” is apparent, with severe implications for physics- and data-driven closure models. We apply an adjoint-based data assimilation method to augment the RANS and LES equations using trusted (though not necessarily high-fidelity) data. This uses a Python-native flow solver that leverages differentiable-programming techniques, automatic construction of adjoint equations, and solver-in-the-loop optimization. Applications to canonical turbulence, shock-dominated flows, aerodynamics, and flow control are presented, and opportunities for reacting flow modeling are discussed.

Mon, 24 Apr 2023

14:00 - 15:00
Lecture Room 6

Fundamental limits of generative AI

Helmut Bölcskei
(ETH Zurich)
Abstract


Generative AI has seen tremendous successes recently, most notably the chatbot ChatGPT and the DALLE2 software creating realistic images and artwork from text descriptions. Underlying these and other generative AI systems are usually neural networks trained to produce text, images, audio, or video from text inputs. The aim of this talk is to develop an understanding of the fundamental capabilities of generative neural networks. Specifically and mathematically speaking, we consider the realization of high-dimensional random vectors from one-dimensional random variables through deep neural networks. The resulting random vectors follow prescribed conditional probability distributions, where the conditioning represents the text input of the generative system and its output can be text, images, audio, or video. It is shown that every d-dimensional probability distribution can be generated through deep ReLU networks out of a 1-dimensional uniform input distribution. What is more, this is possible without incurring a cost—in terms of approximation error as measured in Wasserstein distance—relative to generating the d-dimensional target distribution from d independent random variables. This is enabled by a space-filling approach which realizes a Wasserstein-optimal transport map and elicits the importance of network depth in driving the Wasserstein distance between the target distribution and its neural network approximation to zero.

Finally, we show that the number of bits needed to encode the corresponding generative networks equals the fundamental limit for encoding probability distributions (by any method) as dictated by quantization theory according to Graf and Luschgy. This result also characterizes the minimum amount of information that needs to be extracted from training data so as to be able to generate a desired output at a prescribed accuracy, and establishes that generative ReLU networks can attain this minimum.

This is joint work with D. Perekrestenko and L. Eberhard.
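The classical 1-dimensional building block behind such results is inverse-transform sampling: any scalar distribution can be produced from a uniform input by its inverse CDF, a map that deep ReLU networks can approximate. A minimal sketch with an exponential target:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverse-transform sampling: if U ~ Uniform(0,1) and F is the target CDF,
# then F^{-1}(U) has distribution F. Generative networks effectively learn
# approximations (and space-filling extensions) of such transport maps.
U = rng.random(100_000)
samples = -np.log(1 - U)                 # inverse CDF of Exponential(1)
print(samples.mean(), samples.var())     # both ~1 for Exponential(1)
```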




Mon, 27 Mar 2023

14:00 - 15:00
Lecture Room 6

No Seminar

Tbc
Abstract

Tbc

Mon, 06 Mar 2023

14:00 - 15:00
L6

A Matrix-Mimetic Tensor Algebra for Optimal Representations of Multiway Data

Elizabeth Newman
(Emory University)
Abstract

The data revolution has changed the landscape of computational mathematics and has increased the demand for new numerical linear algebra tools to handle the vast amount of data. One crucial task is data compression to capture the inherent structure of data efficiently. Tensor-based approaches have gained significant traction in this setting by exploiting multilinear relationships in multiway data. In this talk, we will describe a matrix-mimetic tensor algebra that offers provably optimal compressed representations of high-dimensional data. We will compare this tensor-algebraic approach to other popular tensor decomposition techniques and show that our approach offers both theoretical and numerical advantages.
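One widely used matrix-mimetic algebra is the t-product of Kilmer and Martin, in which third-order tensors multiply facewise after a transform along the third mode; whether the talk uses this exact product or a generalisation is not specified here, but it conveys the idea.

```python
import numpy as np

def t_product(A, B):
    """t-product of third-order tensors: FFT along the third mode, facewise
    matrix products in the transform domain, inverse FFT back. Under this
    product, tensors behave like matrices (with analogues of SVD, QR, ...)."""
    Ah = np.fft.fft(A, axis=2)
    Bh = np.fft.fft(B, axis=2)
    Ch = np.einsum('ijk,jlk->ilk', Ah, Bh)   # one matrix product per face
    return np.real(np.fft.ifft(Ch, axis=2))

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4, 5))
B = rng.normal(size=(4, 2, 5))
C = t_product(A, B)                          # shape (3, 2, 5), like (3x4)(4x2)
print(C.shape)
```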

Mon, 20 Feb 2023

14:00 - 15:00
L6

Gradient flows and randomised thresholding: sparse inversion and classification

Jonas Latz
(Heriot-Watt University, Edinburgh)
Abstract

Sparse inversion and classification problems are ubiquitous in modern data science and imaging. They are often formulated as non-smooth minimisation problems. In sparse inversion, we minimise, e.g., the sum of a data fidelity term and an L1/LASSO regulariser. In classification, we consider, e.g., the sum of a data fidelity term and a non-smooth Ginzburg--Landau energy. Standard (sub)gradient descent methods have been shown to be inefficient when approaching such problems. Splitting techniques are much more useful: here, the target function is partitioned into a sum of two subtarget functions -- each of which can be efficiently optimised. Splitting proceeds by performing optimisation steps alternately with respect to each of the two subtarget functions.

In this work, we study splitting from a stochastic continuous-time perspective. Indeed, we define a differential inclusion that follows the negative subdifferential of one of the two subtarget functions at each point in time. The choice of the subtarget function is controlled by a binary continuous-time Markov process. The resulting dynamical system is a stochastic approximation of the underlying subgradient flow. We investigate this stochastic approximation for an L1-regularised sparse inversion flow and for a discrete Allen-Cahn equation minimising a Ginzburg--Landau energy. In both cases, we study the long-time behaviour of the stochastic dynamical system and its ability to approximate the underlying subgradient flow to any accuracy. We illustrate our theoretical findings in a simple sparse estimation problem and also in low- and high-dimensional classification problems.
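A discrete-time caricature of the randomised splitting scheme, with a two-state Markov chain choosing which subtarget's (sub)gradient flow to follow; the problem sizes and switching rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two subtargets: least-squares fidelity 0.5||Ax - b||^2 and L1 term lam*||x||_1.
A = rng.normal(size=(50, 100))
x_true = np.zeros(100)
x_true[:5] = 3.0                             # sparse ground truth
b = A @ x_true
lam, lr, switch_prob = 0.5, 1e-3, 0.1

x = np.zeros(100)
state = 0
for _ in range(50_000):
    if rng.random() < switch_prob:
        state = 1 - state                    # Markov switch between subtargets
    if state == 0:
        x -= lr * A.T @ (A @ x - b)          # gradient of the fidelity term
    else:
        x -= lr * lam * np.sign(x)           # subgradient of the L1 term
print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))   # approx. recovery
```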


Mon, 06 Feb 2023

14:00 - 15:00
L6

Constrained and Multirate Training of Neural Networks

Tiffany Vlaar
(McGill University)
Abstract

I will describe algorithms for regularizing and training deep neural networks. Soft constraints, which add a penalty term to the loss, are typically used as a form of explicit regularization for neural network training. In this talk I describe a method for efficiently incorporating constraints into a stochastic gradient Langevin framework for the training of deep neural networks. In contrast to soft constraints, our constraints offer direct control of the parameter space, which allows us to study their effect on generalization. In the second part of the talk, I illustrate the presence of latent multiple time scales in deep learning applications.

Different features present in the data can be learned by training a neural network on different time scales simultaneously. By choosing appropriate partitionings of the network parameters into fast and slow parts, I show that our multirate techniques can be used to train deep neural networks for transfer learning applications in vision and natural language processing in half the time, without reducing the generalization performance of the model.
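A toy sketch of the multirate idea, with parameters partitioned into a fast part updated every step and a slow part updated every k-th step; the layer-wise partitioning and the value of k below are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression: y ~ w_fast . tanh(X @ W_slow)
X = rng.normal(size=(200, 10))
y = rng.normal(size=200)
W_slow = rng.normal(size=(10, 20)) * 0.1     # e.g. early layers: slow
w_fast = rng.normal(size=20) * 0.1           # e.g. final layer: fast
k, lr = 5, 1e-2

for step in range(3000):
    H = np.tanh(X @ W_slow)
    r = H @ w_fast - y                       # residuals
    w_fast -= lr * H.T @ r / len(y)          # fast weights: every step
    if step % k == 0:                        # slow weights: every k-th step
        G = X.T @ (np.outer(r, w_fast) * (1 - H**2)) / len(y)
        W_slow -= lr * G
print(0.5 * np.mean(r**2))                   # training loss after the run
```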

Mon, 23 Jan 2023

14:00 - 15:00
L6

Deep low-rank transport maps for Bayesian inverse problems

Sergey Dolgov
(University of Bath)
Abstract

Characterising intractable high-dimensional random variables is one of the fundamental challenges in stochastic computation. We develop a deep transport map that is suitable for sampling concentrated distributions defined by an unnormalised density function. We approximate the target distribution as the push-forward of a reference distribution under a composition of order-preserving transformations, in which each transformation is formed by a tensor train decomposition. The use of composition of maps moving along a sequence of bridging densities alleviates the difficulty of directly approximating concentrated density functions. We propose two bridging strategies suitable for wide use: tempering the target density with a sequence of increasing powers, and smoothing of an indicator function with a sequence of sigmoids of increasing scales. The latter strategy opens the door to efficient computation of rare event probabilities in Bayesian inference problems.

Numerical experiments on problems constrained by differential equations show little to no increase in the computational complexity as the event probability goes to zero, and allow us to compute hitherto unattainable estimates of rare event probabilities for complex, high-dimensional posterior densities.
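The two bridging strategies can be written down directly; here is a minimal sketch for a 1-dimensional toy problem (the tensor-train transport maps themselves are beyond this sketch, and the scales and powers are illustrative).

```python
import numpy as np

# Bridging densities interpolating from an easy reference to a hard target.
log_target = lambda x: -0.5 * x**2            # e.g. a standard normal target
g = lambda x: x - 4.0                         # rare event {g(x) >= 0} = {x >= 4}

# (a) tempering: pi_k(x) ~ pi(x)^beta_k with a sequence of increasing powers
betas = [0.1, 0.3, 1.0]
tempered = [lambda x, b=b: b * log_target(x) for b in betas]

# (b) sigmoid smoothing of the indicator, with a sequence of decreasing scales
scales = [2.0, 0.5, 0.1]
log_sigmoid = lambda z: -np.logaddexp(0.0, -z)   # numerically stable log-sigmoid
smoothed = [lambda x, s=s: log_target(x) + log_sigmoid(g(x) / s)
            for s in scales]

x = np.linspace(-5, 6, 7)
print(tempered[0](x))     # each entry is an unnormalised log-density
print(smoothed[-1](x))    # ~ log of pi(x) * 1{x >= 4} as the scale shrinks
```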