Partial differential equations (PDEs), often regarded as the language of physics and engineering, encode how quantities such as velocity, temperature, pressure, or concentration evolve in space and time. From fluid dynamics and climate science to electromagnetism and materials science, PDEs provide the mathematical framework through which we model the real world.
Yet, even when the governing equations are known, predicting their behaviour can be challenging. Consider weather forecasting, where the motion of the atmosphere is governed by the Navier-Stokes equations together with thermodynamic laws. While large-scale circulation patterns can be simulated with remarkable accuracy, precise rain forecasts – a matter that has become increasingly relevant to me since living in the United Kingdom – are typically reliable only over short time horizons, often no more than an hour ahead. The difficulty lies in turbulence and small-scale effects: tiny variations can amplify rapidly, and many fine-scale processes, such as cloud microphysics, cannot be fully resolved even on the most capable supercomputers. This challenge is not unique to meteorology.
In many scientific and engineering applications, PDE models contain terms that represent processes we cannot measure directly, cannot derive from first principles, or cannot resolve at the required accuracy. These terms encode missing, unresolved, or imperfectly understood physics, or reflect the practical limits of available computing power.
Increasingly, researchers and practitioners across science and engineering are turning to machine learning to enhance traditional physical models. This development has given rise to the field of scientific machine learning (SciML), where data-driven methods are integrated with physics-based modelling. The goal is to combine the flexibility of neural networks and the availability of large real-world datasets with well-established PDE models derived from first principles. In this framework, neural networks can be embedded directly into PDEs to represent missing or unresolved contributions. These neural components are then trained on simulation or observational data. Mathematically, this amounts to solving an inverse problem: learning neural network terms from data in order to approximate missing physics, while preserving the structural constraints imposed by the governing equations.
Despite their growing practical adoption and striking empirical successes, rigorous mathematical foundations of neural PDE models have remained limited. Training such neural network models embedded inside nonlinear PDEs leads to highly non-convex optimisation problems with complex loss landscapes. In general, non-convex optimisation problems offer no guarantees of global convergence: gradient-based methods may become trapped in local minima or saddle points, making it unclear whether training will discover a desirable global minimum.
In our research, we study an idealised but representative scenario, where a neural network $g_\theta^N(t,x)= \frac{1}{N^\beta}\sum_{i=1}^N c^i\sigma(w^{t,i}t + (w^i)^\top x + \eta^i)$ with a single hidden layer and $N$ neurons drives a nonlinear parabolic PDE of the form
\[\partial_t u_{\theta} + L u_{\theta} + \text{known physics} = g_{\theta}^N.\]
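As a concrete reading of this definition, here is a minimal NumPy sketch of the network $g_\theta^N$; the choice of $\tanh$ as the activation $\sigma$ and all sizes are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def g_theta(t, x, c, wt, w, eta, beta=1.0):
    """Scaled single-hidden-layer network g_theta^N(t, x).

    c, wt, eta have shape (N,); w has shape (N, d); x has shape (d,).
    beta is the scaling exponent; tanh stands in for the activation sigma.
    """
    pre = wt * t + w @ x + eta          # pre-activations w^{t,i} t + (w^i)^T x + eta^i
    N = c.shape[0]
    return (c * np.tanh(pre)).sum() / N**beta

# Illustrative evaluation with N = 4 neurons in d = 2 spatial dimensions
rng = np.random.default_rng(0)
N, d = 4, 2
c, wt, eta = rng.normal(size=N), rng.normal(size=N), rng.normal(size=N)
w = rng.normal(size=(N, d))
value = g_theta(0.5, np.array([0.3, -0.1]), c, wt, w, eta)
```

The factor $N^{-\beta}$ controls how the network behaves as the width $N$ grows; varying `beta` in the sketch only rescales the output.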
The resulting neural-network PDE model $u_{\theta}$ depends on the trainable neural network parameters $\theta=(c^i,w^{t,i},w^i,\eta^i)_{i=1,\dots,N}$ and can be calibrated to ground-truth data $h$ by minimising the least-squares loss $J(\theta) = \frac{1}{2} \int_0^T\!\!\int_D (u_{\theta}(t,x) - h(t,x))^2 \,dxdt$. The parameters $\theta$ are trained through continuous-time gradient descent $\frac{d}{d\tau} \theta_\tau = - \alpha^N_\tau \nabla_{\theta}J(\theta_\tau)$, where $\tau$ denotes the training time and $\alpha^N_\tau$ a suitable learning rate. The gradient $\nabla_{\theta}J(\theta) = \int_0^T\!\!\int_D \nabla_{\theta}g_{\theta}^N(t,x)\,\widehat{u}_{\theta}(t,x) \,dxdt$ can be evaluated in a computationally efficient manner by solving an associated adjoint PDE for $\widehat{u}_{\theta}$, the continuous analogue of backpropagation.
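To make the adjoint idea concrete, the following is a hypothetical 1D illustration (not the paper's code): a heat equation $\partial_t u = \nu\,\partial_{xx}u + g_\theta$ discretised with explicit Euler, a backward adjoint sweep that yields $\nabla_\theta J$, and one gradient-descent step. The diffusivity `nu`, all grid sizes, the learning rate, and the synthetic target `h` are illustrative choices.

```python
import numpy as np

# Hypothetical 1D setup: u_t = nu * u_xx + g_theta(t, x) on (0, 1),
# zero Dirichlet boundary conditions, u(0, .) = 0, explicit Euler in time.
nu, Nx, Nt, T = 0.1, 20, 200, 1.0
dx, dt = 1.0 / (Nx + 1), T / Nt              # dt < dx^2 / (2 nu): stable
x = np.linspace(dx, 1.0 - dx, Nx)            # interior grid points
tgrid = np.linspace(dt, T, Nt)

rng = np.random.default_rng(1)
Nn = 8                                       # hidden neurons
theta = {k: 0.1 * rng.normal(size=Nn) for k in ("c", "wt", "w", "eta")}

def g(theta, tn):
    """Single-hidden-layer source g_theta(t_n, x) on the grid, shape (Nx,)."""
    pre = theta["wt"][:, None] * tn + theta["w"][:, None] * x + theta["eta"][:, None]
    return (theta["c"][:, None] * np.tanh(pre)).sum(0) / Nn

def g_grads(theta, tn):
    """dg/dtheta for each parameter group, each of shape (Nn, Nx)."""
    pre = theta["wt"][:, None] * tn + theta["w"][:, None] * x + theta["eta"][:, None]
    s = np.tanh(pre)
    deta = theta["c"][:, None] * (1.0 - s**2) / Nn
    return {"c": s / Nn, "wt": deta * tn, "w": deta * x, "eta": deta}

def laplacian(u):
    out = -2.0 * u
    out[1:] += u[:-1]
    out[:-1] += u[1:]
    return out / dx**2                       # zero Dirichlet boundaries

h = np.sin(np.pi * x) * tgrid[:, None]       # synthetic target data h(t, x)

def forward(theta):
    """Explicit Euler; returns u at times t_1, ..., t_Nt, shape (Nt, Nx)."""
    u, us = np.zeros(Nx), []
    for n in range(Nt):
        u = u + dt * (nu * laplacian(u) + g(theta, n * dt))
        us.append(u)
    return np.array(us)

def loss_and_grad(theta):
    """Least-squares loss J and its gradient via a backward adjoint sweep."""
    us = forward(theta)
    J = 0.5 * dt * dx * ((us - h) ** 2).sum()
    grad = {k: np.zeros(Nn) for k in theta}
    r = np.zeros(Nx)                         # adjoint state ("backprop in time")
    for n in range(Nt - 1, -1, -1):
        # Exact adjoint of the explicit Euler step, driven by the residual u - h
        r = dt * dx * (us[n] - h[n]) + r + dt * nu * laplacian(r)
        G = g_grads(theta, n * dt)
        for k in grad:
            grad[k] += dt * (G[k] @ r)
    return J, grad

# One gradient-descent step (Euler discretisation in the training time tau)
J0, grad = loss_and_grad(theta)
theta = {k: theta[k] - 10.0 * grad[k] for k in theta}
J1, _ = loss_and_grad(theta)
```

Because the backward sweep is the transpose of the discrete forward scheme, the resulting gradient should match a finite-difference check on the discrete loss, and a small enough descent step decreases $J$.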
To investigate the convergence of this adjoint gradient descent method, we analyse the scaling limit where both the number of neurons and the training time tend to infinity. For a broad class of nonlinear parabolic PDEs with a neural network embedded in the source term, we prove that, in the infinite-width limit, the trained neural PDE solution $u_{\theta_\tau}$ converges to the target data $h$ as the training time $\tau\rightarrow\infty$, that is, to a global minimiser of the loss $J$. This analysis is mathematically delicate: in the infinite-width limit, the training dynamics is governed by a non-local kernel operator without a spectral gap, and the nonlinear coupling with the PDE preserves the intrinsic non-convexity of the optimisation problem. These features place the problem beyond the reach of classical neural network convergence theory and require the development of new analytical techniques.
Our paper Global Convergence of Adjoint-Optimized Neural PDEs, written together with Justin Sirignano (Oxford Mathematics) and Konstantinos Spiliopoulos (Boston University, Department of Mathematics & Statistics), is published in the Journal of Machine Learning Research (JMLR).
This research project is supported by DMS-EPSRC: Asymptotic Analysis of Online Training Algorithms in Machine Learning: Recurrent, Graphical, and Deep Neural Networks.
Konstantin Riedl is a Postdoctoral Research Associate in Deep Learning in Oxford Mathematics and a Research Fellow in Artificial Intelligence & Machine Learning at Reuben College.