Modelling functions of sequential data with neural networks and the signature transform

Oxford Mathematician Patrick Kidger talks about his recent work on applying the tools of controlled differential equations to machine learning.

Sequential Data

The changing air pressure at a particular location may be thought of as a sequence in $\mathbb{R}$; the motion of a pen on paper may be thought of as a sequence in $\mathbb{R}^2$; the changes within financial markets may be thought of as a sequence in $\mathbb{R}^d$, with $d$ potentially very large.

The goal is often to learn some function of this data, for example to understand the weather, to classify what letter has been drawn, or to predict how financial markets will change.

In all of these cases, the data is ordered sequentially, meaning that it comes with a natural path-like structure: in general the data may be thought of as a discretisation of a path $f \colon [0, 1] \to V$, where $V$ is some Banach space. (In practice this is typically $V = \mathbb{R}^d$.)

The Signature Transform

When we know that data comes with some extra structure like this, we can seek to exploit that knowledge by using tools specifically adapted to the problem. For example, a tool for sequential data that is familiar to many people is the Fourier transform.

Here we use something similar, called the signature transform, which is famous for its use in rough path theory and controlled differential equations.

The signature transform has a rather complicated-looking definition. For a differentiable path $f = (f_1, \ldots, f_d) \colon [0, 1] \to \mathbb{R}^d$, its signature truncated at depth $N$ is \[ \mathrm{Sig}^N(f) = \left(\left(\,\underset{0 < t_1 < \cdots < t_k < 1}{\int \cdots \int} \prod_{j = 1}^k \frac{\mathrm{d}f_{i_j}}{\mathrm{d}t}(t_j)\mathrm{d}t_1\cdots\mathrm{d}t_k \right)_{1\leq i_1,\ldots, i_k \leq d}\right)_{1\leq k \leq N} \]
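To make the definition more concrete, here is a minimal sketch that evaluates the $k = 1$ and $k = 2$ terms for a piecewise linear path through a list of sample points. The function name `signature_depth2` and the example path are illustrative assumptions, not part of any library. For straight segments the iterated integrals have a closed form, so no numerical integration is needed.

```python
def signature_depth2(points):
    """Depth-2 signature of the piecewise linear path through `points`.

    Illustrative helper (hypothetical name). Returns (level1, level2), where
    level1[i] is the k = 1 term with index i and level2[i][j] is the k = 2
    term with indices (i, j).
    """
    d = len(points[0])
    level1 = [0.0] * d
    level2 = [[0.0] * d for _ in range(d)]
    for start, end in zip(points, points[1:]):
        delta = [b - a for a, b in zip(start, end)]
        for i in range(d):
            for j in range(d):
                # Appending one straight segment: the increment accumulated so
                # far in channel i is integrated against the new increment in
                # channel j, plus the contribution from within the segment.
                level2[i][j] += level1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            level1[i] += delta[i]
    return level1, level2


# A pen stroke in the plane: one unit right, then one unit up.
level1, level2 = signature_depth2([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
print(level1)  # [1.0, 1.0]
print(level2)  # [[0.5, 1.0], [0.0, 0.5]]
```

The $k = 1$ terms are just the total increment of the path. The $k = 2$ terms additionally record how the two coordinates moved relative to one another.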

Whilst the Fourier transform extracts information about frequency, the signature transform instead extracts information about order and area. (It turns out that order and area are, in a certain sense, the same thing.)
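One way to see the link between order and area is to compare two paths in the plane that have the same endpoints but make their moves in the opposite order. The sketch below computes the signed area between a piecewise linear path and the chord joining its endpoints (often called the Lévy area). The function name `levy_area` and the two example paths are assumptions made for illustration.

```python
def levy_area(points):
    """Signed area between a piecewise linear planar path and its chord.

    Illustrative helper (hypothetical name): evaluates
    0.5 * integral of (x dy - y dx), with x and y measured from the start.
    """
    x0, y0 = points[0]
    area = 0.0
    for (xa, ya), (xb, yb) in zip(points, points[1:]):
        # On a straight segment the integrand reduces to this closed form.
        area += 0.5 * ((xa - x0) * (yb - ya) - (ya - y0) * (xb - xa))
    return area


right_then_up = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
up_then_right = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

print(levy_area(right_then_up))  # 0.5
print(levy_area(up_then_right))  # -0.5
```

Both paths have the same overall displacement, so they differ only in the order of their moves, and the sign of the area records which coordinate moved first.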

Furthermore (and unlike the Fourier transform), order and area capture all possible nonlinear effects: the signature transform is a universal nonlinearity, meaning that every continuous function of the underlying path can be approximated arbitrarily well by a linear function of its signature.

(Technically speaking, this is because the Fourier transform uses a basis for the space of paths, whilst the signature transform uses a basis for the space of functions of paths.)
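As a small worked example of a nonlinear function becoming linear, write $S^{(i_1, \ldots, i_k)}(f)$ for the term of the signature indexed by $(i_1, \ldots, i_k)$ (a shorthand assumed here for illustration). The product of two total increments is a nonlinear function of the path, yet it is exactly a sum of two depth-2 signature terms: \[ \big(f_1(1) - f_1(0)\big)\big(f_2(1) - f_2(0)\big) = \int_0^1\!\!\int_0^1 \frac{\mathrm{d}f_1}{\mathrm{d}t}(t_1)\,\frac{\mathrm{d}f_2}{\mathrm{d}t}(t_2)\,\mathrm{d}t_1\,\mathrm{d}t_2 = S^{(1,2)}(f) + S^{(2,1)}(f) \]

The second equality holds because the square $[0, 1]^2$ splits into the region $t_1 < t_2$, which gives $S^{(1,2)}(f)$, and the region $t_2 < t_1$, which gives $S^{(2,1)}(f)$.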

Besides this, the signature transform has many other nice properties, such as robustness to missing or irregularly sampled data, optional translation invariance, and optional sampling invariance.
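Two of these invariances can be checked directly on a small example. The sketch below uses a hypothetical helper, `signature_depth2`, that computes the depth-1 and depth-2 signature terms of a piecewise linear path, and compares a path with a translated copy and with a copy sampled more finely along the same curve. It only illustrates the basic, unaugmented transform, under the assumption that the data is interpolated linearly.

```python
def signature_depth2(points):
    """Depth-1 and depth-2 signature terms of a piecewise linear path.

    Illustrative helper (hypothetical name), using the closed form for
    straight segments.
    """
    d = len(points[0])
    level1 = [0.0] * d
    level2 = [[0.0] * d for _ in range(d)]
    for start, end in zip(points, points[1:]):
        delta = [b - a for a, b in zip(start, end)]
        for i in range(d):
            for j in range(d):
                level2[i][j] += level1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        level1 = [s + dx for s, dx in zip(level1, delta)]
    return level1, level2


path = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]

# Translation: the same shape, started from a different point.
shifted = [(x + 3.0, y - 2.0) for x, y in path]

# Resampling: extra sample points placed along the same curve.
resampled = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.0, 0.5), (1.0, 1.0)]

print(signature_depth2(shifted) == signature_depth2(path))    # True
print(signature_depth2(resampled) == signature_depth2(path))  # True
```

Only the increments of the path enter the computation, which is why the starting point drops out, and splitting a straight segment into shorter pieces leaves every term unchanged.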

Applications to Machine Learning

Machine learning, and in particular neural networks, is famous for its many recent achievements, from image classification to self-driving cars.

Given the great theoretical success of the signature transform, and the great empirical success of neural networks, it has been natural to try to bring these two together.

In particular, the problem of choosing activation functions and pooling functions for neural networks has usually been a matter of heuristics. Here, however, the theory behind the signature transform makes it a mathematically well-motivated choice of pooling function, specifically adapted to handle sequential data such as time series.

Bringing these two points of view together has been the purpose of the recent paper Deep Signature Transforms (accepted at NeurIPS 2019) by Patrick Kidger, Patric Bonnier, Imanol Perez Arribas, Cristopher Salvi, and Terry Lyons. Alongside this we have released Signatory, an efficient implementation of the signature transform capable of integrating with modern deep learning frameworks.