Consider a regression problem with a set of data
which are vectors describing the location of each datum in the parameter space, and which form the inputs to the problem, and \(y_i\), the outputs. The outputs may be noisy; in this work I will consider only situations where the noise is additive and Gaussian, so
where \(\sigma\) is the standard deviation of the noise, and \(f\) is the (latent) generating function of the data.
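For illustration, the following minimal sketch generates data under this noise model; the particular generating function, noise level, and input locations are arbitrary choices for the example rather than anything prescribed above.

#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """An arbitrary latent generating function, chosen purely for illustration."""
    return np.sin(3 * x)

sigma = 0.1                                       # standard deviation of the additive noise
x = rng.uniform(-2, 2, size=20)                   # input locations in the parameter space
y = f(x) + rng.normal(0.0, sigma, size=x.shape)   # noisy outputs: y_i = f(x_i) + Gaussian noise
#+END_SRC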
This regression problem can be addressed using Gaussian processes:
A gls:gaussian-process is a collection of random variables, any finite number of which have a joint Gaussian distribution cite:gpr.book.rw.
Whereas it is more conventional to consider a prior over a set of values, for example a normal distribution over the reals, the Gaussian process forms a prior over the functions \(f\) from equation ref:eq:gp:additive-noise which might form the regression fit to any observed data. This assumes that the values of the function \(f\) behave as
where \(\mat{K}\) is the covariance matrix of \(\vec{x}_1\) and \(\vec{x}_2\), which can be calculated with reference to some covariance function, \(k\), such that \(K_{ij} = k(\vec{x}_i, \vec{x}_j)\). Note that I have assumed that the abbr:gp is a zero-mean process; this assumption is frequent within the literature. While this prior is initially untrained it still contains information about our preconceptions of the data through the form of the covariance function: for example, whether we expect the fit to be smooth, or periodic. Covariance functions will be discussed in greater detail in section ref:sec:gp:covariance.
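The following sketch illustrates this prior numerically (a minimal example, not a prescription from the text): a covariance matrix is built over a grid of inputs using the squared-exponential covariance function of section ref:sec:gp:covariance, and zero-mean samples are drawn from the resulting multivariate normal distribution. The small diagonal ``jitter'' is an implementation detail for numerical stability rather than part of the model.

#+BEGIN_SRC python
import numpy as np

def k_se(x1, x2, l=1.0):
    """Squared-exponential covariance function (see the section on covariance functions)."""
    r = np.abs(x1[:, None] - x2[None, :])          # pairwise distances between inputs
    return np.exp(-0.5 * (r / l) ** 2)

x_star = np.linspace(-2, 2, 200)                   # locations at which to evaluate the prior
K = k_se(x_star, x_star)                           # covariance matrix, K_ij = k(x_i, x_j)

# Draw example functions from the zero-mean GP prior, f ~ N(0, K).
rng = np.random.default_rng(1)
jitter = 1e-10 * np.eye(len(x_star))               # keeps K numerically positive definite
prior_draws = rng.multivariate_normal(np.zeros(len(x_star)), K + jitter, size=5)
#+END_SRC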
By providing training data we can use Bayes' theorem to update the Gaussian process, in the same way that the posterior distribution is updated by the addition of new data in a standard Bayesian context, and a posterior on the set of all possible functions to fit the data is produced. Thus, for a vector of test values of the generating function, \(\vec{f}_*\), the joint posterior \(p(\vec{f}, \vec{f}_* | \vec{y})\), given the observed outputs \(\vec{y}\), can be found by updating the abbr:gp prior on the training and test function values, \(p(\vec{f}, \vec{f}_*)\), with the likelihood \(p(\vec{y}|\vec{f})\):
Finally, the (latent) training-set function values, \(\vec{f}\), can be marginalised out:
We can take the mean of this posterior in place of the ``best fit line'' which other techniques produce, and use the variance to estimate the uncertainty of the prediction.
Both the prior \(p(\vec{f}, \vec{f}_*)\) and the likelihood \(p(\vec{y}|\vec{f})\) are Gaussian:
with
and \(\mat{I}\) the identity matrix.
This leaves the marginalised posterior with an analytical form:
Figures ref:fig:gp:training-data to ref:fig:gp:posterior-best show visually how a one-dimensional regressor can be created using a abbr:gp method, starting from a abbr:gp prior and (noisy) data.
The mean and variance of this posterior distribution can be used to form a regressor for the data, \(\set{D}\), with the mean taking the role of a ``line-of-best-fit'' in conventional regression techniques, while the variance describes the goodness of that fit.
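As a concrete sketch of how this posterior yields a regressor, the function below implements the standard closed-form expressions for a zero-mean abbr:gp with additive Gaussian noise; the kernel \(k\), the data, and the noise level are assumed to come from sketches like those above, and this is an illustration rather than a definitive implementation.

#+BEGIN_SRC python
import numpy as np

def gp_posterior(x_train, y_train, x_test, k, sigma):
    """Posterior mean and covariance of a zero-mean GP regressor with additive Gaussian noise.

    Uses the standard closed-form result: with Ky = K(X, X) + sigma^2 I,
    mean = K(X*, X) Ky^{-1} y   and   cov = K(X*, X*) - K(X*, X) Ky^{-1} K(X, X*).
    """
    K_y = k(x_train, x_train) + sigma ** 2 * np.eye(len(x_train))
    K_s = k(x_test, x_train)                     # covariances between test and training inputs
    K_ss = k(x_test, x_test)                     # covariances among the test inputs

    alpha = np.linalg.solve(K_y, y_train)        # Ky^{-1} y, without forming the inverse explicitly
    mean = K_s @ alpha
    cov = K_ss - K_s @ np.linalg.solve(K_y, K_s.T)
    return mean, cov

# For example, with the data and kernel from the earlier sketches:
#   mean, cov = gp_posterior(x, y, x_star, k_se, sigma)
# the mean plays the role of the fit and np.sqrt(np.diag(cov)) its pointwise uncertainty.
#+END_SRC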
A graphical model of a abbr:gp is shown in figure ref:fig:gp:chain-diagram, which illustrates an important property of the abbr:gp model: the addition (or removal) of any input point to the abbr:gp does not change the distribution of the other variables. This property allows outputs to be generated at arbitrary locations throughout the parameter space.
Gaussian processes trained with \(N\) training data require the ability to both store and invert an \(N\times N\) matrix of covariances between observations; this can be a considerable computational challenge.
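As an illustration of this scaling (a sketch only; the Cholesky factorisation used here is a common implementation choice rather than anything prescribed above), the covariance matrix alone occupies \(8N^{2}\) bytes in double precision, and factorising or inverting it requires of order \(N^{3}\) operations.

#+BEGIN_SRC python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

N = 5000                                          # number of training data
rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, N)

# Squared-exponential covariances plus a noise term on the diagonal.
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-2 * np.eye(N)
print(f"storage for K: {K.nbytes / 1e9:.2f} GB")  # 8 N^2 bytes in double precision

# Solving K alpha = y via a Cholesky factorisation scales as N^3 in time,
# which becomes prohibitive as the number of observations grows.
y = rng.normal(size=N)
alpha = cho_solve(cho_factor(K, lower=True), y)
#+END_SRC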
Gaussian processes can be extended from the case of a one-dimensional input predicting a one-dimensional output to that of a multi-dimensional input predicting a multi-dimensional output cite:2011arXiv1106.6251A,Alvarez2011a,Bonilla2007.
The covariance function defines the similarity of a pair of data points, according to some relationship with suitable properties. The similarity of the input data is assumed to be related to the similarity of the outputs: the more similar two inputs are, the more likely their outputs are to be similar.
As such, the form of the covariance function represents prior knowledge about the data, and can encode understanding of effects such as periodicity within the data.
A stationary covariance function is a function of the form \(f(\vec{x} - \vec{x}')\), and is thus invariant to translations in the input space.
If a covariance function is a function of the form \(f(|\vec{x} - \vec{x}'|)\) then it is isotropic, and invariant under all rigid motions.
A covariance function which is both stationary and isotropic can be expressed as a function of a single variable, \(r = | \vec{x} - \vec{x}' |\), and is known as a abbr:rbf. Functions of the form \(k : (\vec{x}, \vec{x}') \to \mathbb{C}\), for two vectors \(\vec{x}, \vec{x}' \in \mathcal{X}\), are often known as kernels, and I will refer to covariance functions and kernels interchangeably where the covariance function has this form.
For a set of points \(\setbuilder{ \vec{x}_{i} | i = 1, \dots, n }\) a kernel, \(k\), can be used to construct the Gram matrix, \(K_{ij} = k(\vec{x}_{i}, \vec{x}_{j})\). If the kernel is also a covariance function then \(\mat{K}\) is known as a covariance matrix.
For a kernel to be a valid covariance function for a abbr:gp it must produce a positive semidefinite covariance matrix \(\mat{K}\). Such a matrix, \(\mat{K} \in \mathbb{R}^{n \times n}\) must satisfy \(\vec{x}^{\transpose} \mat{K} \vec{x} \geq 0\) for all \(\vec{x} \in \mathbb{R}^{n}\).
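This requirement can be checked numerically, as in the brief sketch below, which uses the abbr:se covariance function discussed below purely as an example: a symmetric matrix is positive semidefinite exactly when none of its eigenvalues is negative, up to floating-point round-off.

#+BEGIN_SRC python
import numpy as np

def k_se(x1, x2, l=1.0):
    """Squared-exponential kernel, used here only as an example of a valid covariance function."""
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / l) ** 2)

x = np.random.default_rng(3).uniform(-3, 3, 50)
K = k_se(x, x)                                    # Gram matrix, K_ij = k(x_i, x_j)

# A symmetric matrix is positive semidefinite iff all eigenvalues are non-negative;
# the small tolerance allows for floating-point round-off.
eigenvalues = np.linalg.eigvalsh(K)
assert eigenvalues.min() > -1e-10
#+END_SRC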
One of the most frequently encountered covariance functions in the literature is the abbr:se covariance function cite:gpr.book.rw. Perhaps as a result of its near-ubiquity, this kernel is known under a number of similar but confusing names (which are often inaccurate). These include the exponential quadratic, quadratic exponential, squared exponential, and even Gaussian covariance function.
The reason for this is its form, which closely resembles that of the Gaussian function:
for \(r\) the Euclidean distance between a pair of data, \(r = |\vec{x} - \vec{x}'|\), and \(l\) a scale factor associated with the axis along which the data are defined.
The squared exponential function imposes strong smoothness constraints on the model, as it is infinitely differentiable.
The scale factor, \(l\), in equation ref:eq:gp:kernels:se, also known as the length scale, defines the scale of the variations within the process. This characteristic length scale can be understood cite:adler1976,gpr.book.rw in terms of the number of times the abbr:gp should cross some given level (for example, zero). Indeed, for a abbr:gp with a covariance function \(k\) which has well-defined first and second derivatives, the expected number of times, \(N_{u}\), that the process will cross a value \(u\) is
A zero-mean abbr:gp which has an abbr:se covariance structure will then cross zero \(1/(2 \pi l)\) times per unit interval, on average.
Examples of the squared exponential covariance function, and of draws from a Gaussian process prior which uses this covariance function, are plotted in figure ref:fig:gp:covariance:overviews:se for a variety of different length scales.
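The sketch below mirrors this comparison numerically, assuming the form \(k_{\mathrm{SE}}(r) = \exp(-r^{2}/2l^{2})\); the particular length scales are arbitrary example values.

#+BEGIN_SRC python
import numpy as np

def k_se(r, l):
    """Squared-exponential covariance as a function of the pairwise distance r."""
    return np.exp(-0.5 * (r / l) ** 2)

x = np.linspace(0, 10, 400)
r = np.abs(x[:, None] - x[None, :])               # matrix of pairwise distances
rng = np.random.default_rng(4)

for l in (0.3, 1.0, 3.0):                         # example length scales
    K = k_se(r, l) + 1e-10 * np.eye(len(x))       # jitter for numerical stability
    draw = rng.multivariate_normal(np.zeros(len(x)), K)   # one prior draw; smoother for larger l
#+END_SRC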
For data which are not generated by a smooth function, a suitable choice may be the exponential covariance function, \(k_{\mathrm{EX}}\), which is defined
where \(r\) is the pairwise distance between data and \(l\) is a length scale, as in equation ref:eq:gp:kernels:se.
Examples of the exponential covariance function, and of draws from a Gaussian process prior which uses this covariance function, are plotted in figure ref:fig:gp:covariance:overviews:ex for a variety of different length scales.
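A corresponding sketch for the exponential covariance function, assuming the usual form \(k_{\mathrm{EX}}(r) = \exp(-r/l)\), shows that draws from a prior with this covariance are continuous but markedly rougher than in the squared-exponential case.

#+BEGIN_SRC python
import numpy as np

def k_ex(r, l):
    """Exponential covariance as a function of the pairwise distance r."""
    return np.exp(-r / l)

x = np.linspace(0, 10, 400)
r = np.abs(x[:, None] - x[None, :])               # matrix of pairwise distances
rng = np.random.default_rng(5)

for l in (0.3, 1.0, 3.0):                         # example length scales
    K = k_ex(r, l) + 1e-10 * np.eye(len(x))       # jitter for numerical stability
    draw = rng.multivariate_normal(np.zeros(len(x)), K)   # continuous but non-smooth draws
#+END_SRC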
For data generated by functions which are smooth, but not necessarily infinitely differentiable, we may turn to the Matérn family of covariance functions, which take the form
for \(K_{\nu}\) the modified Bessel function of the second kind, and \(\Gamma\) the gamma function. As with the previous two covariance functions, \(l\) is a length scale parameter, and \(r\) the distance between two data. A abbr:gp which has a Matérn covariance function will be \((\lceil \nu \rceil - 1)\)-times differentiable.
While determining an appropriate value of \(\nu\) during the training of the abbr:gp is possible, it is common to select a value for this quantity a priori. The values \(\nu=3/2\) and \(\nu=5/2\) are common choices, as \(K_{\nu}\) can then be determined simply, and the resulting covariance functions have simple analytical forms.
The case with \(\nu=3/2\), commonly referred to as the Matérn-\(3/2\) kernel, then becomes
Examples of this covariance function, and draws from a abbr:gp which uses it, are plotted in figure ref:fig:gp:kernels:m32.
Similarly, the Matérn-\(5/2\) kernel is the case where \(\nu = 5/2\), taking the form
Again, examples of this covariance function, and draws from a abbr:gp which uses it, are plotted in figure ref:fig:gp:kernels:m52.
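These two cases can be written compactly, as in the sketch below, assuming the standard closed-form expressions for \(\nu = 3/2\) and \(\nu = 5/2\); the length scale is an arbitrary example value.

#+BEGIN_SRC python
import numpy as np

def k_matern32(r, l):
    """Matérn-3/2 covariance: (1 + sqrt(3) r / l) exp(-sqrt(3) r / l)."""
    s = np.sqrt(3.0) * r / l
    return (1.0 + s) * np.exp(-s)

def k_matern52(r, l):
    """Matérn-5/2 covariance: (1 + sqrt(5) r / l + 5 r^2 / (3 l^2)) exp(-sqrt(5) r / l)."""
    s = np.sqrt(5.0) * r / l
    return (1.0 + s + s ** 2 / 3.0) * np.exp(-s)

r = np.linspace(0, 5, 500)
cov_32 = k_matern32(r, l=1.0)                     # once-differentiable sample paths
cov_52 = k_matern52(r, l=1.0)                     # twice-differentiable sample paths
#+END_SRC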
Data may also be generated from functions with variation on multiple scales. One approach to modelling such data is to use a abbr:gp with rational quadratic covariance. This covariance function represents a scale mixture of abbr:se covariance functions, each with a different characteristic length scale. The rational quadratic covariance function is defined as
where \(\alpha\) is a parameter which controls the weighting of small-scale compared to large-scale variations, and \(l\) and \(r\) are, respectively, the overall length scale of the covariance and the distance between two data. Examples of this function, at a variety of different length scales and \(\alpha\) values, and draws from abpl:gp which use these functions, are plotted in figure ref:fig:gp:kernels:rq.
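A sketch of this covariance function, assuming the standard form \(k_{\mathrm{RQ}}(r) = (1 + r^{2}/2\alpha l^{2})^{-\alpha}\), with arbitrary example values of \(\alpha\):

#+BEGIN_SRC python
import numpy as np

def k_rq(r, l, alpha):
    """Rational quadratic covariance: (1 + r^2 / (2 alpha l^2))^(-alpha)."""
    return (1.0 + r ** 2 / (2.0 * alpha * l ** 2)) ** (-alpha)

r = np.linspace(0, 5, 500)
for alpha in (0.5, 2.0, 10.0):                    # weighting of small- against large-scale variation
    cov = k_rq(r, l=1.0, alpha=alpha)             # large alpha approaches the squared-exponential limit
#+END_SRC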
This summary of potential covariance functions for use with a abbr:gp is far from complete (see cite:gpr.book.rw for a more detailed list). However, these four can be used alone or in combination to produce highly flexible regression models, since covariance functions can be added and multiplied like ordinary functions, and sums and products of valid covariance functions are themselves valid covariance functions.
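As a final sketch of this composability (using the example kernel forms defined in the sketches above; the weights and length scales are arbitrary):

#+BEGIN_SRC python
import numpy as np

def k_se(r, l):
    return np.exp(-0.5 * (r / l) ** 2)

def k_ex(r, l):
    return np.exp(-r / l)

# Sums and products of valid covariance functions are themselves valid covariance
# functions, so flexible models can be composed from these simple parts.
def k_sum(r):
    return k_se(r, l=3.0) + 0.5 * k_ex(r, l=0.3)  # slow smooth trend plus short-scale roughness

def k_product(r):
    return k_se(r, l=3.0) * k_ex(r, l=0.3)

r = np.linspace(0, 5, 500)
cov_sum, cov_product = k_sum(r), k_product(r)
#+END_SRC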