
Exact GPR Method

An instance of response y from a Gaussian process regression (GPR) model can be modeled as

$$ P(y_i \mid f(x_i), x_i) \sim N\left(y_i \mid h(x_i)^T\beta + f(x_i),\, \sigma^2\right) $$

Hence, making predictions for new data from a GPR model requires:

  • Knowledge of the coefficient vector, β, of fixed basis functions

  • Ability to evaluate the covariance function k(x,x′|θ) for arbitrary x and x′, given the kernel parameters or hyperparameters, θ

  • Knowledge of the noise variance σ² that appears in the density P(yi|f(xi),xi)

That is, one needs first to estimate β, θ, and σ² from the data (X,y).
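
The following MATLAB sketch illustrates these ingredients: it simulates data from the model above (with an assumed squared exponential kernel and linear basis h(x) = [1, x], chosen only for this example) and then estimates β, θ, and σ² by fitting a GPR model with fitrgp using the exact method. It is an illustration, not part of the derivation.

```matlab
% Minimal sketch: simulate data from the GPR observation model and
% estimate beta, theta, and sigma^2 with fitrgp (exact method).
rng(0);
n    = 100;
x    = linspace(0, 10, n)';          % 1-D inputs
H    = [ones(n,1) x];                % linear basis h(x) = [1, x]
beta = [1; 0.5];                     % true coefficients (assumed for the demo)
ell  = 1.5; sigmaF = 1; sigma = 0.2; % true kernel parameters and noise std

K = sigmaF^2 * exp(-(x - x').^2 / (2*ell^2));       % squared exponential kernel matrix
f = chol(K + 1e-8*eye(n), 'lower') * randn(n, 1);   % draw latent f ~ GP(0, K), small jitter for stability
y = H*beta + f + sigma*randn(n, 1);                 % observed responses

gprMdl = fitrgp(x, y, 'FitMethod', 'exact', 'PredictMethod', 'exact', ...
    'BasisFunction', 'linear', 'KernelFunction', 'squaredexponential');

betaHat  = gprMdl.Beta                                % estimate of beta
thetaHat = gprMdl.KernelInformation.KernelParameters  % [length scale; signal std]
sigmaHat = gprMdl.Sigma                               % estimated noise std, so sigma2 = sigmaHat^2
```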

Parameter Estimation

One approach for estimating the parameters β, θ, and σ² of a GPR model is by maximizing the likelihood P(y|X) as a function of β, θ, and σ² [1]. That is, if β̂, θ̂, and σ̂² are the estimates of β, θ, and σ², respectively, then:

$$ \hat{\beta}, \hat{\theta}, \hat{\sigma}^2 = \arg\max_{\beta,\theta,\sigma^2}\, \log P(y \mid X, \beta, \theta, \sigma^2). $$

Because

$$ P(y \mid X) = P(y \mid X, \beta, \theta, \sigma^2) = N\left(y \mid H\beta,\, K(X,X|\theta) + \sigma^2 I_n\right), $$

the marginal log likelihood function is as follows:

$$ \log P(y \mid X, \beta, \theta, \sigma^2) = -\frac{1}{2}(y - H\beta)^T \left[K(X,X|\theta) + \sigma^2 I_n\right]^{-1}(y - H\beta) - \frac{n}{2}\log 2\pi - \frac{1}{2}\log\left|K(X,X|\theta) + \sigma^2 I_n\right|, $$

where H is the matrix of explicit basis functions, and K(X,X|θ) is the covariance function matrix (for more information, see Gaussian Process Regression Models).
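
This log likelihood can be evaluated with a few matrix operations. The helper below is a minimal sketch in plain MATLAB (not a toolbox function): it uses a Cholesky factorization of K(X,X|θ) + σ²In, so no explicit matrix inverse is formed and the log determinant comes from the diagonal of the factor.

```matlab
function L = gprLogLik(y, H, K, beta, sigma2)
% Marginal log likelihood log P(y | X, beta, theta, sigma2) for a GPR model.
% K is the n-by-n kernel matrix K(X,X|theta); H is the n-by-p basis matrix.
% (Save as gprLogLik.m or use as a local function.)
    n = numel(y);
    R = chol(K + sigma2*eye(n));      % upper triangular: K + sigma2*I = R'*R
    v = R' \ (y - H*beta);            % whitened residual
    L = -0.5*(v'*v) - 0.5*n*log(2*pi) - sum(log(diag(R)));
end
```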

To estimate the parameters, the software first computes β̂(θ,σ²), which maximizes the log likelihood function with respect to β for given θ and σ². It then uses this estimate to compute the β-profiled likelihood:

$$ \log\left\{P\left(y \mid X, \hat{\beta}(\theta,\sigma^2), \theta, \sigma^2\right)\right\}. $$

The estimate of β for given θ and σ² is

$$ \hat{\beta}(\theta,\sigma^2) = \left[H^T\left[K(X,X|\theta) + \sigma^2 I_n\right]^{-1} H\right]^{-1} H^T\left[K(X,X|\theta) + \sigma^2 I_n\right]^{-1} y. $$

Then, the β-profiled log likelihood is given by

$$ \log P\left(y \mid X, \hat{\beta}(\theta,\sigma^2), \theta, \sigma^2\right) = -\frac{1}{2}\left(y - H\hat{\beta}(\theta,\sigma^2)\right)^T \left[K(X,X|\theta) + \sigma^2 I_n\right]^{-1}\left(y - H\hat{\beta}(\theta,\sigma^2)\right) - \frac{n}{2}\log 2\pi - \frac{1}{2}\log\left|K(X,X|\theta) + \sigma^2 I_n\right|. $$

The software then maximizes the β-profiled log likelihood over θ and σ² to find their estimates.
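
A rough MATLAB sketch of this profiling scheme, continuing the simulated example above (squared exponential kernel assumed), is shown below. For each candidate (θ, σ²), the code computes β̂(θ,σ²) from the formula above, substitutes it into the log likelihood, and passes the negative of the result to a generic optimizer (fminsearch); the parameters are optimized on the log scale to keep them positive. This is only an illustration of the equations, not the toolbox implementation.

```matlab
% x, y, H as in the earlier simulation sketch; squared exponential kernel assumed.
kfun = @(ell, sf) sf^2 * exp(-(x - x').^2 / (2*ell^2));   % K(X,X|theta)

p0   = log([1; 1; 0.1]);                                  % initial [ell; sigmaF; sigma]
pHat = fminsearch(@(p) profiledNLL(p, y, H, kfun), p0);
thetaHat  = exp(pHat(1:2));                               % estimated kernel parameters
sigma2Hat = exp(2*pHat(3));                               % estimated noise variance

% Recover beta_hat(theta_hat, sigma2_hat) from the generalized least squares formula.
C       = kfun(thetaHat(1), thetaHat(2)) + sigma2Hat*eye(numel(y));
betaHat = (H' / C * H) \ (H' / C * y);

% Local function: beta-profiled negative log likelihood (place at the end of the script).
function nll = profiledNLL(p, y, H, kfun)
    ell = exp(p(1)); sf = exp(p(2)); sigma2 = exp(2*p(3));  % enforce positivity
    n  = numel(y);
    R  = chol(kfun(ell, sf) + sigma2*eye(n));  % K(X,X|theta) + sigma2*I = R'*R
    HR = R' \ H;  yR = R' \ y;                 % whitened basis and response
    betaHat = (HR'*HR) \ (HR'*yR);             % beta_hat(theta, sigma2)
    v  = yR - HR*betaHat;
    nll = 0.5*(v'*v) + 0.5*n*log(2*pi) + sum(log(diag(R)));
end
```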

Prediction

Making probabilistic predictions from a GPR model with known parameters requires the density P(ynew|y,X,xnew). Using the definition of conditional probabilities, one can write:

$$ P(y_{\mathrm{new}} \mid y, X, x_{\mathrm{new}}) = \frac{P(y_{\mathrm{new}}, y \mid X, x_{\mathrm{new}})}{P(y \mid X, x_{\mathrm{new}})}. $$

To find the joint density in the numerator, it is necessary to introduce the latent variables fnew and f corresponding to ynew and y, respectively. Then, it is possible to use the joint distribution for ynew, y, fnew, and f to compute P(ynew,y|X,xnew):

$$ P(y_{\mathrm{new}}, y \mid X, x_{\mathrm{new}}) = \int P(y_{\mathrm{new}}, y, f_{\mathrm{new}}, f \mid X, x_{\mathrm{new}})\, df\, df_{\mathrm{new}} = \int P(y_{\mathrm{new}}, y \mid f_{\mathrm{new}}, f, X, x_{\mathrm{new}})\, P(f_{\mathrm{new}}, f \mid X, x_{\mathrm{new}})\, df\, df_{\mathrm{new}}. $$

Gaussian process models assume that each response yi depends only on the corresponding latent variable fi and the feature vector xi. Using this assumption to write P(ynew,y|fnew,f,X,xnew) as a product of conditional densities produces:

$$ P(y_{\mathrm{new}}, y \mid f_{\mathrm{new}}, f, X, x_{\mathrm{new}}) = P(y_{\mathrm{new}} \mid f_{\mathrm{new}}, x_{\mathrm{new}}) \prod_{i=1}^{n} P(y_i \mid f(x_i), x_i). $$

After integrating with respect to ynew, the result only depends on f and X:

$$ P(y \mid f, X) = \prod_{i=1}^{n} P(y_i \mid f_i, x_i) = \prod_{i=1}^{n} N\left(y_i \mid h(x_i)^T\beta + f_i,\, \sigma^2\right). $$

Hence,

$$ P(y_{\mathrm{new}}, y \mid f_{\mathrm{new}}, f, X, x_{\mathrm{new}}) = P(y_{\mathrm{new}} \mid f_{\mathrm{new}}, x_{\mathrm{new}})\, P(y \mid f, X). $$

Again using the definition of conditional probabilities,

$$ P(f_{\mathrm{new}}, f \mid X, x_{\mathrm{new}}) = P(f_{\mathrm{new}} \mid f, X, x_{\mathrm{new}})\, P(f \mid X, x_{\mathrm{new}}), $$

it is possible to write P(ynew,y|X,xnew) as follows:

$$ P(y_{\mathrm{new}}, y \mid X, x_{\mathrm{new}}) = \int P(y_{\mathrm{new}} \mid f_{\mathrm{new}}, x_{\mathrm{new}})\, P(y \mid f, X)\, P(f_{\mathrm{new}} \mid f, X, x_{\mathrm{new}})\, P(f \mid X, x_{\mathrm{new}})\, df\, df_{\mathrm{new}}. $$

Using the facts that

$$ P(f \mid X, x_{\mathrm{new}}) = P(f \mid X) $$

and

$$ P(y \mid f, X)\, P(f \mid X) = P(y, f \mid X) = P(f \mid y, X)\, P(y \mid X), $$

one can rewrite P(ynew,y|X,xnew) as follows:

$$ P(y_{\mathrm{new}}, y \mid X, x_{\mathrm{new}}) = P(y \mid X) \int P(y_{\mathrm{new}} \mid f_{\mathrm{new}}, x_{\mathrm{new}})\, P(f \mid y, X)\, P(f_{\mathrm{new}} \mid f, X, x_{\mathrm{new}})\, df\, df_{\mathrm{new}}. $$

It is also possible to show that

$$ P(y \mid X, x_{\mathrm{new}}) = P(y \mid X). $$

Hence, the required density P(ynew|y,X,xnew) is:

$$ P(y_{\mathrm{new}} \mid y, X, x_{\mathrm{new}}) = \frac{P(y_{\mathrm{new}}, y \mid X, x_{\mathrm{new}})}{P(y \mid X, x_{\mathrm{new}})} = \frac{P(y_{\mathrm{new}}, y \mid X, x_{\mathrm{new}})}{P(y \mid X)} = \int \underbrace{P(y_{\mathrm{new}} \mid f_{\mathrm{new}}, x_{\mathrm{new}})}_{(1)}\; \underbrace{P(f \mid y, X)}_{(2)}\; \underbrace{P(f_{\mathrm{new}} \mid f, X, x_{\mathrm{new}})}_{(3)}\; df\, df_{\mathrm{new}}. $$

It can be shown that

$$ (1)\quad P(y_{\mathrm{new}} \mid f_{\mathrm{new}}, x_{\mathrm{new}}) = N\left(y_{\mathrm{new}} \mid h(x_{\mathrm{new}})^T\beta + f_{\mathrm{new}},\, \sigma^2_{\mathrm{new}}\right) $$

$$ (2)\quad P(f \mid y, X) = N\left(f \;\middle|\; \frac{1}{\sigma^2}\left(\frac{I_n}{\sigma^2} + K(X,X)^{-1}\right)^{-1}(y - H\beta),\; \left(\frac{I_n}{\sigma^2} + K(X,X)^{-1}\right)^{-1}\right) $$

$$ (3)\quad P(f_{\mathrm{new}} \mid f, X, x_{\mathrm{new}}) = N\left(f_{\mathrm{new}} \mid K(x_{\mathrm{new}}^T, X)\, K(X,X)^{-1} f,\; \Delta\right), \quad \text{where}\quad \Delta = k(x_{\mathrm{new}}, x_{\mathrm{new}}) - K(x_{\mathrm{new}}^T, X)\, K(X,X)^{-1} K(X, x_{\mathrm{new}}^T). $$
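
Density (2) involves K(X,X)⁻¹, which can be poorly conditioned. By a standard matrix identity, the same mean and covariance can be written using only (K(X,X) + σ²In)⁻¹: the mean equals K(X,X)(K(X,X) + σ²In)⁻¹(y − Hβ) and the covariance equals K(X,X) − K(X,X)(K(X,X) + σ²In)⁻¹K(X,X). The short MATLAB sketch below (continuing the earlier example, with the estimates substituted for β, θ, and σ²) uses that equivalent form.

```matlab
% Sketch: posterior distribution of the latent vector f given y (density (2)),
% written in the equivalent form that only inverts K(X,X) + sigma2*I.
% kfun, betaHat, thetaHat, sigma2Hat, H, y, n are from the previous sketches.
Kxx   = kfun(thetaHat(1), thetaHat(2));   % K(X,X|theta_hat)
C     = Kxx + sigma2Hat*eye(n);           % K(X,X) + sigma2*I
fMean = Kxx * (C \ (y - H*betaHat));      % posterior mean of f
fCov  = Kxx - Kxx * (C \ Kxx);            % posterior covariance of f
```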

After the integration and required algebra, the density of the new response ynew at a new point xnew, given y and X, is found as

$$ P(y_{\mathrm{new}} \mid y, X, x_{\mathrm{new}}) = N\left(y_{\mathrm{new}} \mid h(x_{\mathrm{new}})^T\beta + \mu,\; \sigma^2_{\mathrm{new}} + \Sigma\right), $$

where

$$ \mu = K(x_{\mathrm{new}}^T, X)\, \underbrace{\left(K(X,X) + \sigma^2 I_n\right)^{-1}(y - H\beta)}_{\alpha} $$

and

$$ \Sigma = k(x_{\mathrm{new}}, x_{\mathrm{new}}) - K(x_{\mathrm{new}}^T, X)\left(K(X,X) + \sigma^2 I_n\right)^{-1} K(X, x_{\mathrm{new}}^T). $$

The expected value of the prediction ynew at a new point xnew, given y, X, and the parameters β, θ, and σ², is

$$ E(y_{\mathrm{new}} \mid y, X, x_{\mathrm{new}}, \beta, \theta, \sigma^2) = h(x_{\mathrm{new}})^T\beta + K(x_{\mathrm{new}}^T, X | \theta)\,\alpha = h(x_{\mathrm{new}})^T\beta + \sum_{i=1}^{n} \alpha_i\, k(x_{\mathrm{new}}, x_i | \theta), $$

where

$$ \alpha = \left(K(X,X|\theta) + \sigma^2 I_n\right)^{-1}(y - H\beta). $$
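
Once α is available, prediction reduces to a small amount of linear algebra. The MATLAB sketch below continues the earlier example (parameter estimates substituted for β, θ, and σ², squared exponential kernel assumed) and computes α, the predictive mean, and the predictive variance at a grid of new points.

```matlab
% Sketch: exact GPR prediction at new points xnew, reusing alpha for the mean.
% Kxx, C, betaHat, thetaHat, sigma2Hat, x, y, H, n are from the previous sketches.
xnew  = linspace(0, 10, 200)';
m     = numel(xnew);
Hnew  = [ones(m,1) xnew];                                          % basis at the new points
Ksx   = thetaHat(2)^2 * exp(-(xnew - x').^2 / (2*thetaHat(1)^2));  % K(xnew, X | theta)

alpha = C \ (y - H*betaHat);                        % (K(X,X) + sigma2*I)^(-1) (y - H*beta)
yMean = Hnew*betaHat + Ksx*alpha;                   % E(ynew | y, X, xnew)
Sig   = thetaHat(2)^2 - sum((Ksx / C) .* Ksx, 2);   % k(xnew,xnew) - K(xnew,X)(K+sigma2*I)^(-1)K(X,xnew)
yVar  = Sig + sigma2Hat;                            % predictive variance of ynew
```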

Computational Complexity of Exact Parameter Estimation and Prediction

Training a GPR model with the exact method (when FitMethod is 'Exact') requires the inversion of an n-by-n kernel matrix K(X,X). The memory requirement for this step scales as O(n²) since K(X,X) must be stored in memory. One evaluation of log P(y|X) scales as O(n³). Therefore, the computational complexity is O(kn³), where k is the number of function evaluations needed for maximization and n is the number of observations.

Making predictions on new data involves the computation of α̂. If prediction intervals are desired, this step could also involve the computation and storage of the Cholesky factor of (K(X,X)+σ²In) for later use. The computational complexity of this step using the direct computation of α̂ is O(n³) and the memory requirement is O(n²).
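
A minimal MATLAB sketch of that pattern, continuing the example above: the Cholesky factor of K(X,X) + σ²In is computed once at training time, and each additional prediction then needs only matrix-vector products and triangular solves.

```matlab
% Sketch: factor K(X,X) + sigma2*I once at training time, reuse it for predictions.
R     = chol(C);                           % O(n^3) flops; stored factor needs O(n^2) memory
alpha = R \ (R' \ (y - H*betaHat));        % two O(n^2) triangular solves

% Each later prediction at a single new point costs O(n^2):
kStar = thetaHat(2)^2 * exp(-(xnew(1) - x).^2 / (2*thetaHat(1)^2));   % K(X, xnew)
yHat  = Hnew(1,:)*betaHat + kStar'*alpha;                             % predictive mean
v     = R' \ kStar;
yStd  = sqrt(thetaHat(2)^2 + sigma2Hat - v'*v);                       % predictive standard deviation
```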

Hence, for large n, estimation of parameters or computing predictions can be very expensive. The approximation methods usually involve rearranging the computation so as to avoid the inversion of an n-by-n matrix. For the available approximation methods, please see the related links at the bottom of the page.

References

[1] Rasmussen, C. E. and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press. Cambridge, Massachusetts, 2006.

See Also


Related Topics