Bayesian model reduction

Bayesian model reduction is a method for computing the evidence and posterior over the parameters of Bayesian models that differ in their priors.^[1]^[2] A full model is fitted to data using standard approaches. Hypotheses are then tested by defining one or more 'reduced' models with alternative (and usually more restrictive) priors, which in the limit switch off certain parameters. The evidence and posterior parameter estimates of the reduced models can then be computed from the evidence and posterior parameter estimates of the full model, without refitting to the data. If the priors and posteriors are normally distributed, there is an analytic solution that can be computed rapidly. Applications include scoring the evidence for large numbers of models very quickly and facilitating the estimation of hierarchical models (parametric empirical Bayes).

Theory

Consider some model with parameters $\theta$ and a prior probability density on those parameters, $p(\theta )$. The posterior belief about $\theta$ after seeing the data $y$, denoted $p(\theta \mid y)$, is given by Bayes' rule:

{\begin{aligned}p(\theta \mid y)&={\frac {p(y\mid \theta )p(\theta )}{p(y)}}\\p(y)&=\int p(y\mid \theta )p(\theta )\,d\theta \end{aligned}}

(1)

The second line of Equation 1 is the model evidence, which is the probability of observing the data given the model. In practice, the posterior usually cannot be computed analytically because the integral over the parameters is intractable. The posterior is therefore estimated using approaches such as MCMC sampling or variational Bayes. A reduced model can then be defined with an alternative set of priors ${\tilde {p}}(\theta )$:

{\begin{aligned}{\tilde {p}}(\theta \mid y)&={\frac {p(y\mid \theta ){\tilde {p}}(\theta )}{{\tilde {p}}(y)}}\\{\tilde {p}}(y)&=\int p(y\mid \theta ){\tilde {p}}(\theta )\,d\theta \end{aligned}}

(2)

The objective of Bayesian model reduction is to compute the posterior ${\tilde {p}}(\theta \mid y)$ and evidence ${\tilde {p}}(y)$ of the reduced model from the posterior $p(\theta \mid y)$ and evidence $p(y)$ of the full model. Combining Equation 1 and Equation 2 and re-arranging, the reduced posterior ${\tilde {p}}(\theta \mid y)$ can be expressed as the product of the full posterior, the ratio of priors and the ratio of evidences:

{\begin{aligned}{\frac {{\tilde {p}}(\theta \mid y){\tilde {p}}(y)}{p(\theta \mid y)p(y)}}&={\frac {p(y\mid \theta ){\tilde {p}}(\theta )}{p(y\mid \theta )p(\theta )}}\\\Rightarrow {\tilde {p}}(\theta \mid y)&=p(\theta \mid y){\frac {{\tilde {p}}(\theta )}{p(\theta )}}{\frac {p(y)}{{\tilde {p}}(y)}}\end{aligned}}

(3)

The evidence for the reduced model is obtained by integrating each side of Equation 3 over the parameters. Because the reduced posterior is a probability density, it integrates to one:

\int {\tilde {p}}(\theta \mid y)\,d\theta =\int p(\theta \mid y){\frac {{\tilde {p}}(\theta )}{p(\theta )}}{\frac {p(y)}{{\tilde {p}}(y)}}\,d\theta =1

(4)

And by re-arrangement:

{\begin{aligned}1&=\int p(\theta \mid y){\frac {{\tilde {p}}(\theta )}{p(\theta )}}{\frac {p(y)}{{\tilde {p}}(y)}}\,d\theta \\&={\frac {p(y)}{{\tilde {p}}(y)}}\int p(\theta \mid y){\frac {{\tilde {p}}(\theta )}{p(\theta )}}\,d\theta \\\Rightarrow {\tilde {p}}(y)&=p(y)\int p(\theta \mid y){\frac {{\tilde {p}}(\theta )}{p(\theta )}}\,d\theta \end{aligned}}

(5)
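
Equation 5 can be checked numerically on a toy problem where everything is known in closed form. The sketch below assumes a hypothetical model with a single observation $y\sim N(\theta ,1)$ and zero-mean Gaussian priors; the data value, noise level, prior widths and variable names are illustrative assumptions, not part of the method itself. The integral in Equation 5 is approximated by a sum over a grid and compared with the reduced evidence computed directly.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical toy model: one observation y ~ N(theta, noise_sd^2).
y, noise_sd = 0.8, 1.0
full_prior_sd = 0.5      # full prior     p(theta)  = N(0, 0.5^2)
reduced_prior_sd = 0.1   # reduced prior  p~(theta) = N(0, 0.1^2)

# Exact posterior and evidence of the full model (conjugate Gaussian results).
post_var = 1.0 / (1.0 / full_prior_sd**2 + 1.0 / noise_sd**2)
post_mean = post_var * y / noise_sd**2
full_evidence = normal_pdf(y, 0.0, math.sqrt(noise_sd**2 + full_prior_sd**2))

# Equation 5: integrate posterior * (reduced prior / full prior) over theta.
n, lo, hi = 20001, -4.0, 4.0
step = (hi - lo) / (n - 1)
integral = 0.0
for i in range(n):
    theta = lo + i * step
    prior_ratio = (normal_pdf(theta, 0.0, reduced_prior_sd)
                   / normal_pdf(theta, 0.0, full_prior_sd))
    integral += normal_pdf(theta, post_mean, math.sqrt(post_var)) * prior_ratio * step
reduced_evidence_bmr = full_evidence * integral

# Reduced evidence computed directly from the reduced model, for comparison.
reduced_evidence_exact = normal_pdf(
    y, 0.0, math.sqrt(noise_sd**2 + reduced_prior_sd**2))

print(reduced_evidence_bmr, reduced_evidence_exact)
```

The two printed values should agree to within the accuracy of the grid approximation, since only the full posterior, the two priors and the full evidence enter the calculation.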

Gaussian priors and posteriors

Under Gaussian prior and posterior densities, as are used in the context of variational Bayes, Bayesian model reduction has a simple analytical solution.^[1] First define normal densities for the priors and posteriors:

{\begin{aligned}p(\theta )&=N(\theta ;\mu _{0},\Sigma _{0})\\{\tilde {p}}(\theta )&=N(\theta ;{\tilde {\mu }}_{0},{\tilde {\Sigma }}_{0})\\p(\theta \mid y)&=N(\theta ;\mu ,\Sigma )\\{\tilde {p}}(\theta \mid y)&=N(\theta ;{\tilde {\mu }},{\tilde {\Sigma }})\\\end{aligned}}

(6)

where the tilde symbol (~) indicates quantities relating to the reduced model and subscript zero (as in $\mu _{0}$) indicates parameters of the priors. For convenience we also define precision matrices, which are the inverses of the corresponding covariance matrices:

{\begin{aligned}\Pi &=\Sigma ^{-1}\\\Pi _{0}&=\Sigma _{0}^{-1}\\{\tilde {\Pi }}&={\tilde {\Sigma }}^{-1}\\{\tilde {\Pi }}_{0}&={\tilde {\Sigma }}_{0}^{-1}\\\end{aligned}}

(7)

The free energy of the full model $F$ is an approximation (lower bound) to the log model evidence, $F\approx \ln {p(y)}$, which is optimised explicitly in variational Bayes (or can be recovered from sampling approximations). The reduced model's free energy ${\tilde {F}}$ and parameters $({\tilde {\mu }},{\tilde {\Sigma }})$ are then given by the expressions:

{\begin{aligned}{\tilde {F}}&={\frac {1}{2}}\ln |{\tilde {\Pi }}_{0}\cdot \Pi \cdot {\tilde {\Sigma }}\cdot \Sigma _{0}|\\&-{\frac {1}{2}}(\mu ^{T}\Pi \mu +{\tilde {\mu }}_{0}^{T}{\tilde {\Pi }}_{0}{\tilde {\mu }}_{0}-\mu _{0}^{T}\Pi _{0}\mu _{0}-{\tilde {\mu }}^{T}{\tilde {\Pi }}{\tilde {\mu }})+F\\{\tilde {\mu }}&={\tilde {\Sigma }}(\Pi \mu +{\tilde {\Pi }}_{0}{\tilde {\mu }}_{0}-\Pi _{0}\mu _{0})\\{\tilde {\Sigma }}&=(\Pi +{\tilde {\Pi }}_{0}-\Pi _{0})^{-1}\\\end{aligned}}

(8)
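
As a minimal sketch, Equation 8 can be written in Python with NumPy. The function and argument names are illustrative assumptions and this is not the SPM implementation. The sketch assumes that all covariance matrices are invertible and that the reduced posterior precision $\Pi +{\tilde {\Pi }}_{0}-\Pi _{0}$ is positive definite, which holds when the reduced prior is at least as precise as the full prior.

```python
import numpy as np

def bayesian_model_reduction(mu, Sigma, mu0, Sigma0, mu0_r, Sigma0_r, F=0.0):
    """Gaussian Bayesian model reduction (Equation 8).

    mu, Sigma       : posterior mean and covariance of the full model
    mu0, Sigma0     : prior mean and covariance of the full model
    mu0_r, Sigma0_r : prior mean and covariance of the reduced model
    F               : free energy of the full model (default 0 gives the change)
    Returns (F_r, mu_r, Sigma_r) for the reduced model.
    """
    mu, mu0, mu0_r = (np.asarray(v, dtype=float) for v in (mu, mu0, mu0_r))

    # Precision matrices (Equation 7).
    Pi = np.linalg.inv(Sigma)
    Pi0 = np.linalg.inv(Sigma0)
    Pi0_r = np.linalg.inv(Sigma0_r)

    # Reduced posterior precision, covariance and mean.
    Pi_r = Pi + Pi0_r - Pi0
    Sigma_r = np.linalg.inv(Pi_r)
    mu_r = Sigma_r @ (Pi @ mu + Pi0_r @ mu0_r - Pi0 @ mu0)

    # ln|Pi0_r Pi Sigma_r Sigma0| written as a sum of log-determinants.
    def logdet(A):
        return np.linalg.slogdet(A)[1]
    log_det_term = logdet(Pi0_r) + logdet(Pi) - logdet(Pi_r) - logdet(Pi0)

    # Quadratic terms in the means.
    quad = (mu @ Pi @ mu + mu0_r @ Pi0_r @ mu0_r
            - mu0 @ Pi0 @ mu0 - mu_r @ Pi_r @ mu_r)

    F_r = 0.5 * log_det_term - 0.5 * quad + F
    return F_r, mu_r, Sigma_r
```

Calling the function with the reduced prior set equal to the full prior returns the full model's free energy and posterior unchanged, which is a convenient sanity check.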

Example

Consider a model with a parameter $\theta$ and Gaussian prior $p(\theta )=N(0,0.5^{2})$, which is the Normal distribution with mean zero and standard deviation 0.5 (illustrated in the Figure, left). This prior says that without any data, the parameter is expected to have value zero, but we are willing to entertain positive or negative values (with a 99% credible interval of approximately [−1.29, 1.29]). The model with this prior is fitted to the data, to provide an estimate of the posterior over the parameter, $q(\theta )$, and the model evidence $p(y)$.

To assess whether the parameter contributed to the model evidence, i.e. whether we learnt anything about this parameter, an alternative 'reduced' model is specified in which the parameter has a prior with a much smaller variance: e.g. ${\tilde {p}}(\theta )=N(0,0.001^{2})$. This is illustrated in the Figure (right). This prior effectively 'switches off' the parameter, saying that we are almost certain that it has value zero. The posterior ${\tilde {q}}(\theta )$ and evidence ${\tilde {p}}(y)$ for this reduced model are rapidly computed from the full model using Bayesian model reduction.

The hypothesis that the parameter contributed to the model is then tested by comparing the full and reduced models via the Bayes factor, which is the ratio of model evidences:

{\text{BF}}={\frac {p(y)}{{\tilde {p}}(y)}}

The larger this ratio, the greater the evidence for the full model, which included the parameter as a free parameter. Conversely, the stronger the evidence for the reduced model, the more confident we can be that the parameter did not contribute. Note this method is not specific to comparing 'switched on' or 'switched off' parameters, and any intermediate setting of the priors could also be evaluated.
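
The example can be made concrete with the scalar form of Equation 8. In the sketch below the priors are those given above, but the two full-model posteriors are hypothetical values chosen for illustration, and the function name is an assumption. The log Bayes factor is approximated by the difference in free energies, $\ln {\text{BF}}\approx F-{\tilde {F}}$.

```python
import math

def reduce_scalar(mu, sd, mu0, sd0, mu0_r, sd0_r):
    """Scalar case of Equation 8.

    Returns (dF, mu_r, sd_r), where dF = F_reduced - F_full.
    """
    pi, pi0, pi0_r = 1 / sd**2, 1 / sd0**2, 1 / sd0_r**2  # precisions
    pi_r = pi + pi0_r - pi0                                 # reduced posterior precision
    mu_r = (pi * mu + pi0_r * mu0_r - pi0 * mu0) / pi_r     # reduced posterior mean
    dF = 0.5 * math.log(pi0_r * pi / (pi_r * pi0)) - 0.5 * (
        pi * mu**2 + pi0_r * mu0_r**2 - pi0 * mu0**2 - pi_r * mu_r**2
    )
    return dF, mu_r, 1 / math.sqrt(pi_r)

# Priors from the example: full N(0, 0.5^2), reduced N(0, 0.001^2).
# The full-model posteriors (mean, sd) below are hypothetical.
cases = [("posterior clearly non-zero", 0.4, 0.1),
         ("posterior close to zero", 0.02, 0.1)]
for label, post_mean, post_sd in cases:
    dF, mu_r, sd_r = reduce_scalar(post_mean, post_sd, 0.0, 0.5, 0.0, 0.001)
    log_bf = -dF  # ln BF = ln p(y) - ln p~(y), approximated by F - F~
    print(f"{label}: ln BF = {log_bf:.2f}, reduced posterior mean = {mu_r:.2e}")
```

Under these assumed posteriors, the log Bayes factor is positive in the first case (favouring the full model) and negative in the second (favouring the reduced model, in which the parameter is switched off).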

Applications

Neuroimaging

Bayesian model reduction was initially developed for use in neuroimaging analysis,^[1]^[3] in the context of modelling brain connectivity, as part of the dynamic causal modelling framework (where it was originally referred to as post-hoc Bayesian model selection).^[1] Dynamic causal models (DCMs) are differential equation models of brain dynamics.^[4] The experimenter specifies multiple competing models which differ in their priors – e.g. in the choice of parameters which are fixed at their prior expectation of zero. Having fitted a single 'full' model with all parameters of interest informed by the data, Bayesian model reduction enables the evidence and parameters for competing models to be rapidly computed, in order to test hypotheses. These models can be specified manually by the experimenter, or searched over automatically, in order to 'prune' any redundant parameters which do not contribute to the evidence.
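
The automatic search described above can be sketched as a brute-force enumeration: every pattern of switched-on and switched-off parameters defines a reduced prior, and each is scored by its change in free energy using the Gaussian expressions. This is an illustrative sketch under stated assumptions (function names, the `off_var` shrinkage variance and the example numbers are hypothetical), not the search procedure used in any particular software package. Exhaustive enumeration scores $2^{n}$ models for $n$ parameters, so it is practical only for small $n$.

```python
import itertools
import numpy as np

def delta_free_energy(mu, Sigma, mu0, Sigma0, mu0_r, Sigma0_r):
    """Free energy of the reduced model minus that of the full model."""
    Pi, Pi0, Pi0_r = (np.linalg.inv(S) for S in (Sigma, Sigma0, Sigma0_r))
    Pi_r = Pi + Pi0_r - Pi0
    mu_r = np.linalg.solve(Pi_r, Pi @ mu + Pi0_r @ mu0_r - Pi0 @ mu0)

    def logdet(A):
        return np.linalg.slogdet(A)[1]
    log_det_term = logdet(Pi0_r) + logdet(Pi) - logdet(Pi_r) - logdet(Pi0)
    quad = (mu @ Pi @ mu + mu0_r @ Pi0_r @ mu0_r
            - mu0 @ Pi0 @ mu0 - mu_r @ Pi_r @ mu_r)
    return 0.5 * log_det_term - 0.5 * quad

def search_reduced_models(mu, Sigma, mu0, Sigma0, off_var=1e-6):
    """Score every on/off pattern; return (dF, pattern) pairs, best first."""
    results = []
    for pattern in itertools.product([True, False], repeat=len(mu)):
        on = np.array(pattern)
        # Switched-off parameters get prior mean 0, a tiny prior variance,
        # and no prior covariance with the other parameters.
        mu0_r = np.where(on, mu0, 0.0)
        Sigma0_r = (np.where(np.outer(on, on), Sigma0, 0.0)
                    + np.diag(np.where(on, 0.0, off_var)))
        dF = delta_free_energy(mu, Sigma, mu0, Sigma0, mu0_r, Sigma0_r)
        results.append((float(dF), pattern))
    return sorted(results, key=lambda r: r[0], reverse=True)

# Hypothetical full-model posterior: parameter 0 is clearly non-zero,
# parameter 1 is close to zero.
mu = np.array([0.4, 0.02])
Sigma = np.diag([0.01, 0.01])
mu0 = np.zeros(2)
Sigma0 = np.diag([0.25, 0.25])

ranked = search_reduced_models(mu, Sigma, mu0, Sigma0)
for dF, pattern in ranked:
    print(pattern, round(dF, 2))
```

With these assumed values, the highest-scoring pattern keeps parameter 0 and prunes parameter 1, because switching off a parameter whose posterior is close to zero increases the free energy.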

Bayesian model reduction was subsequently generalised and applied to other forms of Bayesian models, for example parametric empirical Bayes (PEB) models of group effects.^[2] Here, it is used to compute the evidence and parameters for any given level of a hierarchical model under constraints (empirical priors) imposed by the level above.

Neurobiology

Bayesian model reduction has been used to explain functions of the brain. By analogy to its use in eliminating redundant parameters from models of experimental data, it has been proposed that the brain eliminates redundant parameters of internal models of the world while offline (e.g. during sleep).^[5]^[6]

Software implementations

Bayesian model reduction is implemented in the Statistical Parametric Mapping (SPM) toolbox, in the MATLAB function spm_log_evidence_reduce.m.

References

[1] Friston, Karl; Penny, Will (June 2011). "Post hoc Bayesian model selection". NeuroImage. 56 (4): 2089–2099. doi:10.1016/j.neuroimage.2011.03.062. ISSN 1053-8119. PMC 3112494. PMID 21459150.
[2] Friston, Karl J.; Litvak, Vladimir; Oswal, Ashwini; Razi, Adeel; Stephan, Klaas E.; van Wijk, Bernadette C.M.; Ziegler, Gabriel; Zeidman, Peter (March 2016). "Bayesian model reduction and empirical Bayes for group (DCM) studies". NeuroImage. 128: 413–431. doi:10.1016/j.neuroimage.2015.11.015. ISSN 1053-8119. PMC 4767224. PMID 26569570.
[3] Rosa, M.J.; Friston, K.; Penny, W. (June 2012). "Post-hoc selection of dynamic causal models". Journal of Neuroscience Methods. 208 (1): 66–78. doi:10.1016/j.jneumeth.2012.04.013. ISSN 0165-0270. PMC 3401996. PMID 22561579.
[4] Friston, K.J.; Harrison, L.; Penny, W. (August 2003). "Dynamic causal modelling". NeuroImage. 19 (4): 1273–1302. doi:10.1016/s1053-8119(03)00202-7. ISSN 1053-8119. PMID 12948688. S2CID 2176588.
[5] Friston, Karl J.; Lin, Marco; Frith, Christopher D.; Pezzulo, Giovanni; Hobson, J. Allan; Ondobaka, Sasha (October 2017). "Active Inference, Curiosity and Insight" (PDF). Neural Computation. 29 (10): 2633–2683. doi:10.1162/neco_a_00999. ISSN 0899-7667. PMID 28777724. S2CID 13354308.
[6] Tononi, Giulio; Cirelli, Chiara (February 2006). "Sleep function and synaptic homeostasis". Sleep Medicine Reviews. 10 (1): 49–62. doi:10.1016/j.smrv.2005.05.002. ISSN 1087-0792. PMID 16376591.
