16. Evaluating and comparing models#

16.1. Log pointwise predictive density (lppd)#

Let \(p\) and \(q\) be the posterior predictive distributions under two different models \(\mathcal{P}\) and \(\mathcal{Q}\), and assume that our data have density \(f\). We prefer model \(\mathcal{P}\) over model \(\mathcal{Q}\) if the Kullback-Leibler divergence (KL divergence) between \(f\) and \(p\) is smaller than the KL divergence between \(f\) and \(q\). That is, we prefer \(\mathcal{P}\) over \(\mathcal{Q}\) if

\[\begin{split} \begin{align*} KL(f||p) &< KL(f||q) \\\\ \Leftrightarrow \mathbb{E}_{Y\sim f}[\log f(Y)] - \mathbb{E}_{Y\sim f}[\log p(Y|\mathbf{Y})] &< \mathbb{E}_{Y\sim f}[\log f(Y)]-\mathbb{E}_{Y\sim f}[\log q(Y|\mathbf{Y})] \\\\ \Leftrightarrow \mathbb{E}_{Y\sim f}[\log p(Y|\mathbf{Y})] &> \mathbb{E}_{Y\sim f}[\log q(Y|\mathbf{Y})] \end{align*} \end{split}\]

The previous expression involves an expected value with respect to the true distribution \(f\), which we cannot compute directly. To get around this, we can once again use the data \(Y_1,\ldots, Y_n\), which we know have density \(f\), and replace the expectation with a sample average. Then, we prefer model \(\mathcal{P}\) over model \(\mathcal{Q}\) if

\[\begin{split} \begin{align*} \frac{1}{n}\sum_{i=1}^n \log p(Y_i|\mathbf{Y}) &> \frac{1}{n}\sum_{i=1}^n \log q(Y_i|\mathbf{Y}) \\ \Leftrightarrow \sum_{i=1}^n \log p(Y_i|\mathbf{Y}) &> \sum_{i=1}^n \log q(Y_i|\mathbf{Y}). \end{align*} \end{split}\]

On the other hand, remember that the posterior predictive distribution can be expressed as

\[p(Y|\mathbf{Y}) = \int_\Theta p(Y|\theta) p(\theta|\mathbf{Y})d\theta,\]

thus, we prefer model \(\mathcal{P}\) over model \(\mathcal{Q}\) if

\[\sum_{i=1}^n \log \int_\Theta p(Y_i|\theta) p(\theta|\mathbf{Y})d\theta > \sum_{i=1}^n \log \int_\Theta q(Y_i|\theta) q(\theta|\mathbf{Y})d\theta.\]

The quantity

\[\sum_{i=1}^n \log \int_\Theta p(Y_i|\theta) p(\theta|\mathbf{Y})d\theta\]

is called log pointwise predictive density (lppd).

16.2. Computed lppd#

If we have a sample \(\tilde\theta_1,\ldots,\tilde\theta_S\) from the posterior distribution, we can approximate

\[\int_\Theta p(Y|\theta) p(\theta|\mathbf{Y})d\theta\approx \frac{1}{S}\sum_{s=1}^S p(Y|\tilde\theta_s).\]

Therefore, we prefer model \(\mathcal{P}\) over model \(\mathcal{Q}\) if

\[\sum_{i=1}^n \log\left[\frac{1}{S}\sum_{s=1}^S p(Y_i|\tilde\theta_s)\right] > \sum_{i=1}^n \log\left[\frac{1}{S}\sum_{s=1}^S q(Y_i|\tilde\theta_s)\right].\]

I will call the quantity

\[\sum_{i=1}^n \log\left[\frac{1}{S}\sum_{s=1}^S p(Y_i|\tilde\theta_s)\right]\]

the computed log pointwise predictive density (computed lppd).

Note that if \(S\) is sufficiently large, the computed lppd approximates the lppd very well. For this reason, many authors define the lppd directly as this second expression.
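As a direct translation of this formula into code, consider the following minimal sketch. It assumes a hypothetical \((S, n)\) array `log_lik` whose entry `[s, i]` is \(\log p(Y_i|\tilde\theta_s)\); the values used here are simulated placeholders only so the snippet runs on its own.

```python
import numpy as np

# Minimal sketch: log_lik is an (S, n) array with log_lik[s, i] = log p(Y_i | theta_s).
# Filled with simulated placeholder values just to make the snippet runnable.
rng = np.random.default_rng(0)
log_lik = rng.normal(loc=-1.0, scale=0.5, size=(2000, 100))

# Average the densities (not the log-densities) over the S draws,
# take the log, and sum over the n observations.
computed_lppd = np.sum(np.log(np.mean(np.exp(log_lik), axis=0)))
print(computed_lppd)
```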

In principle, we only need to evaluate the density of each observation \(Y_i\) at each sampled parameter \(\tilde\theta_s\). In practice, however, we can run into numerical problems (the individual densities can underflow to zero), so it is convenient to express the computed lppd as

\[\begin{split} \begin{align*} \textsf{computed lppd} & = \sum_{i=1}^n \log\left[\frac{1}{S}\sum_{s=1}^S p(Y_i|\tilde\theta_s)\right] \\\\ & = \sum_{i=1}^n \left[\log\sum_{s=1}^S p(Y_i|\tilde\theta_s) - \log(S)\right] \\\\ & = \sum_{i=1}^n \left[\log\sum_{s=1}^S \exp\left\{\log p(Y_i|\tilde\theta_s)\right\} - \log(S)\right]. \end{align*} \end{split}\]

The function \(\log\sum\exp\{\cdot\}\) is available in many languages under the name logsumexp, implemented efficiently and in a way that preserves numerical precision. Thus, we can obtain the computed lppd as

\[\textsf{computed lppd}=\sum_{i=1}^n \left[\textsf{logsumexp} (\log p(Y_i|\tilde\theta_s)) - \log(S)\right]\]
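For example, a minimal sketch in Python, again assuming a hypothetical \((S, n)\) array `log_lik` of pointwise log-likelihoods as in the previous snippet, and using `scipy.special.logsumexp`:

```python
import numpy as np
from scipy.special import logsumexp

# Minimal sketch, assuming log_lik is an (S, n) array with
# log_lik[s, i] = log p(Y_i | theta_s).
def computed_lppd(log_lik):
    S = log_lik.shape[0]
    # logsumexp over the posterior draws (axis 0), minus log(S), summed over observations.
    return np.sum(logsumexp(log_lik, axis=0) - np.log(S))
```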

The ideal setting for evaluating a model is when we have access to a training dataset, used to fit the model, and a testing dataset, used to calculate the model's performance metrics. That is, assume that we have not only our training data \(Y_1,\ldots,Y_n\), but also testing data \(\tilde{Y}_1,\ldots,\tilde{Y}_m\) (note that while I have been using the notation \(\tilde{Y}\) to refer to simulations of the variable \(Y\), in this context it denotes an observation from the testing data), which is used to calculate our performance metric:

\[\textsf{computed lppd} = \sum_{i=1}^m \left[\textsf{logsumexp} (\log p(\tilde{Y}_i|\tilde\theta_s)) - \log(S)\right].\]

However, when we do not have a testing dataset, we must take into account that we are using our training dataset twice: once to fit the model and again to evaluate it. This makes the performance metric overly optimistic about the model. The plan, then, is to calculate the lppd and add a penalization term that corrects the estimate and avoids favoring overfit models. The metrics obtained with this procedure are known as information criteria.

16.3. Information criteria#

For historical reasons these measures of a model's predictive performance are called information criteria, and they are typically defined in terms of the deviance. It is important to clarify that there is no unique consensus on how to define these information criteria, so their definitions show small variations across the literature. This is especially true for criteria that were developed in non-Bayesian frameworks and later adapted.

16.3.1. Deviance#

Typically, the deviance is defined as \(-2\) times the log-likelihood of the data with the parameters fixed at some value, that is, \(-2\log p(\mathbf{Y}|\hat{\theta})\). However, within a Bayesian framework, some authors redefine it as \(-2\) times the lppd. The factor \(-2\) is there for historical reasons.

16.3.2. Akaike information criterion (AIC)#

The best-known information criterion is the Akaike information criterion (AIC). The simplest correction to the lppd is based on the asymptotic behavior of the posterior distribution. Let \(k\) be the number of parameters estimated in the model. In the asymptotic case (or in special cases such as the normal linear model with known variance), we can correct the overestimation of the model's predictive power by subtracting \(k\) from the log-likelihood evaluated at the maximum likelihood estimator,

\[\log p(\mathbf{Y}|\hat\theta_{\text{mle}})-k.\]

The AIC is defined as the previous expression multiplied by \(-2\):

\[\text{AIC}=-2\log p(\mathbf{Y}|\hat\theta_{\text{mle}})+2k.\]

From a Bayesian perspective, some authors redefine the AIC as

\[\text{AIC}=-2\textsf{lppd}+2k.\]
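As a quick illustration, here is a hedged sketch of the AIC computation, where `log_lik_mle` and `k` are hypothetical placeholders for the log-likelihood at the MLE and the number of estimated parameters:

```python
# Minimal sketch: log_lik_mle and k are placeholders for illustration.
def aic(log_lik_mle, k):
    return -2.0 * log_lik_mle + 2.0 * k

# Example: a model whose log-likelihood at the MLE is -250.3 and that estimates 4 parameters.
print(aic(-250.3, 4))  # 508.6
```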

When our model goes beyond a linear model with uniform priors, it is inappropriate to simply subtract the number of parameters. For models with informative priors or a hierarchical structure, the effective number of parameters depends on the variance of the group-level parameters.

16.3.3. Deviance information criterion (DIC)#

The deviance information criterion (DIC) is something of a Bayesian version of the AIC, making two changes. The first is that the maximum likelihood estimator is replaced by the posterior mean \(\hat{\theta}_{\text{Bayes}}=\mathbb{E}(\theta|\mathbf{Y})\); the second is that \(k\) is replaced by the effective number of parameters \(p_{\text{DIC}}\).

The DIC is then defined as

\[\text{DIC}=-2\log p(\mathbf{Y}|\hat{\theta}_{\text{Bayes}})+2p_{\text{DIC}}.\]

There are at least two different approaches to define the effective number of parameters:

\[p_{\text{DIC}}=2\left(\log p(\mathbf{Y}|\hat{\theta}_{\text{Bayes}})-\mathbb{E}_{\theta\sim p(\theta|\mathbf{Y})} (\log p(\mathbf{Y}|\theta))\right),\]

using a posterior sample \(\tilde\theta_1,\ldots,\tilde\theta_S\), the previous expression can be approximated by

\[\textsf{computed } p_{\text{DIC}}=2\left(\log p(\mathbf{Y}|\hat{\theta}_{\text{Bayes}})-\frac{1}{S}\sum_{s=1}^S \log p(\mathbf{Y}|\tilde\theta_s)\right).\]

If the posterior mean is far from the posterior mode, this definition has the drawback that \(p_{\text{DIC}}\) might be negative.

An alternative approach for the effective number of parameters is given by

\[p_{\text{DIC alt}} = 2\mathbb{V}_{\theta\sim p(\theta|\mathbf{Y})}\left(\log p(\mathbf{Y}|\theta)\right),\]

which can be approximated by

\[\textsf{computed } p_{\text{DIC alt}}=2 \mathbb{V}_{s=1}^S \log p(\mathbf{Y}|\tilde\theta_s),\]

where \(\mathbb{V}_{s=1}^S\) represents the sample variance

\[\mathbb{V}_{s=1}^S a_s = \frac{1}{S-1}\sum_{s=1}^S (a_s-\bar{a})^2.\]
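Putting the pieces together, here is a minimal sketch of the DIC with either effective-number-of-parameters definition. It assumes a hypothetical length-\(S\) array `log_lik_joint` with entries \(\log p(\mathbf{Y}|\tilde\theta_s)\) and a scalar `log_lik_at_mean` equal to \(\log p(\mathbf{Y}|\hat\theta_{\text{Bayes}})\); both names are placeholders.

```python
import numpy as np

# Minimal sketch. log_lik_joint[s] = log p(Y | theta_s) (joint log-likelihood at draw s),
# log_lik_at_mean = log p(Y | theta_Bayes); both are assumed inputs.
def dic(log_lik_at_mean, log_lik_joint, alt=False):
    if alt:
        p_dic = 2.0 * np.var(log_lik_joint, ddof=1)                # computed p_DIC alt
    else:
        p_dic = 2.0 * (log_lik_at_mean - np.mean(log_lik_joint))   # computed p_DIC
    # Return the DIC and the effective number of parameters used.
    return -2.0 * log_lik_at_mean + 2.0 * p_dic, p_dic
```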

16.3.4. Widely applicable information criterion or Watanabe-Akaike information criterion (WAIC)#

The widely applicable information criterion or the Watanabe-Akaike information criterion (WAIC) is defined as

\[\text{WAIC}=-2\textsf{lppd} + 2 p_{\text{WAIC}}.\]

This information criterion also admits at least two different definitions of the effective number of parameters. The first is

\[\begin{split} \begin{align*} p_{\text{WAIC 1}} &= 2\sum_{i=1}^n\left(\log(\mathbb{E}_{\theta\sim p(\theta|\mathbf{Y})} p(Y_i|\theta)) - \mathbb{E}_{\theta\sim p(\theta|\mathbf{Y})}(\log p(Y_i|\theta))\right) \\\\ \textsf{computed }p_{\text{WAIC 1}} &= 2\sum_{i=1}^n\left(\log\left[\frac{1}{S}\sum_{s=1}^S p(Y_i|\tilde\theta_s)\right] - \frac{1}{S}\sum_{s=1}^S \log p(Y_i|\tilde\theta_s)\right) \\ & = 2 \textsf{ computed lppd} - \frac{2}{S}\sum_{i=1}^n\sum_{s=1}^S \log p(Y_i|\tilde\theta_s). \end{align*} \end{split}\]

The second approach to define the effective number of parameters is given by

\[p_{\text{WAIC 2}}=\sum_{i=1}^n\mathbb{V}_{\theta\sim p(\theta|\mathbf{Y})} \log p(Y_i|\theta),\]
\[\textsf{computed }p_{\text{WAIC 2}}=\sum_{i=1}^n\mathbb{V}_{s=1}^S \log p(Y_i|\tilde\theta_s).\]
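Both variants can be computed from the same hypothetical \((S, n)\) array of pointwise log-likelihoods used in the computed lppd sketches above; a minimal sketch under that assumption:

```python
import numpy as np
from scipy.special import logsumexp

# Minimal sketch, assuming log_lik is an (S, n) array with log_lik[s, i] = log p(Y_i | theta_s).
def waic(log_lik, version=1):
    S = log_lik.shape[0]
    lppd_i = logsumexp(log_lik, axis=0) - np.log(S)                 # pointwise computed lppd
    if version == 1:
        p_waic = 2.0 * np.sum(lppd_i - np.mean(log_lik, axis=0))    # computed p_WAIC 1
    else:
        p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))            # computed p_WAIC 2
    # WAIC = -2 * lppd + 2 * p_WAIC
    return -2.0 * (np.sum(lppd_i) - p_waic), p_waic
```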

16.3.5. Cross validation#

We can fix the optimism created by using the data twice through cross validation. However, cross validation demands considerable computational resources, because it requires splitting the data into pieces and refitting the model for each one. The extreme case, leave-one-out cross-validation (LOO-CV), requires splitting the data into \(n\) pieces, as sketched below.
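The sketch below only illustrates the structure of exact LOO-CV. It relies on two hypothetical helpers, `fit_posterior` (returns posterior draws given a dataset) and `log_pred_density` (computed log predictive density of one held-out observation under those draws); both are assumptions used to show the refitting loop.

```python
import numpy as np

# Conceptual sketch of exact LOO-CV. fit_posterior and log_pred_density are
# hypothetical helpers: the key point is that the model is refit n times.
def loo_cv(Y, fit_posterior, log_pred_density):
    scores = []
    for i in range(len(Y)):
        Y_minus_i = np.delete(Y, i)        # leave out observation i
        draws = fit_posterior(Y_minus_i)   # refit the model without it
        scores.append(log_pred_density(Y[i], draws))  # evaluate on the held-out point
    return np.sum(scores)
```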

16.3.6. Pareto-smoothed importance sampling cross-validation (PSIS)#

There exist clever solutions that approximate cross validation without fitting the model several times. One solution uses the “importance” of each observation in the posterior distribution: an observation is important if it has a large impact on the posterior, meaning that removing it would change the posterior more. This method is known as Pareto-smoothed importance sampling cross-validation (PSIS).
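In practice we rarely implement PSIS by hand; for example, the ArviZ library provides PSIS-LOO (and WAIC) estimates computed from the pointwise log-likelihood stored in an InferenceData object. A hedged usage sketch with one of ArviZ’s built-in example datasets follows; note that ArviZ reports these quantities on the log (elpd) scale by default, not multiplied by \(-2\).

```python
import arviz as az

# Usage sketch with a built-in example dataset (an eight-schools model).
idata = az.load_arviz_data("centered_eight")
print(az.loo(idata, pointwise=True))   # PSIS-LOO with per-observation diagnostics
print(az.waic(idata))                  # WAIC, for comparison
```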

16.4. Parallel experiments in eight schools#

In Chapter 14 we analyzed the case of parallel experiments in eight schools. In the second part of the notebook 23_EducationalTestingExample.ipynb, you can find the computation of the different information criteria for the fitted models.