9. Model checking#
Once we have accomplished the first two steps of a Bayesian analysis, the construction of the probability model and the computation of the posterior distribution, we should not skip the (relatively simple) step of checking how well our model explains the data and our knowledge of the phenomenon.
Because we know beforehand that our model cannot include all aspects of reality, we can try to find out which aspects are not captured, and assess whether the model is plausible enough to accomplish the purpose for which we built it in the first place. The point is not to ask whether our model is true or false, but to identify its principal deficiencies.
9.1. Test statistics and frequentist \(p\)-value#
From the frequentist framework, we define a test statistic \(T(\mathbf{Y})\) as a statistic of the data that is used to compare the observed data against replicas generated by our model. In this way, the classic \(p\)-value is defined as \(p_C = \Pr\left(T(\mathbf{Y}^{\text{rep}}) \geq T(\mathbf{Y}) \,|\, \theta\right)\).
That is, \(p_C\) is the probability of obtaining a statistic more extreme than the observed one, with \(\theta\) fixed. Here \(\theta\) can be a “null” value, as in hypothesis testing, or some estimate such as the maximum likelihood estimator.
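A minimal sketch of this idea in Python, assuming a hypothetical normal model with known unit variance, a null value `theta_null`, and the sample mean as test statistic (none of these choices come from the notes; they are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed data and a "null" value for theta
# (the mean of a normal model with known standard deviation 1).
y_obs = rng.normal(loc=0.3, scale=1.0, size=50)
theta_null = 0.0

def T(y):
    """Test statistic: here, the sample mean."""
    return np.mean(y)

# Simulate replicated datasets with theta fixed at the null value and
# estimate p_C = P(T(Y_rep) >= T(y_obs) | theta) by Monte Carlo.
S = 10_000
T_rep = np.array([T(rng.normal(theta_null, 1.0, size=y_obs.size)) for _ in range(S)])
p_C = np.mean(T_rep >= T(y_obs))
print(f"classical p-value (Monte Carlo estimate): {p_C:.3f}")
```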
9.2. Bayesian context#
In the Bayesian framework, we take advantage of having access to a generative distribution, namely the posterior predictive distribution. If the model fits the data well, then replicated data generated with our generative model should look similar to the observed data. In other words, the observed data should look plausible under the posterior predictive distribution. Thus, our basic technique for checking our model is to simulate data from the posterior predictive distribution and compare them with the observed data.
To have a fair comparison, the replicated data, \(\mathbf{Y}^{\text{rep}}\), must be (as the name suggests) replicas of the observed data. That is, \(\mathbf{Y}^{\text{rep}}\) must have the same dimension as \(\mathbf{Y}\), and if our model has predictor variables \(\mathbf{X}\), then we must use exactly the same values of the predictors.
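As a sketch of how such replicas can be generated, assume a hypothetical conjugate normal model with known variance (so the posterior is available in closed form); the data, prior values, and variable names below are illustrative, not part of the course examples:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: normal model with known variance sigma2 and a
# conjugate normal prior on the mean.
y_obs = rng.normal(loc=1.0, scale=2.0, size=40)
sigma2 = 4.0                      # known sampling variance
mu0, tau0_2 = 0.0, 10.0           # prior mean and variance

n = y_obs.size
tau_n2 = 1.0 / (1.0 / tau0_2 + n / sigma2)               # posterior variance
mu_n = tau_n2 * (mu0 / tau0_2 + y_obs.sum() / sigma2)    # posterior mean

# Draw S posterior samples of theta and, for each, a replica Y_rep
# with the same dimension as the observed data.
S = 1_000
theta_post = rng.normal(mu_n, np.sqrt(tau_n2), size=S)
y_rep = rng.normal(theta_post[:, None], np.sqrt(sigma2), size=(S, n))

# Compare the observed data with the replicas, e.g. via a summary statistic.
print("observed mean:", y_obs.mean())
print("replicated means (5%, 50%, 95%):",
      np.percentile(y_rep.mean(axis=1), [5, 50, 95]))
```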
9.3. Test quantity and Bayesian \(p\)-value#
To measure the discrepancies between the fitted model and the data, we define a test quantity \(T(\mathbf{Y},\theta)\). Because \(\theta\) is a random variable, the test quantity may depend not only on the data but also on the value of \(\theta\).
The Bayesian \(p\)-value is defined as the probability that the test quantity evaluated in the replicated data, \(T(\mathbf{Y}^{\text{rep}},\theta)\), is more extreme than the test quantity evaluated in the observed data; that is, \(p_B=\Pr\left(T(\mathbf{Y}^{\text{rep}},\theta)\geq T(\mathbf{Y},\theta)\,|\,\mathbf{Y}\right)\), where the probability is taken over the joint posterior distribution of \(\theta\) and \(\mathbf{Y}^{\text{rep}}\).
Because \(p_B=\mathbb{E}\left[1_{T(\mathbf{Y}^{\text{rep}},\theta)\geq T(\mathbf{Y},\theta)}|\mathbf{Y}\right]\), a way to estimate this \(p\)-value is to simulate a sample \(\tilde\theta_1,\ldots,\tilde\theta_S\) from the posterior distribution and \(S\) replicas, \(\mathbf{Y}^{\text{rep}}_1,\ldots,\mathbf{Y}^{\text{rep}}_S\), from the posterior predictive distribution, and compute \(\hat p_B = \frac{1}{S}\sum_{s=1}^{S} 1_{T(\mathbf{Y}^{\text{rep}}_s,\tilde\theta_s)\geq T(\mathbf{Y},\tilde\theta_s)}\).
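A minimal Monte Carlo sketch of this estimate, reusing the same hypothetical conjugate normal setup as above and a chi-square-type discrepancy as test quantity (both are illustrative assumptions, not the course's chosen examples):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical conjugate normal model (known variance sigma2).
y_obs = rng.normal(loc=1.0, scale=2.0, size=40)
sigma2, mu0, tau0_2 = 4.0, 0.0, 10.0
n = y_obs.size
tau_n2 = 1.0 / (1.0 / tau0_2 + n / sigma2)
mu_n = tau_n2 * (mu0 / tau0_2 + y_obs.sum() / sigma2)

def T(y, theta):
    """Test quantity depending on both data and parameter:
    a chi-square-type discrepancy."""
    return np.sum((y - theta) ** 2) / sigma2

# For each posterior draw theta_s, simulate a replica and compare
# T(Y_rep_s, theta_s) with T(y_obs, theta_s).
S = 5_000
theta_post = rng.normal(mu_n, np.sqrt(tau_n2), size=S)
indicator = np.empty(S)
for s, theta_s in enumerate(theta_post):
    y_rep = rng.normal(theta_s, np.sqrt(sigma2), size=n)
    indicator[s] = T(y_rep, theta_s) >= T(y_obs, theta_s)

p_B = indicator.mean()   # Monte Carlo estimate of the Bayesian p-value
print(f"Bayesian p-value (Monte Carlo estimate): {p_B:.3f}")
```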
9.4. Marginal predictive checks#
We can estimate the marginal predictive distribution \(p(Y_i|\mathbf{Y})\) and use it to check the model observation by observation, which might help to find atypical data. That is, for each observation we can estimate \(p_i = \Pr\left(T(Y_i^{\text{rep}},\theta) \geq T(Y_i,\theta) \,|\, \mathbf{Y}\right)\).
In the case where \(Y_i\) is continuous, we can take \(T(Y_i,\theta)=Y_i\). If these marginal \(p\)-values concentrate near 0 and 1, the data are overdispersed compared with the model; on the other hand, if they concentrate around 0.5, the data have less dispersion than that implied by the model.
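A short sketch of these marginal checks, again under the hypothetical conjugate normal model used in the previous snippets (the model and values are assumptions made only to keep the example self-contained):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical conjugate normal model (known variance sigma2).
y_obs = rng.normal(loc=1.0, scale=2.0, size=40)
sigma2, mu0, tau0_2 = 4.0, 0.0, 10.0
n = y_obs.size
tau_n2 = 1.0 / (1.0 / tau0_2 + n / sigma2)
mu_n = tau_n2 * (mu0 / tau0_2 + y_obs.sum() / sigma2)

# Posterior draws of theta and one replica per draw.
S = 5_000
theta_post = rng.normal(mu_n, np.sqrt(tau_n2), size=S)
y_rep = rng.normal(theta_post[:, None], np.sqrt(sigma2), size=(S, n))

# One marginal p-value per observation, with T(Y_i, theta) = Y_i:
# the fraction of replicas that exceed the corresponding observed value.
p_marginal = (y_rep >= y_obs).mean(axis=0)

# Values piling up near 0/1 suggest the data are more dispersed than the
# model implies; values piling up near 0.5 suggest the opposite.
print(np.round(np.percentile(p_marginal, [5, 25, 50, 75, 95]), 2))
```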
Check the following notebooks in the course repository for examples of model checking.
10_HeightExampleGrid.ipynb. This example was taken from [McE18].
11_NewcombsLightSpeed.ipynb. This example was taken from [GCS+13].
12_CheckIndependenceBinom.ipynb. This example was taken from [GCS+13].
Statistical significance and practical significance
The objective of model checking is not to answer the question “do our data come from the assumed model?”, whose answer will almost always be no, but to quantify the dissimilarities between our data and the model, and to assess whether these dissimilarities could have occurred at random under the assumptions of the model.
If the model fails in an important aspect, we should consider changing the model; otherwise, we can ignore the failure if it does not affect the principal conclusions. The \(p\)-value measures “statistical significance”, not “practical significance”.