15. Hierarchical models#

15.1. General framework#

In their simplest version, hierarchical models assume that the parameters \(\theta_j\) are a random sample from the prior distribution, which is determined by a vector of hyperparameters \(\phi\), that is

\[p(\theta|\phi)=\prod_{j=1}^J p(\theta_j|\phi).\]

In general \(\phi\) is unknown and, thus, has its own (hyper)prior distribution \(p(\phi)\).

Thus, the joint prior distribution is

\[p(\theta,\phi)=p(\theta|\phi)p(\phi),\]

and the joint posterior distribution satisfies that

\[\begin{split} \begin{align*} p(\theta,\phi|\mathbf{Y}) & \propto p(\theta,\phi) p(\mathbf{Y}|\theta,\phi) \\ & = p(\theta,\phi) p(\mathbf{Y}|\theta) \\ & = p(\theta|\phi) p(\phi) p(\mathbf{Y}|\theta). \end{align*} \end{split}\]

When specifying a prior distribution for \(\phi\), we have to take special care when an improper distribution is used: we must check that the posterior distribution exists, which usually requires a tremendous mathematical effort. We can avoid this problem if we restrict the hyperprior to be a proper distribution.

15.1.1. Posterior conditional distribution of parameters, \(p(\theta|\phi,\mathbf{Y})\)#

Note that the posterior conditional of \(\theta\) satisfies

\[\begin{split} \begin{align*} p(\theta|\phi,\mathbf{Y}) &\propto p(\theta|\phi) p(\mathbf{Y}|\theta) \\ & = f(\phi,\mathbf{Y}) p(\theta|\phi) p(\mathbf{Y}|\theta). \end{align*} \end{split}\]

where \(f(\phi,\mathbf{Y})\) is the “constant” of proportionality, which depends on \(\phi\) and \(\mathbf{Y}\).

15.1.2. Posterior distribution of hyperparameters, \(p(\phi|\mathbf{Y})\)#

The posterior of \(\phi\) can be calculated by marginalizing the joint posterior with respect to \(\theta\), that is

\[p(\phi|\mathbf{Y})=\int p(\theta,\phi|\mathbf{Y})d\theta.\]

Or, using the definition of conditional probability:

\[\begin{split} \begin{align*} p(\phi|\mathbf{Y}) &= \frac{p(\theta,\phi|\mathbf{Y})}{p(\theta|\phi,\mathbf{Y})} \\\\ &\propto \frac{p(\theta|\phi) p(\phi) p(\mathbf{Y}|\theta)}{f(\phi,\mathbf{Y}) p(\theta|\phi) p(\mathbf{Y}|\theta)} \\\\ &\propto \frac{p(\phi)}{f(\phi,\mathbf{Y})} \end{align*} \end{split}\]

15.2. Rat tumor#

This example was taken from section 3.7 of [GCS+13]. The data for the example was taken from [Tar82]. The code with all the details is 24_RatTumorEmpiricalFullBayes.ipynb in the repository of the course.

As commented previously, in the evaluation of drugs for possible clinical application, studies are routinely performed on rodents. For a particular study the immediate aim is to estimate \(\theta\), the probability of tumor in a population of female laboratory rats that receive a zero dose of the drug (a control group). The data show that 4 out of 14 rats developed endometrial stromal polyps (a kind of tumor). It is natural to assume a binomial model for the number of tumors, given \(\theta\). For convenience, we select a prior distribution for \(\theta\) from the conjugate family, \(\theta\sim\textsf{Beta}(\alpha,\beta)\).

15.2.1. Empirical Bayes#

Using the historical data and the method of moments we can set some values for \(\alpha\) and \(\beta\). Since \(\theta\sim\textsf{Beta}(\alpha,\beta)\), then

\[\mathbb{E}(\theta)=\frac{\alpha}{\alpha+\beta}\text{ and }\mathbb{V}(\theta)=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.\]

Let \(\bar{Y}_j=Y_j/n_j\), let \(\bar{\bar Y}\) be the mean of \(\bar Y_1,\ldots,\bar Y_J\), and let \(s_{\bar{Y}}^2\) be their variance.

Then, we have to find the values of \(\alpha\) and \(\beta\) such that

\[\bar{\bar Y}=\frac{\alpha}{\alpha+\beta} \text{ and }s_{\bar{Y}}^2=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.\]

Solving these equations, we find that

\[\beta = \alpha\left(\frac{1-\bar{\bar Y}}{\bar{\bar Y}}\right)\text{ and }\alpha = \frac{\bar{\bar Y}^2(1-\bar{\bar Y})-\bar{\bar Y}s_{\bar{Y}}^2}{s_{\bar{Y}}^2}.\]

Using this simple estimate of the historical population distribution as a prior distribution for the current experiment yields a Beta(\(5.356,18.615\)) posterior distribution for \(\theta\).
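As a minimal sketch of this procedure (the arrays `y_hist` and `n_hist` below are hypothetical placeholders; the notebook loads the historical experiments from [Tar82]):

```python
import numpy as np
from scipy import stats

# Hypothetical placeholder arrays standing in for the historical
# tumor counts and group sizes loaded in the notebook.
y_hist = np.array([0, 2, 1, 5, 4])
n_hist = np.array([20, 13, 19, 46, 14])

rates = y_hist / n_hist
m, v = rates.mean(), rates.var(ddof=1)

# Method-of-moments estimates from the formulas above.
alpha = (m**2 * (1 - m) - m * v) / v
beta = alpha * (1 - m) / m

# Conjugate update with the current experiment: 4 tumors out of 14 rats.
posterior = stats.beta(alpha + 4, beta + 14 - 4)
print(posterior.mean(), posterior.interval(0.95))
```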

The next figure shows the posterior distribution of \(\theta\), with a vertical line at the observed proportion 4/14.

PosteriorThetaEmpirical

Note that the prior information has resulted in a posterior estimate substantially lower than the crude proportion 4/14.

15.2.2. Full Bayesian approach#

When we use the estimated values for \(\alpha\) and \(\beta\), we act as if those values were the real ones, which eliminates all the uncertainty that we have about the hyperparameters. Instead of this empirical approach, we can opt for a full Bayesian approach and assign a hyperprior to \(\alpha\) and \(\beta\). Assume that we have \(J\) historical experiments, and model

\[Y_j|\theta_j\sim\textsf{Binomial}(n_j,\theta_j),\]

with the number of rats, \(n_j\) known, and

\[\theta_j\sim\textsf{Beta}(\alpha,\beta).\]

We just need to specify a prior for \((\alpha,\beta)\), but we have to check that the posterior exists.

\[\begin{split} \begin{align*} p(\theta,\alpha,\beta|\mathbf{Y}) &\propto p(\alpha,\beta)p(\theta|\alpha,\beta)p(\mathbf{Y}|\theta) \\\\ &\propto p(\alpha,\beta)\prod_{j=1}^J \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta_j^{\alpha-1}(1-\theta_j)^{\beta-1}\prod_{j=1}^J \theta_j^{Y_j}(1-\theta_j)^{n_j-Y_j}. \end{align*} \end{split}\]

Thus,

\[p(\theta|\alpha,\beta,\mathbf{Y}) \propto \prod_{j=1}^J \theta_j^{\alpha+Y_j-1}(1-\theta_j)^{\beta+n_j-Y_j-1}.\]

That is, given \(\alpha\) and \(\beta\), the elements of \(\theta\) are independent and follow a beta distribution

\[\theta_j|\alpha,\beta,\mathbf{Y}\sim\textsf{Beta}(\alpha+Y_j,\beta+n_j-Y_j),\]

so

\[p(\theta|\alpha,\beta,\mathbf{Y}) = \prod_{j=1}^J \frac{\Gamma(\alpha+\beta+n_j)}{\Gamma(\alpha+Y_j)\Gamma(\beta+n_j-Y_j)}\theta_j^{\alpha+Y_j-1}(1-\theta_j)^{\beta+n_j-Y_j-1}.\]

Expressing \(p(\theta|\alpha,\beta,\mathbf{Y})\) as:

\[p(\theta|\alpha,\beta,\mathbf{Y}) = f(\alpha,\beta,\mathbf{Y})p(\theta|\alpha,\beta)p(\mathbf{Y}|\theta),\]

we get that

\[f(\alpha,\beta,\mathbf{Y}) = \prod_{j=1}^J \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}\frac{\Gamma(\alpha+\beta+n_j)}{\Gamma(\alpha+Y_j)\Gamma(\beta+n_j-Y_j)}\left[\binom {n_j}{Y_j}\right]^{-1}.\]

Then,

\[p(\alpha,\beta|\mathbf{Y})\propto p(\alpha,\beta)\prod_{j=1}^J\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\frac{\Gamma(\alpha+Y_j)\Gamma(\beta+n_j-Y_j)}{\Gamma(\alpha+\beta+n_j)}.\]

In [GCS+13], the authors proposed the noninformative prior \(p(\alpha,\beta)\propto(\alpha+\beta)^{-5/2}\), which yields a proper posterior distribution.
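A sketch of how this posterior could be evaluated and sampled on a grid; the arrays `y` and `n` are hypothetical toy data standing in for the historical experiments used in the notebook:

```python
import numpy as np
from scipy.special import gammaln

# Hypothetical toy data: y[j] tumor counts and n[j] group sizes.
y = np.array([0, 2, 1, 5, 4])
n = np.array([20, 13, 19, 46, 14])

def log_post(alpha, beta):
    """log p(alpha, beta | Y) up to an additive constant,
    with the hyperprior p(alpha, beta) ∝ (alpha + beta)^(-5/2)."""
    lp = -2.5 * np.log(alpha + beta)
    lp += np.sum(gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
                 + gammaln(alpha + y) + gammaln(beta + n - y)
                 - gammaln(alpha + beta + n))
    return lp

# Evaluate on a grid and draw hyperparameter samples by grid sampling.
alphas = np.linspace(0.1, 10, 200)
betas = np.linspace(0.1, 40, 200)
A, B = np.meshgrid(alphas, betas)
lp = np.vectorize(log_post)(A, B)
prob = np.exp(lp - lp.max())
prob /= prob.sum()

rng = np.random.default_rng(0)
idx = rng.choice(prob.size, size=2000, p=prob.ravel())
alpha_s, beta_s = A.ravel()[idx], B.ravel()[idx]
```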

The next figure shows the marginal posterior distributions of \(\alpha\) and \(\beta\); the vertical lines represent their empirical estimates. Note that in both cases they underestimate the plausible values of \(\alpha\) and \(\beta\).

MarginalPosteriorAlphaBeta

In the next figure I present the joint posterior of \(\alpha\) and \(\beta\). Note that the joint posterior is far from a Gaussian distribution; this is typical behavior for hyperparameters in hierarchical models.

JointPosteriorAlphaBeta

Finally, the next image shows credible intervals for all the \(\theta_j\), \(j=1,\ldots,J\), against their observed proportions. Note that cases where the proportion is low tend to be pulled upward and, vice versa, when the proportion is high the posterior estimate tends to move downward.

CredibleIntervalsTheta
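Continuing the previous sketch, the Beta form of \(p(\theta|\alpha,\beta,\mathbf{Y})\) makes it easy to turn the hyperparameter draws into credible intervals for each \(\theta_j\):

```python
# Continuing the previous sketch: given draws (alpha_s, beta_s), each
# theta_j is conditionally Beta(alpha + Y_j, beta + n_j - Y_j).
theta = rng.beta(alpha_s[:, None] + y, beta_s[:, None] + n - y)

# 95% equal-tailed credible intervals next to the crude proportions y/n.
lo, med, hi = np.percentile(theta, [2.5, 50, 97.5], axis=0)
print(np.column_stack([y / n, lo, med, hi]))
```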

15.3. Normal hierarchical model#

For this section, I recommend checking:

15.4. Parallel experiments in eight schools#

This example was taken from section 5.5 of [GCS+13]. The code with all the details is 25_EducationalTestingExample.ipynb in the repository of the course.

A study was performed for the Educational Testing Service to analyze the effects of special coaching programs on test scores. Separate randomized experiments were performed to estimate the effects of coaching programs for the SAT-V (Scholastic Aptitude Test-Verbal) in each of eight high schools. We consider the sampling variances known, for all practical purposes, because the sample sizes in all the eight experiments were relatively large.

15.4.1. Separate Estimates#

We can consider each experiment separately with the following model:

\[Y_j|\theta_j\sim\textsf{Normal}(\theta_j,\sigma_j^2)\]

and

\[p(\theta_j)\propto 1_{\mathbb{R}}(\theta_j),\]

so the posterior would be

\[\theta_j|Y_j\sim\textsf{Normal}(Y_j,\sigma_j^2).\]
SeparateModel

Treating each experiment separately yields 95% posterior intervals that overlap substantially.
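A minimal sketch of these separate analyses, using the observed effects and standard errors of the eight schools from [GCS+13]:

```python
import numpy as np

# Observed effects and standard errors for the eight schools ([GCS+13], Sec. 5.5).
y = np.array([28, 8, -3, 7, -1, 1, 18, 12])
sigma = np.array([15, 10, 16, 11, 9, 11, 10, 18])

# Separate analyses: theta_j | Y_j ~ Normal(Y_j, sigma_j^2), so the
# 95% posterior intervals are simply Y_j ± 1.96 * sigma_j.
print(np.column_stack([y - 1.96 * sigma, y + 1.96 * sigma]))
```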

15.4.2. Complete pooled estimate#

The general overlap in the posterior intervals based on independent analyses suggests that all experiments might be estimating the same quantity. Under the hypothesis that all experiments have the same effect and produce independent estimates of this common effect, we could treat the data as eight normally distributed observations:

\[Y_j\sim\textsf{Normal}(\theta,\sigma_j^2).\]

If we model \(p(\theta)\propto 1_{\mathbb{R}}(\theta)\) then \(\theta|\mathbf{Y}\sim\textsf{Normal}(\bar{Y}_{\cdot\cdot}, \varphi^2)\), where

\[\bar{Y}_{\cdot\cdot}=\frac{\sum_{j=1}^J \frac{1}{\sigma_j^2}Y_j}{\sum_{j=1}^J \frac{1}{\sigma_j^2}}\]

and

\[\varphi^2=\frac{1}{\sum_{j=1}^J\frac{1}{\sigma_j^2}}.\]
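A quick sketch of this computation with the same eight-schools data as above:

```python
import numpy as np

# Eight-schools data ([GCS+13], Sec. 5.5).
y = np.array([28, 8, -3, 7, -1, 1, 18, 12])
sigma = np.array([15, 10, 16, 11, 9, 11, 10, 18])

# Precision-weighted pooled mean and its posterior variance.
w = 1 / sigma**2
ybar = np.sum(w * y) / np.sum(w)
varphi2 = 1 / np.sum(w)
print(ybar, np.sqrt(varphi2))  # roughly 7.7 and 4.1
```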

To assess the goodness of fit of the model, we can consider the posterior predictive distribution, which would be given by

\[Y_j|\mathbf{Y}\sim\textsf{Normal}(\bar{Y}_{\cdot\cdot}, \sigma_j^2+\varphi^2).\]

PooledModel

From the previous figure, we can see that complete pooling might be a good model for these data.

15.4.3. Hierarchical model#

We now model the eight experiments jointly with the hierarchical model

\[\begin{split} \begin{align*} Y_j|\theta_j &\sim \textsf{Normal}(\theta_j,\sigma_j^2) \\ \theta_j|\mu,\tau &\sim \textsf{Normal}(\mu,\tau^2) \\ p(\mu,\tau) &\propto 1_{\mathbb{R}}(\mu)\,1_{(0,\infty)}(\tau). \end{align*} \end{split}\]

We get the MAP for \(\tau\) and calculate an approximate 95% credible interval using the asymptotic behavior of the posterior distribution. In the next figure we show the posterior distribution of \(\tau\), where we can observe that the MAP estimator corresponds to \(\tau=0\), that is, to the complete pooling model.

PosteriorTau
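The curve above can be reproduced from the marginal posterior of \(\tau\), which for this model is

\[p(\tau|\mathbf{Y})\propto p(\tau)\,V_\mu^{1/2}\prod_{j=1}^J(\sigma_j^2+\tau^2)^{-1/2}\exp\left(-\frac{(Y_j-\hat{\mu})^2}{2(\sigma_j^2+\tau^2)}\right),\]

where \(\hat{\mu}\) and \(V_\mu\) are the conditional posterior mean and variance of \(\mu\) given \(\tau\) (see [GCS+13], Sec. 5.4). A minimal sketch of its evaluation on a grid:

```python
import numpy as np

# Eight-schools data ([GCS+13], Sec. 5.5).
y = np.array([28, 8, -3, 7, -1, 1, 18, 12])
sigma = np.array([15, 10, 16, 11, 9, 11, 10, 18])

def log_post_tau(tau):
    """log p(tau | Y) up to a constant, for the uniform prior p(tau) ∝ 1."""
    v = sigma**2 + tau**2
    w = 1 / v
    mu_hat = np.sum(w * y) / np.sum(w)   # conditional posterior mean of mu
    V_mu = 1 / np.sum(w)                 # conditional posterior variance of mu
    return (0.5 * np.log(V_mu) - 0.5 * np.sum(np.log(v))
            - 0.5 * np.sum((y - mu_hat) ** 2 / v))

taus = np.linspace(0.01, 30, 500)
lp = np.array([log_post_tau(t) for t in taus])
dens = np.exp(lp - lp.max())
print(taus[np.argmax(dens)])  # the mode is at the left end: tau ≈ 0
```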

We obtain an estimate for \(\mu\) by evaluating \(\hat{\mu}\) at the MAP of \(\tau\). In the next figure I present its conditional posterior distribution, setting \(\tau=\tau^{\text{MAP}}\).

ConditionalPosteriorMu

The next plots show the effect of \(\tau\) on the posterior means of the treatment effects, \(\mathbb{E}(\theta_j|\sigma,\tau,\mathbf{Y})\), and on the posterior standard deviations of the treatment effects, \(\textsf{Std}(\theta_j|\sigma,\tau,\mathbf{Y})\).

PosteriorMean PosteriorStd
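A sketch of how these conditional moments could be computed: given \(\tau\), the mean is obtained by plugging \(\hat{\mu}\) into the conditional posterior mean of \(\theta_j\), and the variance picks up an extra term from the uncertainty in \(\mu\) (see [GCS+13], Sec. 5.4).

```python
import numpy as np

# Eight-schools data ([GCS+13], Sec. 5.5).
y = np.array([28, 8, -3, 7, -1, 1, 18, 12])
sigma = np.array([15, 10, 16, 11, 9, 11, 10, 18])

def cond_moments(tau):
    """Conditional posterior mean and sd of each theta_j given tau,
    with mu integrated out (mu replaced by mu-hat, plus its variance)."""
    v = sigma**2 + tau**2
    w = 1 / v
    mu_hat = np.sum(w * y) / np.sum(w)
    V_mu = 1 / np.sum(w)
    if tau == 0:  # complete pooling limit
        return np.full(len(y), mu_hat), np.full(len(y), np.sqrt(V_mu))
    prec = 1 / sigma**2 + 1 / tau**2
    mean = (y / sigma**2 + mu_hat / tau**2) / prec
    var = 1 / prec + ((1 / tau**2) / prec) ** 2 * V_mu
    return mean, np.sqrt(var)

for tau in (0, 5, 10, 25):
    print(tau, *cond_moments(tau))
```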

15.5. Meta-analysis#

Example taken from section 5.6 of [GCS+13]. The original data for the example are from [YPL+85]. The code with all the details is 26_MetaAnalysisCombiningStudies.ipynb in the repository of the course.

Meta-analysis is a process of summarizing and integrating the findings of research studies in a particular area.

The data in the example summarize mortality after myocardial infarction in 22 clinical trials, each consisting of two groups of heart attack patients randomly allocated to receive or not receive beta-blockers (a family of drugs that affect the central nervous system and can relax the heart muscles). The aim of a meta-analysis is to provide a combined analysis of the studies that indicates the overall strength of the evidence for a beneficial effect of the treatment under study.

If clinical trial \(j\) involves the use of \(n_{0j}\) subjects in the control group and \(n_{1j}\) in the treatment group, giving rise to \(y_{0j}\) and \(y_{1j}\) deaths in control and treatment groups, respectively, then the usual sampling model involves two independent binomial distributions with probabilities \(p_{0j}\) and \(p_{1j}\), respectively. Estimands of interest include the odds ratio

\[\rho_j=\frac{\frac{p_{1j}}{1-p_{1j}}}{\frac{p_{0j}}{1-p_{0j}}}.\]

However, for a number of reasons, including interpretability and the fact that its posterior distribution is close to normality, we concentrate on inference for the logarithm of the odds ratio, which we label \(\theta_j=\log\rho_j\).

For each study we calculate the empirical log odds ratio:

\[y_j=\log\left(\frac{y_{1j}}{n_{1j}-y_{1j}}\right)-\log\left(\frac{y_{0j}}{n_{0j}-y_{0j}}\right),\]

with approximate variance

\[\sigma_j^2=\frac{1}{y_{1j}}+\frac{1}{n_{1j}-y_{1j}}+\frac{1}{y_{0j}}+\frac{1}{n_{0j}-y_{0j}}.\]
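For instance, for the first three studies in the table below (a sketch; the notebook computes all 22 at once):

```python
import numpy as np

# First three studies from the table below (control and treated deaths/totals).
y0, n0 = np.array([3, 14, 11]), np.array([39, 116, 93])
y1, n1 = np.array([3, 7, 5]), np.array([38, 114, 69])

# Empirical log odds ratios and their approximate standard deviations.
yj = np.log(y1 / (n1 - y1)) - np.log(y0 / (n0 - y0))
sigma = np.sqrt(1/y1 + 1/(n1 - y1) + 1/y0 + 1/(n0 - y0))
print(np.column_stack([yj, sigma]))  # matches the LogOdds and Sigma columns
```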

For these data we do not consider the uncertainty in the variances; that is, we act as if the variances were known and equal to the observed values. Thus, we use the model:

\[\begin{split} \begin{align*} y_j|\theta_j,\sigma_j^2 &\sim \textsf{Normal}(\theta_j,\sigma_j^2) \\ \theta_j &\sim \textsf{Normal}(\mu,\tau^2) \\ p(\mu) &\propto 1_{\mathbb{R}}(\mu) \\ p(\tau) &\propto 1_{(0,\infty)}(\tau) \end{align*} \end{split}\]
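One way to fit this model is with a probabilistic programming library. A minimal sketch using PyMC (an assumption on my part; the course notebook may implement the sampler differently), with the data truncated to three studies for brevity:

```python
import numpy as np
import pymc as pm

# Log odds ratios and standard deviations; here only the first three
# rows of the table below, the notebook fits all 22 trials.
y = np.array([0.028, -0.741, -0.541])
sigma = np.array([0.850, 0.483, 0.565])

with pm.Model():
    mu = pm.Flat("mu")         # p(mu) ∝ 1 on the real line
    tau = pm.HalfFlat("tau")   # p(tau) ∝ 1 on (0, ∞)
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=len(y))
    pm.Normal("y_obs", mu=theta, sigma=sigma, observed=y)
    idata = pm.sample(2000, tune=2000, target_accept=0.95, random_seed=0)
```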

The next table presents the data, the posterior median for each \(\theta_j\), and the corresponding 95% equal-tailed interval (ETI).

| Study | Control deaths | Control total | Treated deaths | Treated total | Log odds | Sigma | ETI 2.5% | Median | ETI 97.5% |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 39 | 3 | 38 | 0.028 | 0.850 | -0.596 | -0.244 | 0.149 |
| 2 | 14 | 116 | 7 | 114 | -0.741 | 0.483 | -0.668 | -0.286 | 0.014 |
| 3 | 11 | 93 | 5 | 69 | -0.541 | 0.565 | -0.623 | -0.270 | 0.034 |
| 4 | 127 | 1520 | 102 | 1533 | -0.246 | 0.138 | -0.438 | -0.248 | -0.043 |
| 5 | 27 | 365 | 28 | 355 | 0.069 | 0.281 | -0.448 | -0.193 | 0.155 |
| 6 | 6 | 52 | 4 | 59 | -0.584 | 0.676 | -0.605 | -0.261 | 0.074 |
| 7 | 152 | 939 | 98 | 945 | -0.512 | 0.139 | -0.619 | -0.367 | -0.168 |
| 8 | 48 | 471 | 60 | 632 | -0.079 | 0.204 | -0.418 | -0.201 | 0.067 |
| 9 | 37 | 282 | 25 | 278 | -0.424 | 0.274 | -0.606 | -0.287 | -0.017 |
| 10 | 188 | 1921 | 138 | 1916 | -0.335 | 0.117 | -0.485 | -0.297 | -0.123 |
| 11 | 52 | 583 | 64 | 873 | -0.213 | 0.195 | -0.483 | -0.241 | 0.027 |
| 12 | 47 | 266 | 45 | 263 | -0.039 | 0.229 | -0.434 | -0.201 | 0.098 |
| 13 | 16 | 293 | 9 | 291 | -0.593 | 0.425 | -0.661 | -0.282 | 0.020 |
| 14 | 45 | 883 | 57 | 858 | 0.282 | 0.205 | -0.339 | -0.095 | 0.276 |
| 15 | 31 | 147 | 25 | 154 | -0.321 | 0.298 | -0.547 | -0.264 | 0.025 |
| 16 | 38 | 213 | 33 | 207 | -0.135 | 0.261 | -0.475 | -0.231 | 0.066 |
| 17 | 12 | 122 | 28 | 251 | 0.141 | 0.364 | -0.464 | -0.203 | 0.165 |
| 18 | 6 | 154 | 8 | 151 | 0.322 | 0.553 | -0.523 | -0.217 | 0.182 |
| 19 | 3 | 134 | 6 | 174 | 0.444 | 0.717 | -0.544 | -0.225 | 0.151 |
| 20 | 40 | 218 | 32 | 209 | -0.218 | 0.260 | -0.510 | -0.240 | 0.044 |
| 21 | 43 | 364 | 27 | 391 | -0.591 | 0.257 | -0.669 | -0.322 | -0.087 |
| 22 | 39 | 674 | 22 | 680 | -0.608 | 0.272 | -0.662 | -0.317 | -0.063 |

The next table shows the posterior median and ETI for \(\theta,\mu,\exp\{\mu\}\) and \(\tau\).

| Parameter | ETI 2.5% | Median | ETI 97.5% |
|---|---|---|---|
| \(\theta\) | -0.626 | -0.249 | 0.135 |
| \(\mu\) | -0.376 | -0.248 | -0.115 |
| \(\exp\{\mu\}\) | 0.687 | 0.78 | 0.892 |
| \(\tau\) | 0.047 | 0.139 | 0.308 |

In the original published discussion of these data, it was remarked that they present an ‘unusually narrow range of uncertainty’. The Bayesian analysis suggests that this was due to the use of an inappropriate model that had the effect of claiming all the studies were identical. In mathematical terms, complete pooling assumes that the parameter \(\tau\) is exactly zero, whereas the data supply evidence that \(\tau\) might be close to zero, but might also plausibly be as high as 0.3. The next table shows the posterior median and ETI of \(\mu\) when \(\tau=0\).

| Parameter | ETI 2.5% | Median | ETI 97.5% |
|---|---|---|---|
| \(\mu\vert\tau=0\) | -0.359 | -0.26 | -0.161 |
| \(\exp\{\mu\}\vert\tau=0\) | 0.699 | 0.771 | 0.851 |

We make a scatterplot of the crude effect estimates versus the posterior median effect estimates for the 22 studies; the color of each point depends on the sample size of the study. We observe that the studies with the smallest sample sizes are partially pooled the most toward the mean \(\mu\).

MetaAnalysis