15. Hierarchical models#

15.1. General framework#

In their simplest version, hierarchical models assume that the parameters \(\theta_j\) are a random sample from the prior distribution, which is determined by a vector of hyperparameters \(\phi\), that is

\[p(\theta|\phi)=\prod_{j=1}^J p(\theta_j|\phi).\]

In general \(\phi\) is unknown and, thus, has its own (hyper)prior distribution \(p(\phi)\).

Thus, the joint prior distribution is

\[p(\theta,\phi)=p(\theta|\phi)p(\phi),\]

and the joint posterior distribution satisfies that

\[\begin{split} \begin{align*} p(\theta,\phi|\mathbf{Y}) & \propto p(\theta,\phi) p(\mathbf{Y}|\theta,\phi) \\ & = p(\theta,\phi) p(\mathbf{Y}|\theta) \\ & = p(\theta|\phi) p(\phi) p(\mathbf{Y}|\theta). \end{align*} \end{split}\]

When specifying a prior distribution for \(\phi\), we have to take special care when an improper distribution is used: we must check that the posterior distribution exists, which usually requires a tremendous mathematical effort. We can avoid this problem if we restrict the hyperprior to be a proper distribution.

15.1.1. Posterior conditional distribution of parameters, \(p(\theta|\phi,\mathbf{Y})\)#

Note that the posterior conditional of \(\theta\) satisfies

\[\begin{split} \begin{align*} p(\theta|\phi,\mathbf{Y}) &\propto p(\theta|\phi) p(\mathbf{Y}|\theta) \\ & = f(\phi,\mathbf{Y}) p(\theta|\phi) p(\mathbf{Y}|\theta). \end{align*} \end{split}\]

where \(f(\phi,\mathbf{Y})\) is the “constant” of proportionality, which depends on \(\phi\) and \(\mathbf{Y}\).

15.1.2. Posterior distribution of hyperparameters, \(p(\phi|\mathbf{Y})\)#

The posterior of \(\phi\) can be calculated by marginalizing the joint posterior with respect to \(\theta\), that is

\[p(\phi|\mathbf{Y})=\int p(\theta,\phi|\mathbf{Y})d\theta.\]

Or, using the definition of conditional probability:

\[\begin{split} \begin{align*} p(\phi|\mathbf{Y}) &= \frac{p(\theta,\phi|\mathbf{Y})}{p(\theta|\phi,\mathbf{Y})} \\\\ &\propto \frac{p(\theta|\phi) p(\phi) p(\mathbf{Y}|\theta)}{f(\phi,\mathbf{Y}) p(\theta|\phi) p(\mathbf{Y}|\theta)} \\\\ &\propto \frac{p(\phi)}{f(\phi,\mathbf{Y})} \end{align*} \end{split}\]

15.2. Rat tumor#

This example was taken from section 3.7 of [GCS+13]. The data for the example was taken from [Tar82]. The code with all the details is 24_RatTumorEmpiricalFullBayes.ipynb in the repository of the course.

As commented previously, in the evaluation of drugs for possible clinical application, studies are routinely performed on rodents. For a particular study the immediate aim is to estimate \(\theta\), the probability of tumor in a population of female laboratory rats that receive a zero dose of the drug (a control group). The data show that 4 out of 14 rats developed endometrial stromal polyps (a kind of tumor). It is natural to assume a binomial model for the number of tumors, given \(\theta\). For convenience, we select a prior distribution for \(\theta\) from the conjugate family, \(\theta\sim\textsf{Beta}(\alpha,\beta)\).

15.2.1. Empirical Bayes#

Using the historical data and the method of moments we can set some values for \(\alpha\) and \(\beta\). Since \(\theta\sim\textsf{Beta}(\alpha,\beta)\), then

\[\mathbb{E}(\theta)=\frac{\alpha}{\alpha+\beta}\text{ and }\mathbb{V}(\theta)=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.\]

Let \(\bar{Y}_j=Y_j/n_j\), let \(\bar{\bar Y}\) be the mean of \(\bar Y_1,\ldots,\bar Y_J\), and let \(s_{\bar{Y}}^2\) be their variance.

Then, we have to find the values of \(\alpha\) and \(\beta\) such that

\[\bar{\bar Y}=\frac{\alpha}{\alpha+\beta} \text{ and }s_{\bar{Y}}^2=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.\]

Solving these equations, we find that

\[\beta = \alpha\left(\frac{1-\bar{\bar Y}}{\bar{\bar Y}}\right)\text{ and }\alpha = \frac{\bar{\bar Y}^2(1-\bar{\bar Y})-\bar{\bar Y}s_{\bar{Y}}^2}{s_{\bar{Y}}^2}.\]

Using this simple estimate of the historical population distribution as a prior distribution for the current experiment yields a Beta(\(5.356,18.615\)) posterior distribution for \(\theta\).
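As a minimal sketch of this procedure (the arrays `y_hist` and `n_hist` below are hypothetical placeholders; the notebook loads the historical experiments from [Tar82]):

```python
import numpy as np
from scipy import stats

# Hypothetical placeholder arrays standing in for the historical
# tumor counts and group sizes loaded in the notebook.
y_hist = np.array([0, 2, 1, 5, 4])
n_hist = np.array([20, 13, 19, 46, 14])

rates = y_hist / n_hist
m, v = rates.mean(), rates.var(ddof=1)

# Method-of-moments estimates from the formulas above.
alpha = (m**2 * (1 - m) - m * v) / v
beta = alpha * (1 - m) / m

# Conjugate update with the current experiment: 4 tumors out of 14 rats.
posterior = stats.beta(alpha + 4, beta + 14 - 4)
print(posterior.mean(), posterior.interval(0.95))
```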

The next figure shows the posterior distribution of \(\theta\), with a vertical line at the observed proportion 4/14.

PosteriorThetaEmpirical

Note that the prior information has resulted in a posterior estimate substantially lower than the crude proportion 4/14.

15.2.2. Full Bayesian approach#

When we use the estimated values for \(\alpha\) and \(\beta\), we act as if those values were the real ones, which eliminates all the uncertainty that we have about the hyperparameters. Instead of this empirical approach, we can opt for a full Bayesian approach and assign a hyperprior to \(\alpha\) and \(\beta\). Assume that we have \(J\) historical experiments, and model

\[Y_j|\theta_j\sim\textsf{Binomial}(n_j,\theta_j),\]

with the number of rats, \(n_j\) known, and

\[\theta_j\sim\textsf{Beta}(\alpha,\beta).\]

We just need to specify a prior for \((\alpha,\beta)\), but we have to check that the posterior exists.

\[\begin{split} \begin{align*} p(\theta,\alpha,\beta|\mathbf{Y}) &\propto p(\alpha,\beta)p(\theta|\alpha,\beta)p(\mathbf{Y}|\theta) \\\\ &\propto p(\alpha,\beta)\prod_{j=1}^J \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta_j^{\alpha-1}(1-\theta_j)^{\beta-1}\prod_{j=1}^J \theta_j^{Y_j}(1-\theta_j)^{n_j-Y_j}. \end{align*} \end{split}\]

Thus,

\[p(\theta|\alpha,\beta,\mathbf{Y}) \propto \prod_{j=1}^J \theta_j^{\alpha+Y_j-1}(1-\theta_j)^{\beta+n_j-Y_j-1}.\]

That is, given \(\alpha\) and \(\beta\), the elements of \(\theta\) are independent and follow a beta distribution

\[\theta_j|\alpha,\beta,\mathbf{Y}\sim\textsf{Beta}(\alpha+Y_j,\beta+n_j-Y_j),\]

so

\[p(\theta|\alpha,\beta,\mathbf{Y}) = \prod_{j=1}^J \frac{\Gamma(\alpha+\beta+n_j)}{\Gamma(\alpha+Y_j)\Gamma(\beta+n_j-Y_j)}\theta_j^{\alpha+Y_j-1}(1-\theta_j)^{\beta+n_j-Y_j-1}.\]

Expressing \(p(\theta|\alpha,\beta,\mathbf{Y})\) as:

\[p(\theta|\alpha,\beta,\mathbf{Y}) = f(\alpha,\beta,\mathbf{Y})p(\theta|\alpha,\beta)p(\mathbf{Y}|\theta),\]

we get that

\[f(\alpha,\beta,\mathbf{Y}) = \prod_{j=1}^J \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}\frac{\Gamma(\alpha+\beta+n_j)}{\Gamma(\alpha+Y_j)\Gamma(\beta+n_j-Y_j)}\left[\binom {n_j}{Y_j}\right]^{-1}.\]

Then,

\[p(\alpha,\beta|\mathbf{Y})\propto p(\alpha,\beta)\prod_{j=1}^J\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\frac{\Gamma(\alpha+Y_j)\Gamma(\beta+n_j-Y_j)}{\Gamma(\alpha+\beta+n_j)}.\]

In [GCS+13], the authors proposed the noninformative prior \(p(\alpha,\beta)\propto(\alpha+\beta)^{-5/2}\), which yields a proper posterior distribution.
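A sketch of how this posterior could be evaluated and sampled on a grid; the arrays `y` and `n` are hypothetical toy data standing in for the historical experiments used in the notebook:

```python
import numpy as np
from scipy.special import gammaln

# Hypothetical toy data: y[j] tumor counts and n[j] group sizes.
y = np.array([0, 2, 1, 5, 4])
n = np.array([20, 13, 19, 46, 14])

def log_post(alpha, beta):
    """log p(alpha, beta | Y) up to an additive constant,
    with the hyperprior p(alpha, beta) ∝ (alpha + beta)^(-5/2)."""
    lp = -2.5 * np.log(alpha + beta)
    lp += np.sum(gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
                 + gammaln(alpha + y) + gammaln(beta + n - y)
                 - gammaln(alpha + beta + n))
    return lp

# Evaluate on a grid and draw hyperparameter samples by grid sampling.
alphas = np.linspace(0.1, 10, 200)
betas = np.linspace(0.1, 40, 200)
A, B = np.meshgrid(alphas, betas)
lp = np.vectorize(log_post)(A, B)
prob = np.exp(lp - lp.max())
prob /= prob.sum()

rng = np.random.default_rng(0)
idx = rng.choice(prob.size, size=2000, p=prob.ravel())
alpha_s, beta_s = A.ravel()[idx], B.ravel()[idx]
```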

The next figure shows the marginal posterior distributions of \(\alpha\) and \(\beta\); the vertical lines represent their empirical estimates. Note that in both cases they underestimate the plausible values of \(\alpha\) and \(\beta\).

MarginalPosteriorAlphaBeta

In the next figure I present the joint posterior of \(\alpha\) and \(\beta\). Note that the joint posterior is far from a Gaussian distribution; this is typical behavior for hyperparameters in hierarchical models.

JointPosteriorAlphaBeta

Finally, the next image shows credible intervals for all the \(\theta_j\), \(j=1,\ldots,J\), against their observed proportions. Note that cases where the proportion is low tend to be pulled upward and, vice versa, when the proportion is high the posterior estimate tends to move downward.

CredibleIntervalsTheta
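Continuing the previous sketch, the Beta form of \(p(\theta|\alpha,\beta,\mathbf{Y})\) makes it easy to turn the hyperparameter draws into credible intervals for each \(\theta_j\):

```python
# Continuing the previous sketch: given draws (alpha_s, beta_s), each
# theta_j is conditionally Beta(alpha + Y_j, beta + n_j - Y_j).
theta = rng.beta(alpha_s[:, None] + y, beta_s[:, None] + n - y)

# 95% equal-tailed credible intervals next to the crude proportions y/n.
lo, med, hi = np.percentile(theta, [2.5, 50, 97.5], axis=0)
print(np.column_stack([y / n, lo, med, hi]))
```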

15.3. Normal hierarchical model#

For this section, I recommend checking:

15.4. Parallel experiments in eight schools#

This example was taken from section 5.5 of [GCS+13]. The code with all the details is 25_EducationalTestingExample.ipynb in the repository of the course.

A study was performed for the Educational Testing Service to analyze the effects of special coaching programs on test scores. Separate randomized experiments were performed to estimate the effects of coaching programs for the SAT-V (Scholastic Aptitude Test-Verbal) in each of eight high schools. We consider the sampling variances known, for all practical purposes, because the sample sizes in all the eight experiments were relatively large.

15.4.1. Separate Estimates#

We can consider each experiment separately with the following model:

\[Y_j|\theta_j\sim\textsf{Normal}(\theta_j,\sigma_j^2)\]

and

\[p(\theta_j)\propto 1_{\mathbb{R}}(\theta_j),\]

so the posterior would be

\[\theta_j|Y_j\sim\textsf{Normal}(Y_j,\sigma_j^2).\]
SeparateModel

Treating each experiment separately yields 95% posterior intervals that overlap substantially.
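A minimal sketch of these separate analyses, using the observed effects and standard errors of the eight schools from [GCS+13]:

```python
import numpy as np

# Observed effects and standard errors for the eight schools ([GCS+13], Sec. 5.5).
y = np.array([28, 8, -3, 7, -1, 1, 18, 12])
sigma = np.array([15, 10, 16, 11, 9, 11, 10, 18])

# Separate analyses: theta_j | Y_j ~ Normal(Y_j, sigma_j^2), so the
# 95% posterior intervals are simply Y_j ± 1.96 * sigma_j.
print(np.column_stack([y - 1.96 * sigma, y + 1.96 * sigma]))
```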

15.4.2. Complete pooled estimate#

The general overlap in the posterior intervals based on independent analyses suggests that all experiments might be estimating the same quantity. Under the hypothesis that all experiments have the same effect and produce independent estimates of this common effect, we could treat the data as eight normally distributed observations:

\[Y_j\sim\textsf{Normal}(\theta,\sigma_j^2).\]

If we model \(p(\theta)\propto 1_{\mathbb{R}}(\theta)\) then \(\theta|\mathbf{Y}\sim\textsf{Normal}(\bar{Y}_{\cdot\cdot}, \varphi^2)\), where

\[\bar{Y}_{\cdot\cdot}=\frac{\sum_{j=1}^J \frac{1}{\sigma_j^2}Y_j}{\sum_{j=1}^J \frac{1}{\sigma_j^2}}\]

and

\[\varphi^2=\frac{1}{\sum_{j=1}^J\frac{1}{\sigma_j^2}}.\]
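A quick sketch of this computation with the same eight-schools data as above:

```python
import numpy as np

# Eight-schools data ([GCS+13], Sec. 5.5).
y = np.array([28, 8, -3, 7, -1, 1, 18, 12])
sigma = np.array([15, 10, 16, 11, 9, 11, 10, 18])

# Precision-weighted pooled mean and its posterior variance.
w = 1 / sigma**2
ybar = np.sum(w * y) / np.sum(w)
varphi2 = 1 / np.sum(w)
print(ybar, np.sqrt(varphi2))  # roughly 7.7 and 4.1
```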

To assess the goodness of fit of the model, we can consider the posterior predictive distribution, which would be given by

\[Y_j|\mathbf{Y}\sim\textsf{Normal}(\bar{Y}_{\cdot\cdot}, \sigma_j^2+\varphi^2).\]

PooledModel

From the previous figure, we can see that complete pooling might be a good model for these data.

15.4.3. Hierarchical model#

We now model the eight experiments jointly with the hierarchical model

\[\begin{split} \begin{align*} Y_j|\theta_j &\sim \textsf{Normal}(\theta_j,\sigma_j^2) \\ \theta_j|\mu,\tau &\sim \textsf{Normal}(\mu,\tau^2) \\ p(\mu,\tau) &\propto 1_{\mathbb{R}}(\mu)\,1_{(0,\infty)}(\tau). \end{align*} \end{split}\]

We get the MAP for \(\tau\) and calculate an approximate 95% credible interval using the asymptotic behavior of the posterior distribution. In the next figure we show the posterior distribution of \(\tau\), where we can observe that the MAP estimator corresponds to \(\tau=0\), that is, to the complete pooling model.

PosteriorTau
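The curve above can be reproduced from the marginal posterior of \(\tau\), which for this model is

\[p(\tau|\mathbf{Y})\propto p(\tau)\,V_\mu^{1/2}\prod_{j=1}^J(\sigma_j^2+\tau^2)^{-1/2}\exp\left(-\frac{(Y_j-\hat{\mu})^2}{2(\sigma_j^2+\tau^2)}\right),\]

where \(\hat{\mu}\) and \(V_\mu\) are the conditional posterior mean and variance of \(\mu\) given \(\tau\) (see [GCS+13], Sec. 5.4). A minimal sketch of its evaluation on a grid:

```python
import numpy as np

# Eight-schools data ([GCS+13], Sec. 5.5).
y = np.array([28, 8, -3, 7, -1, 1, 18, 12])
sigma = np.array([15, 10, 16, 11, 9, 11, 10, 18])

def log_post_tau(tau):
    """log p(tau | Y) up to a constant, for the uniform prior p(tau) ∝ 1."""
    v = sigma**2 + tau**2
    w = 1 / v
    mu_hat = np.sum(w * y) / np.sum(w)   # conditional posterior mean of mu
    V_mu = 1 / np.sum(w)                 # conditional posterior variance of mu
    return (0.5 * np.log(V_mu) - 0.5 * np.sum(np.log(v))
            - 0.5 * np.sum((y - mu_hat) ** 2 / v))

taus = np.linspace(0.01, 30, 500)
lp = np.array([log_post_tau(t) for t in taus])
dens = np.exp(lp - lp.max())
print(taus[np.argmax(dens)])  # the mode is at the left end: tau ≈ 0
```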

We obtain an estimate for \(\mu\) by evaluating \(\hat{\mu}\) at the MAP of \(\tau\). In the next figure I present its conditional posterior distribution, setting \(\tau=\tau^{\text{MAP}}\).

ConditionalPosteriorMu

The next plots show the effect of \(\tau\) on the posterior means of the treatment effects, \(\mathbb{E}(\theta_j|\sigma,\tau,\mathbf{Y})\), and on the posterior standard deviations of the treatment effects, \(\textsf{Std}(\theta_j|\sigma,\tau,\mathbf{Y})\).

PosteriorMean PosteriorStd
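A sketch of how these conditional moments could be computed: given \(\tau\), the mean is obtained by plugging \(\hat{\mu}\) into the conditional posterior mean of \(\theta_j\), and the variance picks up an extra term from the uncertainty in \(\mu\) (see [GCS+13], Sec. 5.4).

```python
import numpy as np

# Eight-schools data ([GCS+13], Sec. 5.5).
y = np.array([28, 8, -3, 7, -1, 1, 18, 12])
sigma = np.array([15, 10, 16, 11, 9, 11, 10, 18])

def cond_moments(tau):
    """Conditional posterior mean and sd of each theta_j given tau,
    with mu integrated out (mu replaced by mu-hat, plus its variance)."""
    v = sigma**2 + tau**2
    w = 1 / v
    mu_hat = np.sum(w * y) / np.sum(w)
    V_mu = 1 / np.sum(w)
    if tau == 0:  # complete pooling limit
        return np.full(len(y), mu_hat), np.full(len(y), np.sqrt(V_mu))
    prec = 1 / sigma**2 + 1 / tau**2
    mean = (y / sigma**2 + mu_hat / tau**2) / prec
    var = 1 / prec + ((1 / tau**2) / prec) ** 2 * V_mu
    return mean, np.sqrt(var)

for tau in (0, 5, 10, 25):
    print(tau, *cond_moments(tau))
```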

15.5. Meta-analysis#

Example taken from section 5.6 of [GCS+13]. The original data for the example are from [YPL+85]. The code with all the details is 26_MetaAnalysisCombiningStudies.ipynb in the repository of the course.

Meta-analysis is a process of summarizing and integrating the findings of research studies in a particular area.

The data in the example summarize mortality after myocardial infarction in 22 clinical trials, each consisting of two groups of heart attack patients randomly allocated to receive or not receive beta-blockers (a family of drugs that affect the central nervous system and can relax the heart muscles). The aim of a meta-analysis is to provide a combined analysis of the studies that indicates the overall strength of the evidence for a beneficial effect of the treatment under study.

If clinical trial \(j\) involves the use of \(n_{0j}\) subjects in the control group and \(n_{1j}\) in the treatment group, giving rise to \(y_{0j}\) and \(y_{1j}\) deaths in control and treatment groups, respectively, then the usual sampling model involves two independent binomial distributions with probabilities \(p_{0j}\) and \(p_{1j}\), respectively. Estimands of interest include the odds ratio

\[\rho_j=\frac{\frac{p_{1j}}{1-p_{1j}}}{\frac{p_{0j}}{1-p_{0j}}}.\]

However, for a number of reasons, including interpretability and the fact that its posterior distribution is close to normality, we concentrate on inference for the logarithm of the odds ratio, which we label \(\theta_j=\log\rho_j\).

For each study we calculate the empirical log odds ratio:

\[y_j=\log\left(\frac{y_{1j}}{n_{1j}-y_{1j}}\right)-\log\left(\frac{y_{0j}}{n_{0j}-y_{0j}}\right),\]

with approximate variance

\[\sigma_j^2=\frac{1}{y_{1j}}+\frac{1}{n_{1j}-y_{1j}}+\frac{1}{y_{0j}}+\frac{1}{n_{0j}-y_{0j}}.\]
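For instance, for the first three studies in the table below (a sketch; the notebook computes all 22 at once):

```python
import numpy as np

# First three studies from the table below (control and treated deaths/totals).
y0, n0 = np.array([3, 14, 11]), np.array([39, 116, 93])
y1, n1 = np.array([3, 7, 5]), np.array([38, 114, 69])

# Empirical log odds ratios and their approximate standard deviations.
yj = np.log(y1 / (n1 - y1)) - np.log(y0 / (n0 - y0))
sigma = np.sqrt(1/y1 + 1/(n1 - y1) + 1/y0 + 1/(n0 - y0))
print(np.column_stack([yj, sigma]))  # matches the LogOdds and Sigma columns
```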

For these data we do not consider the uncertainty in the variances; that is, we act as if the variances were known and equal to the observed values. Thus, we use the model:

\[\begin{split} \begin{align*} y_j|\theta_j,\sigma_j^2 &\sim \textsf{Normal}(\theta_j,\sigma_j^2) \\ \theta_j &\sim \textsf{Normal}(\mu,\tau^2) \\ p(\mu) &\propto 1_{\mathbb{R}}(\mu) \\ p(\tau) &\propto 1_{(0,\infty)}(\tau) \end{align*} \end{split}\]
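One way to fit this model is with a probabilistic programming library. A minimal sketch using PyMC (an assumption on my part; the course notebook may implement the sampler differently), with the data truncated to three studies for brevity:

```python
import numpy as np
import pymc as pm

# Log odds ratios and standard deviations; here only the first three
# rows of the table below, the notebook fits all 22 trials.
y = np.array([0.028, -0.741, -0.541])
sigma = np.array([0.850, 0.483, 0.565])

with pm.Model():
    mu = pm.Flat("mu")         # p(mu) ∝ 1 on the real line
    tau = pm.HalfFlat("tau")   # p(tau) ∝ 1 on (0, ∞)
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=len(y))
    pm.Normal("y_obs", mu=theta, sigma=sigma, observed=y)
    idata = pm.sample(2000, tune=2000, target_accept=0.95, random_seed=0)
```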

The next table presents the data, the posterior median for each \(\theta_j\), and the corresponding 95% equal-tailed interval (ETI).

| Study | Control deaths | Control total | Treated deaths | Treated total | Log odds | Sigma | ETI 2.5% | Median | ETI 97.5% |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 39 | 3 | 38 | 0.028 | 0.850 | -0.596 | -0.244 | 0.149 |
| 2 | 14 | 116 | 7 | 114 | -0.741 | 0.483 | -0.668 | -0.286 | 0.014 |
| 3 | 11 | 93 | 5 | 69 | -0.541 | 0.565 | -0.623 | -0.270 | 0.034 |
| 4 | 127 | 1520 | 102 | 1533 | -0.246 | 0.138 | -0.438 | -0.248 | -0.043 |
| 5 | 27 | 365 | 28 | 355 | 0.069 | 0.281 | -0.448 | -0.193 | 0.155 |
| 6 | 6 | 52 | 4 | 59 | -0.584 | 0.676 | -0.605 | -0.261 | 0.074 |
| 7 | 152 | 939 | 98 | 945 | -0.512 | 0.139 | -0.619 | -0.367 | -0.168 |
| 8 | 48 | 471 | 60 | 632 | -0.079 | 0.204 | -0.418 | -0.201 | 0.067 |
| 9 | 37 | 282 | 25 | 278 | -0.424 | 0.274 | -0.606 | -0.287 | -0.017 |
| 10 | 188 | 1921 | 138 | 1916 | -0.335 | 0.117 | -0.485 | -0.297 | -0.123 |
| 11 | 52 | 583 | 64 | 873 | -0.213 | 0.195 | -0.483 | -0.241 | 0.027 |
| 12 | 47 | 266 | 45 | 263 | -0.039 | 0.229 | -0.434 | -0.201 | 0.098 |
| 13 | 16 | 293 | 9 | 291 | -0.593 | 0.425 | -0.661 | -0.282 | 0.020 |
| 14 | 45 | 883 | 57 | 858 | 0.282 | 0.205 | -0.339 | -0.095 | 0.276 |
| 15 | 31 | 147 | 25 | 154 | -0.321 | 0.298 | -0.547 | -0.264 | 0.025 |
| 16 | 38 | 213 | 33 | 207 | -0.135 | 0.261 | -0.475 | -0.231 | 0.066 |
| 17 | 12 | 122 | 28 | 251 | 0.141 | 0.364 | -0.464 | -0.203 | 0.165 |
| 18 | 6 | 154 | 8 | 151 | 0.322 | 0.553 | -0.523 | -0.217 | 0.182 |
| 19 | 3 | 134 | 6 | 174 | 0.444 | 0.717 | -0.544 | -0.225 | 0.151 |
| 20 | 40 | 218 | 32 | 209 | -0.218 | 0.260 | -0.510 | -0.240 | 0.044 |
| 21 | 43 | 364 | 27 | 391 | -0.591 | 0.257 | -0.669 | -0.322 | -0.087 |
| 22 | 39 | 674 | 22 | 680 | -0.608 | 0.272 | -0.662 | -0.317 | -0.063 |

The next table shows the posterior median and ETI for \(\theta,\mu,\exp\{\mu\}\) and \(\tau\).

| Parameter | ETI 2.5% | Median | ETI 97.5% |
|---|---|---|---|
| \(\theta\) | -0.626 | -0.249 | 0.135 |
| \(\mu\) | -0.376 | -0.248 | -0.115 |
| \(\exp\{\mu\}\) | 0.687 | 0.78 | 0.892 |
| \(\tau\) | 0.047 | 0.139 | 0.308 |

In the original published discussion of these data, it was remarked that they present an ‘unusually narrow range of uncertainty’. The Bayesian analysis suggests that this was due to the use of an inappropriate model that had the effect of claiming all the studies were identical. In mathematical terms, complete pooling assumes that the parameter \(\tau\) is exactly zero, whereas the data supply evidence that \(\tau\) might be close to zero, but might also plausibly be as high as 0.3. The next table shows the posterior median and ETI of \(\mu\) when \(\tau=0\).

| Parameter | ETI 2.5% | Median | ETI 97.5% |
|---|---|---|---|
| \(\mu\vert\tau=0\) | -0.359 | -0.26 | -0.161 |
| \(\exp\{\mu\}\vert\tau=0\) | 0.699 | 0.771 | 0.851 |

We make a scatterplot of the crude effect estimates versus the posterior median effect estimates for the 22 studies; the color of each point depends on the sample size of the study. We observe that the studies with the smallest sample sizes are partially pooled the most toward the mean \(\mu\).

MetaAnalysis