3. Conjugate analysis#
When the posterior distribution belongs to the same parametric family as the prior distribution, we say that we have a conjugate model.
Formally, if \(p(Y|\theta)\in\mathcal{F}\) and \(\mathcal{P}\) is a family of prior distributions for \(\theta\), then \(\mathcal{P}\) is conjugate for \(\mathcal{F}\) if \(p(\theta|\mathbf{Y})\in\mathcal{P}\) for all \(p(\cdot|\theta)\in\mathcal{F}\) and \(p(\cdot)\in\mathcal{P}\).
3.1. Conjugate models for the exponential family#
When the likelihood of the data belongs to the exponential family of distributions, it is possible to obtain the form of the conjugate prior.
Let \(p(Y|\theta)\in\mathcal{F}\), where \(\mathcal{F}\) is the exponential family of distributions. Then, by the Fisher–Neyman factorization theorem, \(p(Y|\theta)\) can be expressed as:

\[
p(y|\theta) = h(y)\,c(\theta)\exp\left\{\phi(\theta)^{\textsf{T}} u(y)\right\},
\]
where \(\phi(\theta)\) and \(u(y)\) are, in general, of the same dimension as \(\theta\), and \(\phi(\theta)\) is called the natural parameter of the family \(\mathcal{F}\).
If \(Y_1,\ldots,Y_n\) are independent and identically distributed random variables, then

\[
p(\mathbf{y}|\theta) = \left(\prod_{i=1}^n h(y_i)\right) c(\theta)^n \exp\left\{\phi(\theta)^{\textsf{T}} t(\mathbf{y})\right\},
\]
where \(t(\mathbf{y})=\sum_{i=1}^n u(y_i)\) is called the minimal sufficient statistic.
If we consider a prior of the form

\[
p(\theta) \propto c(\theta)^{\eta} \exp\left\{\phi(\theta)^{\textsf{T}} \xi\right\},
\]

then

\[
p(\theta|\mathbf{y}) \propto c(\theta)^{\eta+n} \exp\left\{\phi(\theta)^{\textsf{T}} \left(\xi + t(\mathbf{y})\right)\right\},
\]

which belongs to the same family as the prior.
Moreover, note that, due to this structure, we can interpret \(\eta\) as the “size” of the sample a priori and \(\xi\) as the “minimal sufficient statistic” a priori.
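The following is a minimal Python sketch of this generic update, \((\eta,\xi)\to(\eta+n,\,\xi+t(\mathbf{y}))\); the function name `conjugate_update` and the Bernoulli example are illustrative choices, not part of the formal development above.

```python
# A minimal sketch of the generic conjugate update for the exponential
# family: the prior hyperparameters (eta, xi) become (eta + n, xi + t(y)),
# where t(y) = sum_i u(y_i).
import numpy as np

def conjugate_update(eta, xi, y, u=lambda y: y):
    """Return the posterior hyperparameters (eta + n, xi + t(y))."""
    y = np.asarray(y)
    return eta + len(y), xi + u(y).sum()

# Example with Bernoulli data, where u(y) = y and t(y) counts the successes
y = [1, 0, 1, 1, 0, 1]
print(conjugate_update(eta=2, xi=1, y=y))  # (8, 5)
```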
3.2. Binomial likelihood#
Let \(Y_1,\ldots,Y_n|\theta\overset{iid}{\sim}\textsf{Bernoulli}(\theta)\); then

\[
p(\mathbf{y}|\theta) = \theta^{\sum_{i=1}^n y_i}(1-\theta)^{n-\sum_{i=1}^n y_i} = (1-\theta)^n \exp\left\{\log\left(\frac{\theta}{1-\theta}\right)\sum_{i=1}^n y_i\right\},
\]

so \(c(\theta)=1-\theta\), \(\phi(\theta)=\log\left(\frac{\theta}{1-\theta}\right)\), and \(t(\mathbf{y})=\sum_{i=1}^n y_i\).
Therefore, the conjugate prior is given by

\[
p(\theta) \propto (1-\theta)^{\eta} \exp\left\{\xi \log\left(\frac{\theta}{1-\theta}\right)\right\} = \theta^{\xi}(1-\theta)^{\eta-\xi};
\]
note that we can interpret \(\eta\) and \(\xi\) as:
- \(\eta\): number of Bernoulli experiments a priori,
- \(\xi\): number of successes a priori,

so \(\eta-\xi\) would be the number of failures a priori.
Let \(\alpha-1=\xi\) and \(\beta-1=\eta-\xi\); then the conjugate prior can be expressed as

\[
p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1},
\]
which we recognize as the kernel of a \(\textsf{Beta}(\alpha,\beta)\) distribution.
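As a quick numerical sketch of the Beta–Bernoulli update (the hyperparameter values and the simulated data below are arbitrary illustrations):

```python
# With prior Beta(alpha, beta) and s successes in n Bernoulli trials,
# the posterior is Beta(alpha + s, beta + n - s).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, beta = 2.0, 2.0              # prior hyperparameters
y = rng.binomial(1, 0.7, size=50)   # simulated Bernoulli trials, theta = 0.7

s, n = y.sum(), len(y)
posterior = stats.beta(alpha + s, beta + n - s)
print(posterior.mean())             # posterior mean of theta
print(posterior.interval(0.95))     # 95% credible interval
```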
3.3. Poisson likelihood#
Let \(Y_1,\ldots,Y_n|\theta\overset{iid}{\sim}\textsf{Poisson}(\theta)\); then

\[
p(\mathbf{y}|\theta) = \prod_{i=1}^n \frac{\theta^{y_i} e^{-\theta}}{y_i!} \propto e^{-n\theta} \exp\left\{\log(\theta)\sum_{i=1}^n y_i\right\},
\]

so \(c(\theta)=e^{-\theta}\), \(\phi(\theta)=\log(\theta)\), and \(t(\mathbf{y})=\sum_{i=1}^n y_i\).
Therefore, the conjugate prior is given by

\[
p(\theta) \propto e^{-\eta\theta} \exp\left\{\xi\log(\theta)\right\} = \theta^{\xi} e^{-\eta\theta}.
\]
Let \(\xi=\alpha-1\) and \(\eta=\beta\); then the conjugate prior can be expressed as

\[
p(\theta) \propto \theta^{\alpha-1} e^{-\beta\theta},
\]
which we recognize as the kernel of a \(\textsf{Gamma}(\alpha,\beta)\) distribution.
Note that \(\alpha-1\) corresponds to the sum of prior counts and \(\beta\) to the number of prior experiments. Therefore, if we have not performed any experiments previously, we can set \(\alpha=1\) and \(\beta=0\), in which case \(p(\theta)\propto 1_{(0,\infty)}(\theta)\).
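A minimal sketch of the Gamma–Poisson update, using the “no prior experiments” hyperparameters just discussed (the simulated data are illustrative):

```python
# With prior Gamma(alpha, beta) in the rate parametrization, the posterior
# after observing Poisson counts y_1, ..., y_n is Gamma(alpha + sum(y), beta + n).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, beta = 1.0, 0.0
y = rng.poisson(3.5, size=40)       # simulated counts, theta = 3.5

a_n, b_n = alpha + y.sum(), beta + len(y)
posterior = stats.gamma(a=a_n, scale=1.0 / b_n)  # scipy uses scale = 1/rate
print(posterior.mean(), posterior.std())
```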
3.4. Normal likelihood with unknown mean#
Let \(Y_1,\ldots,Y_n|\theta,\sigma^2\overset{iid}{\sim}\textsf{Normal}(\theta, \sigma^2)\) and assume \(\sigma^2>0\) known. Then

\[
p(\mathbf{y}|\theta) \propto \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\theta)^2\right\} \propto \left(\exp\left\{-\frac{\theta^2}{2\sigma^2}\right\}\right)^{n} \exp\left\{\frac{\theta}{\sigma^2}\sum_{i=1}^n y_i\right\},
\]

so \(c(\theta)=\exp\left\{-\frac{\theta^2}{2\sigma^2}\right\}\), \(\phi(\theta)=\frac{\theta}{\sigma^2}\), and \(t(\mathbf{y})=\sum_{i=1}^n y_i\).
Thus, the conjugate prior is given by:

\[
p(\theta) \propto \exp\left\{-\frac{\eta\theta^2}{2\sigma^2}\right\} \exp\left\{\frac{\xi\theta}{\sigma^2}\right\} \propto \exp\left\{-\frac{\eta}{2\sigma^2}\left(\theta - \frac{\xi}{\eta}\right)^2\right\},
\]

which corresponds to the kernel of a normal distribution with mean \(\xi/\eta\) and variance \(\sigma^2/\eta\).
Note that \(\eta\) corresponds to the prior “sample size” and \(\xi\) to the prior value of the statistic \(\sum_{i=1}^n y_i\); thus \(\xi/\eta\) corresponds to the prior value of \(\bar{y}\) and \(\sigma^2/\eta\) to the variance of such an average.
3.4.1. Posterior distribution#
Let \(Y_1,\ldots,Y_n|\theta,\sigma^2\overset{iid}{\sim}\textsf{Normal}(\theta, \sigma^2)\), assume \(\sigma^2>0\) known, and consider the prior distribution \(\theta\sim\textsf{Normal}(\mu_0,\tau_0^2)\). Find the posterior distribution of \(\theta\).
Click the button to reveal the answer!
\[
\theta|\mathbf{Y} \sim \textsf{Normal}(\mu_n, \tau_n^2),
\]

where

\[
\mu_n = \frac{\frac{\mu_0}{\tau_0^2} + \frac{n\bar{y}}{\sigma^2}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}} \qquad\text{and}\qquad \tau_n^2 = \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1}.
\]
Precision parameter for the normal distribution
In Bayesian statistics it is common to parametrize the normal distribution using \(\lambda=\sigma^{-2}\), which is known as the “precision” of the distribution. Be that as it may, I won’t use this convention, and will parametrize the normal distribution in terms of its mean and variance.
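As an illustration of the posterior formulas above, here is a minimal sketch; the function name `normal_posterior` and the simulated data are illustrative choices.

```python
# The posterior precision is the sum of the prior and data precisions, and
# mu_n is the precision-weighted average of mu_0 and the sample mean.
import numpy as np

def normal_posterior(y, sigma2, mu0, tau02):
    """Return (mu_n, tau2_n) for theta | y with known variance sigma2."""
    y = np.asarray(y)
    n, ybar = len(y), y.mean()
    tau2_n = 1.0 / (1.0 / tau02 + n / sigma2)
    mu_n = tau2_n * (mu0 / tau02 + n * ybar / sigma2)
    return mu_n, tau2_n

rng = np.random.default_rng(1)
y = rng.normal(10.0, 2.0, size=25)  # sigma2 = 4 assumed known
print(normal_posterior(y, sigma2=4.0, mu0=0.0, tau02=100.0))
```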
3.4.2. Posterior predictive distribution#
Remember that the posterior predictive distribution can be obtained by marginalizing \(\theta\) out of the joint distribution \(p(Y,\theta|\mathbf{Y})=p(Y|\theta)p(\theta|\mathbf{Y})\), that is,

\[
p(Y|\mathbf{Y}) = \int p(Y|\theta)\, p(\theta|\mathbf{Y})\, d\theta.
\]
Note that the integrand is the exponential of a quadratic function of \((Y,\theta)\). Therefore, by properties of the bivariate normal distribution, \((Y,\theta|\mathbf{Y})\) has a bivariate normal distribution. Then \(Y|\mathbf{Y}\), being a marginal of such a bivariate distribution, must follow a normal distribution as well. Thus, all that remains is to calculate its mean and variance.
We can calculate the mean and variance of the posterior predictive distribution in the following way:

\[
E(Y|\mathbf{Y}) = E\left(E(Y|\theta)\,|\,\mathbf{Y}\right) = E(\theta|\mathbf{Y}) = \mu_n
\]

and

\[
\text{Var}(Y|\mathbf{Y}) = E\left(\text{Var}(Y|\theta)\,|\,\mathbf{Y}\right) + \text{Var}\left(E(Y|\theta)\,|\,\mathbf{Y}\right) = \sigma^2 + \tau_n^2.
\]
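We can check these two moments by Monte Carlo, drawing \(\theta\) from the posterior and then \(Y|\theta\); the posterior values used below are illustrative.

```python
# Monte Carlo check of the posterior predictive moments: compare the
# simulated mean and variance with mu_n and sigma2 + tau2_n.
import numpy as np

rng = np.random.default_rng(2)
mu_n, tau2_n, sigma2 = 9.8, 0.16, 4.0   # illustrative posterior values

theta = rng.normal(mu_n, np.sqrt(tau2_n), size=100_000)
y_new = rng.normal(theta, np.sqrt(sigma2))

print(y_new.mean(), mu_n)               # close to mu_n
print(y_new.var(), sigma2 + tau2_n)     # close to sigma2 + tau2_n
```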
3.5. Normal likelihood with unknown variance#
Consider \(Y_1,\ldots,Y_n\overset{iid}{\sim}\textsf{Normal}(\theta,\sigma^2)\) with \(\theta\) known; then

\[
p(\mathbf{y}|\sigma^2) \propto \left((\sigma^2)^{-1/2}\right)^{n} \exp\left\{-\frac{(n-2)\,t(\mathbf{y})}{2\sigma^2}\right\},
\]

where \(t(\mathbf{y})=\frac{1}{n-2}\sum_{i=1}^n(y_i-\theta)^2\) (assuming \(n>2\)). Thus, the conjugate prior of \(\sigma^2\) has the form

\[
p(\sigma^2) \propto (\sigma^2)^{-\eta/2} \exp\left\{-\frac{(\eta-2)\,\xi}{2\sigma^2}\right\};
\]
let \(\nu=\eta-2\); then

\[
p(\sigma^2) \propto (\sigma^2)^{-(\nu/2+1)} \exp\left\{-\frac{\nu\xi}{2\sigma^2}\right\},
\]
which corresponds to the kernel of a scaled inverse \(\chi^2\) distribution with \(\nu\) degrees of freedom and scale parameter \(\xi\), which we’ll denote as \(\textsf{Inverse-}\chi^2(\nu,\xi)\).
Prior distribution when we don’t have prior information
In the previous case, we denoted the degrees of freedom as \(\nu=\eta-2\). Remember that \(\eta\) corresponds to the “size” of a prior sample, so if we have not done any previous experiments, we can set \(\eta=0\) and \(\xi=0\), in which case the prior distribution for \(\sigma^2\) would be

\[
p(\sigma^2) \propto 1_{(0,\infty)}(\sigma^2),
\]

which follows the principle of indifference.
Inverse gamma or scaled inverse \(\chi^2\)
In the Bayesian literature, the conjugate prior for \(\sigma^2\) is often said to be an inverse gamma distribution. This is because the scaled inverse \(\chi^2\) distribution is a particular case of the inverse gamma distribution. The inverse gamma distribution has a shape parameter \(\alpha>0\) and a scale parameter \(\beta>0\); when \(\alpha=\nu/2\) and \(\beta=\nu\tau^2/2\), we obtain a scaled inverse \(\chi^2\) distribution with \(\nu\) degrees of freedom and scale parameter \(\tau^2\).

Be that as it may, I find it easier to interpret the parameters of the scaled inverse \(\chi^2\) distribution in the Bayesian paradigm than the parameters of the corresponding inverse gamma distribution, so I will use the former rather than the latter.
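For computations, a scaled inverse \(\chi^2\) distribution can be built from `scipy`’s inverse gamma using the correspondence above; the helper name `scaled_inv_chi2` is an illustrative choice.

```python
# Scaled inverse chi-squared as scipy's inverse gamma with
# alpha = nu / 2 and beta = nu * tau2 / 2 (beta is scipy's scale argument).
from scipy import stats

def scaled_inv_chi2(nu, tau2):
    """Inverse-chi^2(nu, tau2) as a frozen scipy distribution."""
    return stats.invgamma(a=nu / 2.0, scale=nu * tau2 / 2.0)

prior = scaled_inv_chi2(nu=5, tau2=2.0)
print(prior.mean())                  # nu * tau2 / (nu - 2) = 10/3 for nu > 2
print(prior.rvs(size=3, random_state=0))
```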
3.6. Predictive distribution for the Gamma-Poisson model#
Up to now we have said that the prior predictive distribution is calculated by marginalizing the joint distribution of \(\theta\) and \(Y\), that is,

\[
p(Y) = \int p(Y|\theta)\, p(\theta)\, d\theta;
\]

however, we can avoid the (generally) tedious calculation of the previous integral and find the prior predictive distribution using the formula:

\[
p(Y) = \frac{p(Y|\theta)\,p(\theta)}{p(\theta|Y)},
\]

which holds for any value of \(\theta\).
Consider the Gamma-Poisson model. Using the previous formula, we find that the prior predictive distribution has the form:

\[
p(y) = \frac{\Gamma(\alpha+y)}{\Gamma(\alpha)\, y!} \left(\frac{\beta}{\beta+1}\right)^{\alpha} \left(\frac{1}{\beta+1}\right)^{y}, \qquad y=0,1,2,\ldots
\]
That is, \(Y\) follows a negative binomial distribution with shape parameter \(\alpha\) and inverse scale parameter \(\beta\).
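We can verify this result numerically by simulating from the Gamma-Poisson mixture and comparing against `scipy`’s negative binomial pmf, whose success probability is \(\beta/(\beta+1)\) under this parametrization; the hyperparameter values below are arbitrary.

```python
# Simulate theta from the gamma prior, then Y | theta from the Poisson,
# and compare the empirical pmf with the negative binomial pmf.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, beta = 3.0, 2.0

theta = rng.gamma(alpha, 1.0 / beta, size=200_000)  # Gamma(alpha, rate beta)
y = rng.poisson(theta)

for k in range(5):
    print((y == k).mean(), stats.nbinom.pmf(k, alpha, beta / (beta + 1)))
```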
Relation between prior and posterior distributions for conjugate models
One advantage of the conjugate models is that it is easy to determine the posterior distribution and the posterior predictive distribution if we know the posterior values of the hyperparameters.
Because the prior and posterior distributions belong to the same family of distributions, the prior and posterior predictive distributions also belong to the same family; we just have to replace the prior values of the hyperparameters with their posterior values.
3.7. Poisson model with exposure#
In many applications, it is convenient to extend the Poisson model to the form:

\[
Y|x,\theta \sim \textsf{Poisson}(x\theta);
\]

in epidemiology, the parameter \(\theta\) is called the rate and \(x\) is called the exposure.
Assume that we have a sample \(Y_i|x_i,\theta\overset{ind}{\sim}\textsf{Poisson}(x_i\theta)\). If we use a prior \(\theta\sim\textsf{Gamma}(\alpha,\beta)\), then it can be shown that

\[
\theta|\mathbf{y} \sim \textsf{Gamma}\left(\alpha + \sum_{i=1}^n y_i,\; \beta + \sum_{i=1}^n x_i\right).
\]
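A minimal sketch of this update; the exposures and counts below are made-up illustrative data, not taken from [GCS+13].

```python
# Poisson model with exposure: the posterior of the rate theta is
# Gamma(alpha + sum(y), beta + sum(x)).
import numpy as np
from scipy import stats

alpha, beta = 2.0, 1.0
x = np.array([100.0, 250.0, 80.0, 400.0])   # exposures (e.g., person-years)
y = np.array([3, 6, 1, 9])                  # observed counts

a_n, b_n = alpha + y.sum(), beta + x.sum()
posterior = stats.gamma(a=a_n, scale=1.0 / b_n)
print(posterior.mean())                     # posterior mean of the rate theta
print(posterior.interval(0.95))
```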
Check the following code in the course repository, where the Poisson model with exposure is used; both examples were taken from [GCS+13].
Cheatsheet of Bayesian models
I encourage you to make a cheatsheet of the Bayesian models presented in this book. I also encourage you to add the interpretation of the models and hyperparameters. It is usually pointless to try to memorize the priors, posteriors, hyperparameters, etc., but for real-data analyses it is useful to understand the models and to know how to interpret the parameters and hyperparameters.
Here I share with you the template of the cheatsheet that I use, but you are free to use your own, adding the information and comments that you consider important.
3.8. Exercises#
Calculate the posterior distribution for the Gamma-Poisson model.
Consider the following parametrization of the exponential distribution: \(p(Y|\theta)=\theta e^{-\theta y}1_{(0,\infty)}(y)\).
Using the results for the conjugate model for the exponential family, prove that the gamma distribution is conjugate for the exponential distribution.
Find its posterior distribution.
Interpret the hyperparameters and find the prior distribution when we don’t have prior information.
Consider the case of the normal distribution with unknown mean \(\theta\) but known variance \(\sigma^2\), and take as prior for \(\theta\) a normal distribution. Interpret the hyperparameters, and from this interpretation find the prior distribution when we don’t have prior information.