7. Normal approximation#

Assume that \(Y_1,\ldots,Y_n\) are i.i.d. random variables with common density \(f(Y)\), but that we, in our ignorance, model them with the likelihood \(p(Y|\theta)\). Let \(\theta_0\) be the value that minimizes the Kullback-Leibler divergence between \(f(Y)\) and \(p(Y|\theta)\).

Under regularity conditions, for \(n\) sufficiently large, it can be shown that

\[\theta|\mathbf{Y}\overset{\cdot}{\sim}\textsf{Normal}\left(\theta_0, (nJ(\theta_0))^{-1}\right),\]

where \(J(\theta_0)=-\mathbb{E}_f\left[\frac{d^2}{d\theta^2}\log p(Y|\theta)\right]_{\theta=\theta_0}\) is the Fisher information of the model evaluated at \(\theta_0\).
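Before moving on, here is a minimal numerical sketch of \(\theta_0\) under misspecification. The true density \(f\) is a \(\textsf{Gamma}(2,1)\) and the working model is \(\textsf{Exponential}(\theta)\); these choices are illustrative, not from the course material. Since only the cross term of the Kullback-Leibler divergence depends on \(\theta\), minimizing the divergence amounts to minimizing \(\mathbb{E}_f[-\log p(Y|\theta)]\).

```python
# Minimal sketch (illustrative model choices): theta_0 minimizes
# KL(f || p(.|theta)), equivalently E_f[-log p(Y|theta)].
import numpy as np
from scipy import integrate, optimize, stats

f = stats.gamma(a=2, scale=1)  # true density f(y)

def neg_expected_loglik(theta):
    # E_f[-log p(Y|theta)] for the working model p(y|theta) = theta * exp(-theta * y)
    integrand = lambda y: -(np.log(theta) - theta * y) * f.pdf(y)
    return integrate.quad(integrand, 0, np.inf)[0]

res = optimize.minimize_scalar(neg_expected_loglik, bounds=(1e-3, 10), method="bounded")
print(res.x)  # ~0.5, the KL-minimizing value theta_0
```

For the exponential model the minimizer is available in closed form, \(\theta_0=1/\mathbb{E}_f[Y]=0.5\), which the numerical search recovers.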

See also

Entropy and the Kullback-Leibler divergence are fundamental concepts in machine learning, statistics, and information theory. For readers interested in entropy and its use in machine learning algorithms, I recommend the following video.

For those interested in the relationship between entropy, the Kullback-Leibler divergence, and maximum likelihood, here you can find some notes that I wrote.

Since entropy and the Kullback-Leibler divergence are concepts strongly associated with the field of information theory, I recommend the following references for those interested [CT06, Yeu02].

7.1. Observed Fisher information#

If \(\hat{\theta}\) denotes the mode of the posterior distribution, known as the maximum a posteriori (MAP), then under the same regularity conditions and for \(n\) sufficiently large,

\[\hat{\theta}\approx\theta_0.\]

On the other hand, in the Bayesian framework and under regularity conditions, we define the observed Fisher information as

\[I(\hat{\theta})=-\left[\frac{d^2}{d\theta^2}\log p(\theta|\mathbf{Y})\right]_{\theta=\hat{\theta}}.\]

Note that we can write this expression as:

\[\begin{split} \begin{align*} I(\hat{\theta}) &= -\left[\frac{d^2}{d\theta^2}\log p(\theta|\mathbf{Y})\right]_{\theta=\hat{\theta}} \\\\ & = -\frac{d^2}{d\theta^2}\Big[\log p(\mathbf{Y}|\theta) + \log p(\theta) - \log p(\mathbf{Y})\Big]_{\theta=\hat{\theta}} \\\\ & = -\left[\frac{d^2}{d\theta^2}\log p(\mathbf{Y}|\theta)\right]_{\theta=\hat{\theta}} - \left[\frac{d^2}{d\theta^2} \log p(\theta)\right]_{\theta=\hat{\theta}}, \end{align*} \end{split}\]

and since we assume that \(Y_1,\ldots,Y_n\) are i.i.d., we have

\[\begin{split} \begin{align*} I(\hat{\theta}) &= -\sum_{i=1}^n\left[\frac{d^2}{d\theta^2}\log p(Y_i|\theta)\right]_{\theta=\hat{\theta}} - \left[\frac{d^2}{d\theta^2} \log p(\theta)\right]_{\theta=\hat{\theta}} \\\\ &= -n\left[\frac{1}{n}\sum_{i=1}^n\frac{d^2}{d\theta^2}\log p(Y_i|\theta)\right]_{\theta=\hat{\theta}} - \left[\frac{d^2}{d\theta^2} \log p(\theta)\right]_{\theta=\hat{\theta}}. \end{align*} \end{split}\]

By the law of large numbers, under regularity conditions and for \(n\) sufficiently large,

\[-\frac{1}{n}\sum_{i=1}^n\left[\frac{d^2}{d\theta^2}\log p(Y_i|\theta)\right]_{\theta=\hat{\theta}} \approx J(\theta_0),\]

while the second term on the right-hand side (the contribution of the prior) becomes negligible.

Thus, under regularity conditions and for \(n\) sufficiently large,

\[I(\hat{\theta})\approx nJ(\theta_0)\]

and

\[\theta|\mathbf{Y}\overset{\cdot}{\sim}\textsf{Normal}\left(\hat{\theta},\left(I(\hat{\theta})\right)^{-1}\right).\]
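The approximation can also be carried out numerically when the MAP and the observed information lack convenient closed forms. The following sketch does this for a Gamma-Exponential model; the simulated data, hyperparameters, and finite-difference step are my own illustrative choices, not taken from the course material.

```python
# Minimal numerical sketch of Normal(theta_hat, 1/I(theta_hat)) for a
# Gamma-Exponential model (illustrative data and hyperparameters).
import numpy as np
from scipy import optimize

rng = np.random.default_rng(42)
y = rng.exponential(scale=2.0, size=100)  # simulated data
a, b = 2.0, 1.0                           # Gamma(a, b) prior, rate parameterization

def neg_log_post(theta):
    # -log p(theta|Y) up to a constant: Gamma(a + n, b + sum(y)) kernel
    return -((a + len(y) - 1) * np.log(theta) - (b + y.sum()) * theta)

theta_hat = optimize.minimize_scalar(neg_log_post, bounds=(1e-6, 10), method="bounded").x

# observed information via a central finite difference of the second derivative
h = 1e-4
info = (neg_log_post(theta_hat + h) - 2 * neg_log_post(theta_hat)
        + neg_log_post(theta_hat - h)) / h**2

# mean and variance of the approximate normal posterior; in closed form,
# theta_hat = (a + n - 1) / (b + sum(y)) and I(theta_hat) = (a + n - 1) / theta_hat^2
print(theta_hat, 1 / info)
```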

7.2. Region of approximately \((1-\alpha)\) posterior probability#

Now consider the second-order Taylor expansion of the log-posterior around the MAP \(\hat{\theta}\) (the first-order term vanishes because \(\hat{\theta}\) is a mode),

\[\begin{split} \begin{align*} \log p(\theta|\mathbf{Y})-\log p(\hat{\theta}|\mathbf{Y}) &\approx \frac{1}{2}(\theta-\hat{\theta})^T\left[\frac{d^2}{d\theta^2}\log p(\theta|\mathbf{Y})\right]_{\theta=\hat{\theta}}(\theta-\hat{\theta})\\\\ \Rightarrow -2\log \frac{p(\theta|\mathbf{Y})}{p(\hat{\theta}|\mathbf{Y})} &\approx (\theta-\hat{\theta})^T\left[-\frac{d^2}{d\theta^2}\log p(\theta|\mathbf{Y})\right]_{\theta=\hat{\theta}}(\theta-\hat{\theta}). \end{align*} \end{split}\]

Thus, if \(\theta\) has dimension \(k\),

\[-2\log \frac{p(\theta|\mathbf{Y})}{p(\hat{\theta}|\mathbf{Y})}\overset{\cdot}{\sim}\chi^2_k.\]

Let \(q_{\chi^2_k}^{1-\alpha}\) be the quantile of probability \(1-\alpha\) of a \(\chi^2_k\) distribution, i.e.

\[\mathbb{P}\left(\chi^2_k\leq q_{\chi^2_k}^{1-\alpha}\right)=1-\alpha.\]

Therefore,

\[\begin{split} \begin{align*} \mathbb{P}\left[-2\log \frac{p(\theta|\mathbf{Y})}{p(\hat{\theta}|\mathbf{Y})}\leq q_{\chi^2_k}^{1-\alpha}\right] & \approx 1-\alpha \\\\ \Rightarrow \mathbb{P}\left[\frac{p(\theta|\mathbf{Y})}{p(\hat{\theta}|\mathbf{Y})}\geq \exp\left\{-\frac{q_{\chi^2_k}^{1-\alpha}}{2}\right\}\right] & \approx 1-\alpha. \end{align*} \end{split}\]

That is, the region of \(\Theta\), given by

\[R(\Theta)=\left\{\theta: \frac{p(\theta|\mathbf{Y})}{p(\hat{\theta}|\mathbf{Y})}\geq \exp\left\{-\frac{q_{\chi^2_k}^{1-\alpha}}{2}\right\}\right\}\]

corresponds to a region of approximately \(1-\alpha\) posterior probability.
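As a minimal illustration with \(k=1\), the sketch below recovers such a region on a grid for a \(\textsf{Gamma}(a_n,b_n)\) posterior; the hyperparameter values are hypothetical and serve only to show the computation.

```python
# Minimal sketch (k = 1, hypothetical Gamma posterior): build the region
# {theta : p(theta|Y)/p(theta_hat|Y) >= exp(-q/2)} on a grid.
import numpy as np
from scipy import stats

a_n, b_n = 102.0, 195.0                # hypothetical posterior hyperparameters
post = stats.gamma(a=a_n, scale=1 / b_n)
theta_hat = (a_n - 1) / b_n            # posterior mode (MAP)

alpha = 0.05
q = stats.chi2.ppf(1 - alpha, df=1)    # (1 - alpha) quantile of chi^2_k, k = 1

grid = np.linspace(1e-4, 2.0, 20_000)
in_region = post.pdf(grid) / post.pdf(theta_hat) >= np.exp(-q / 2)
lo, hi = grid[in_region][0], grid[in_region][-1]

print(lo, hi)                           # endpoints of the region
print(post.cdf(hi) - post.cdf(lo))      # ~0.95, the attained posterior probability
```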

7.3. Normal approximation of the Beta-Binomial model#

Let \(Y_1,\ldots,Y_n|\theta\overset{iid}{\sim}\textsf{Bernoulli}(\theta)\), and consider the prior \(\theta\sim\textsf{Beta}(\alpha,\beta)\). Then \(\theta|\mathbf{Y}\sim\textsf{Beta}(\alpha_n,\beta_n)\), where \(\alpha_n=\alpha+\sum_{i=1}^n Y_i\) and \(\beta_n=\beta+n-\sum_{i=1}^n Y_i\), so

\[p(\theta|\mathbf{Y})=\text{constant}\times\theta^{\alpha_n-1}(1-\theta)^{\beta_n-1}\]

and

\[\log p(\theta|\mathbf{Y})=\text{constant}+(\alpha_n-1)\log\theta+(\beta_n-1)\log(1-\theta).\]

We now compute the first derivative,

\[\frac{d}{d\theta}\log p(\theta|\mathbf{Y})=\frac{\alpha_n-1}{\theta}-\frac{\beta_n-1}{1-\theta}.\]

Assume that \(\alpha_n,\beta_n>1\). Then, the MAP of \(\theta\), \(\hat{\theta}\), satisfies

\[\begin{split} \begin{align*} \frac{1-\hat{\theta}}{\hat{\theta}} & = \frac{\beta_n-1}{\alpha_n-1} \\\\ \Rightarrow \frac{1}{\hat{\theta}} & = \frac{\beta_n-1}{\alpha_n-1}+1 \\\\ \Rightarrow \hat{\theta} & = \frac{\alpha_n-1}{\alpha_n+\beta_n-2}, \end{align*} \end{split}\]

which, of course, corresponds to the mode of a \(\textsf{Beta}(\alpha_n,\beta_n)\) distribution.

Now we calculate the second derivative and evaluate it at the MAP:

\[\left[\frac{d^2}{d\theta^2}\log p(\theta|\mathbf{Y})\right]_{\hat{\theta}}=-\frac{\alpha_n-1}{\hat{\theta}^2}-\frac{\beta_n-1}{(1-\hat{\theta})^2}.\]

Note that

\[1-\hat{\theta}=\hat{\theta}\left(\frac{\beta_n-1}{\alpha_n-1}\right),\]

so

\[\begin{split} \begin{align*} \left[\frac{d^2}{d\theta^2}\log p(\theta|\mathbf{Y})\right]_{\hat{\theta}} & = -\frac{\alpha_n-1}{\hat{\theta}^2}-\frac{(\alpha_n-1)^2}{\hat{\theta}^2(\beta_n-1)} \\\\ & = -\frac{\alpha_n-1}{\hat{\theta}^2}\left(1+\frac{\alpha_n-1}{\beta_n-1}\right) \\\\ & = -\frac{\alpha_n-1}{\hat{\theta}^2}\left(\frac{\alpha_n+\beta_n-2}{\beta_n-1}\right) \\\\ & = -\frac{\alpha_n+\beta_n-2}{\hat{\theta}(1-\hat{\theta})}, \end{align*} \end{split}\]

where the last equality is obtained by noting that

\[\frac{\alpha_n-1}{\beta_n-1}=\frac{\hat{\theta}}{1-\hat{\theta}}.\]

Thus,

\[\theta|\mathbf{Y}\overset{\cdot}{\sim}\textsf{Normal}\left(\hat{\theta},\frac{\hat{\theta}(1-\hat{\theta})}{\alpha_n+\beta_n-2}\right),\]

where

\[\hat{\theta}=\frac{\alpha_n-1}{\alpha_n+\beta_n-2}.\]

If we set \(\alpha=1\) and \(\beta=1\) (a uniform prior), then \(\alpha_n=1+\sum_{i=1}^n Y_i\) and \(\beta_n=1+n-\sum_{i=1}^n Y_i\), so \(\hat{\theta}=\bar{Y}\) and

\[\theta|\mathbf{Y}\overset{\cdot}{\sim}\textsf{Normal}\left(\hat{\theta},\frac{\hat{\theta}(1-\hat{\theta})}{n}\right).\]

In the notebook 07_NormalApproximation.ipynb in the course repository, there are two examples of the normal approximation to the posterior distribution, one for the Beta-Binomial model and the other for the Gamma-Exponential model.
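As a minimal sketch of the Beta-Binomial comparison (with simulated data and a seed of my own choosing; the notebook may differ in its details), one can compare a few quantiles of the exact \(\textsf{Beta}(\alpha_n,\beta_n)\) posterior with those of its normal approximation:

```python
# Minimal sketch: exact Beta posterior vs. its normal approximation
# (simulated data and seed are illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, theta_true = 200, 0.3
y = rng.binomial(1, theta_true, size=n)

alpha0, beta0 = 1.0, 1.0                 # uniform prior
a_n = alpha0 + y.sum()
b_n = beta0 + n - y.sum()

theta_hat = (a_n - 1) / (a_n + b_n - 2)  # MAP
var_hat = theta_hat * (1 - theta_hat) / (a_n + b_n - 2)

exact = stats.beta(a_n, b_n)
approx = stats.norm(loc=theta_hat, scale=np.sqrt(var_hat))

for p in (0.025, 0.5, 0.975):            # compare a few posterior quantiles
    print(p, exact.ppf(p), approx.ppf(p))
```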

7.4. Normal approximation for the normal likelihood considering the conjugate prior#

As mentioned in Chapter 4, if \(Y_1,\ldots,Y_n|\mu,\sigma^2\overset{iid}{\sim}\textsf{Normal}(\mu,\sigma^2)\), then the conjugate prior is given by the Normal-Inverse-\(\chi^2\) (NI\(\chi^2\)) distribution:

\[\begin{split} \begin{align*} \mu|\sigma^2 &\sim \textsf{Normal}(\mu_0,\sigma^2/\kappa_0) \\ \sigma^2 &\sim \textsf{Inverse-}\chi^2(\nu_0,\sigma_0^2), \end{align*} \end{split}\]

whose posterior distributions are given by

\[\begin{split} \begin{align*} \mu|\sigma^2,\mathbf{Y} &\sim \textsf{Normal}(\mu_n,\sigma^2/\kappa_n)\\ \sigma^2|\mathbf{Y} &\sim \textsf{Inverse-}\chi^2(\nu_n,\sigma_n^2), \end{align*} \end{split}\]

where

\[\begin{split} \begin{align*} \mu_n &= \frac{\kappa_0}{\kappa_0+n}\mu_0 + \frac{n}{\kappa_0+n}\bar{y} \\ \kappa_n &= \kappa_0+n \\ \nu_n &= \nu_0+n \\ \nu_n\sigma_n^2 &= \nu_0\sigma_0^2+(n-1)s^2+\frac{\kappa_0 n}{\kappa_0+n}(\bar{y}-\mu_0)^2. \end{align*} \end{split}\]

Then

\[\begin{split} \begin{align*} p(\mu,\sigma^2|\mathbf{Y})= & \text{constant} \times(\sigma^2)^{-1/2}\exp\left\{-\frac{\kappa_n}{2\sigma^2}(\mu-\mu_n)^2\right\} \\\\ & \times(\sigma^2)^{-(\nu_n/2+1)}\exp\left\{-\frac{\nu_n\sigma_n^2}{2\sigma^2}\right\}1_{\mathbb{R}}(\mu)1_{(0,\infty)}(\sigma^2), \end{align*} \end{split}\]

so

\[\log p(\mu,\sigma^2|\mathbf{Y})=\text{constant}-\left(\frac{\nu_n+3}{2}\right)\log(\sigma^2)-\frac{\kappa_n}{2\sigma^2}(\mu-\mu_n)^2-\frac{\nu_n\sigma_n^2}{2\sigma^2}.\]

We now calculate the first derivative of the log-posterior with respect to \(\mu\):

\[\frac{\partial}{\partial\mu}\log p(\mu,\sigma^2|\mathbf{Y})=-\frac{\kappa_n}{\sigma^2}(\mu-\mu_n),\]

from which we readily recognize that the MAP of \(\mu\) is given by \(\hat{\mu}=\mu_n\).

The first derivative of the log-posterior with respect to \(\sigma^2\) is given by

\[\frac{\partial}{\partial\sigma^2}\log p(\mu,\sigma^2|\mathbf{Y})=-\left(\frac{\nu_n+3}{2}\right)\frac{1}{\sigma^2}+\frac{\kappa_n}{2\sigma^4}(\mu-\mu_n)^2+\frac{\nu_n\sigma_n^2}{2\sigma^4}.\]

Then the MAP of \(\sigma^2\), \(\hat{\sigma}^2\), satisfies

\[-\left(\frac{\nu_n+3}{2}\right)\frac{1}{\hat\sigma^2}+\frac{\nu_n\sigma_n^2}{2\hat\sigma^4}=0,\]

from which we get

\[\hat{\sigma}^2=\frac{\nu_n}{\nu_n+3}\sigma_n^2.\]

We now calculate the second derivatives of the log-posterior:

\[\begin{split} \begin{align*} \frac{\partial^2}{\partial\mu^2}\log p(\mu,\sigma^2|\mathbf{Y}) & = -\frac{\kappa_n}{\sigma^2}, \\\\ \frac{\partial^2}{\partial\mu\partial\sigma^2}\log p(\mu,\sigma^2|\mathbf{Y}) & = \frac{\kappa_n}{\sigma^4}(\mu-\mu_n), \\\\ \frac{\partial^2}{\partial(\sigma^2)^2}\log p(\mu,\sigma^2|\mathbf{Y}) & = \frac{\nu_n+3}{2\sigma^4}-\frac{\kappa_n}{\sigma^6}(\mu-\mu_n)^2-\frac{\nu_n\sigma_n^2}{\sigma^6}. \end{align*} \end{split}\]

We then evaluate these expressions at the MAP \((\hat{\mu},\hat{\sigma}^2)\):

\[\begin{split} \begin{align*} \left[\frac{\partial^2}{\partial\mu^2}\log p(\mu,\sigma^2|\mathbf{Y})\right]_{(\hat{\mu},\hat{\sigma}^2)} & = -\frac{\kappa_n}{\hat\sigma^2}, \\\\ \left[\frac{\partial^2}{\partial\mu\partial\sigma^2}\log p(\mu,\sigma^2|\mathbf{Y})\right]_{(\hat{\mu},\hat{\sigma}^2)} & = 0, \\\\ \left[\frac{\partial^2}{\partial(\sigma^2)^2}\log p(\mu,\sigma^2|\mathbf{Y})\right]_{(\hat{\mu},\hat{\sigma}^2)} & = \frac{\nu_n+3}{2\hat\sigma^4}-\frac{\nu_n\sigma_n^2}{\hat\sigma^6} \\ & = \frac{\nu_n+3}{2\hat\sigma^4}-\frac{\nu_n+3}{\hat\sigma^4} \\ & = -\frac{\nu_n+3}{2\hat{\sigma}^4}, \end{align*} \end{split}\]

where the last simplification uses \(\hat{\sigma}^2=\nu_n\sigma_n^2/(\nu_n+3)\), so that \(\nu_n\sigma_n^2/\hat{\sigma}^6=(\nu_n+3)/\hat{\sigma}^4\).

Then,

\[\begin{split}\mu,\sigma^2|\mathbf{Y}\overset{\cdot}{\sim}\textsf{Normal}\left( \begin{pmatrix} \hat{\mu} \\ \hat{\sigma}^2 \end{pmatrix}, \begin{pmatrix} \frac{\hat{\sigma}^2}{\kappa_n} & 0 \\ 0 & \frac{2}{\nu_n+3}\hat{\sigma}^4 \end{pmatrix} \right),\end{split}\]

where

\[\hat{\mu}=\mu_n\text{ and }\hat{\sigma}^2=\frac{\nu_n}{\nu_n+3}\sigma_n^2.\]
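The following sketch computes the posterior hyperparameters and the resulting bivariate normal approximation; the simulated data, seed, and NI\(\chi^2\) hyperparameters are illustrative choices of mine, not taken from the course material.

```python
# Minimal sketch: bivariate normal approximation for the conjugate
# Normal model (illustrative data and hyperparameters).
import numpy as np

rng = np.random.default_rng(123)
y = rng.normal(loc=5.0, scale=2.0, size=50)
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

mu0, kappa0, nu0, sigma02 = 0.0, 1.0, 1.0, 1.0  # NI-chi^2 hyperparameters

# posterior updates
kappa_n = kappa0 + n
mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
nu_n = nu0 + n
sigma2_n = (nu0 * sigma02 + (n - 1) * s2
            + kappa0 * n / kappa_n * (ybar - mu0) ** 2) / nu_n

# MAP and covariance matrix of the normal approximation
mu_hat = mu_n
sigma2_hat = nu_n / (nu_n + 3) * sigma2_n
cov = np.diag([sigma2_hat / kappa_n, 2 * sigma2_hat**2 / (nu_n + 3)])

print(np.array([mu_hat, sigma2_hat]))  # approximate posterior mean of (mu, sigma^2)
print(cov)
```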

7.4.1. Normal approximation considering the reference prior#

As discussed in Chapter 6, the reference prior is given by

\[p(\mu,\sigma^2)\propto\frac{1}{\sigma^2}1_{\mathbb{R}}(\mu)1_{(0,\infty)}(\sigma^2),\]

which is obtained by setting the hyperparameters to the following values:

\[\kappa_0=0,\quad \mu_0\in\mathbb{R},\quad\nu_0=-1,\quad\sigma_0^2=0,\]

then

\[\kappa_n=n,\quad \mu_n=\bar{y},\quad\nu_n=n-1,\quad\sigma_n^2=s^2,\]

with

\(s^2=\frac{1}{n-1}\sum_{i=1}^n(y_i-\bar{y})^2\).

Therefore, if we take the reference prior, we get

\[\begin{split}\mu,\sigma^2|\mathbf{Y}\overset{\cdot}{\sim}\textsf{Normal}\left( \begin{pmatrix} \bar{y} \\ \hat{\sigma}^2 \end{pmatrix}, \begin{pmatrix} \frac{\hat{\sigma}^2}{n} & 0 \\ 0 & \frac{2}{n+2}\hat{\sigma}^4 \end{pmatrix} \right), \end{split}\]

where

\[\hat{\sigma}^2=\frac{n-1}{n+2}s^2.\]
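As a quick, self-contained check of this special case (again with illustrative simulated data), the approximation reduces to quantities computed directly from the sample:

```python
# Minimal check of the reference-prior approximation (illustrative data).
import numpy as np

rng = np.random.default_rng(123)
y = rng.normal(loc=5.0, scale=2.0, size=50)
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

sigma2_hat = (n - 1) / (n + 2) * s2
cov = np.diag([sigma2_hat / n, 2 * sigma2_hat**2 / (n + 2)])

print(ybar, sigma2_hat)  # approximate posterior mean of (mu, sigma^2)
print(cov)               # diagonal covariance of the approximation
```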

7.5. Exercises#