7. Normal approximation
Assume that \(Y_1,\ldots,Y_n\) are i.i.d. random variables with common density \(f(Y)\), but we, in our ignorance, model them with the likelihood \(p(Y|\theta)\). Let \(\theta_0\) be the value that minimizes the Kullback-Leibler divergence between \(f(Y)\) and \(p(Y|\theta)\).
Under regularity conditions, for \(n\) sufficiently large, it can be shown that
\[\theta|\mathbf{Y}\overset{\cdot}{\sim}\textsf{Normal}\left(\theta_0, (nJ(\theta_0))^{-1}\right).\]
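To make the definition of \(\theta_0\) concrete, here is a minimal numerical sketch, assuming a hypothetical misspecification in which the true density is \(\textsf{Normal}(1, 2^2)\) while the working model is \(\textsf{Normal}(\theta, 1)\); in this case \(\theta_0\) coincides with the true mean. The names and the example are illustrative, not part of the course material.

```python
import numpy as np
from scipy import integrate, optimize, stats

# Hypothetical misspecification: the data truly follow f = Normal(1, 2^2),
# but we model them with p(y|theta) = Normal(theta, 1).
f = stats.norm(loc=1.0, scale=2.0)

def kl_divergence(theta):
    """KL(f || p_theta) = E_f[log f(Y) - log p(Y|theta)], by numerical integration."""
    p_theta = stats.norm(loc=theta, scale=1.0)
    integrand = lambda y: f.pdf(y) * (f.logpdf(y) - p_theta.logpdf(y))
    value, _ = integrate.quad(integrand, -15.0, 15.0)
    return value

# theta_0 minimizes the KL divergence; here it should be E_f[Y] = 1.
result = optimize.minimize_scalar(kl_divergence, bounds=(-5.0, 5.0), method="bounded")
print(result.x)  # approximately 1.0
```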
See also
Entropy and the Kullback-Leibler divergence are fundamental concepts in machine learning, statistics, and information theory. For readers interested in entropy and its use in machine learning algorithms, I recommend the following video.
VIDEO
For those interested in the relationship between entropy, the Kullback-Leibler divergence, and maximum likelihood, here you can find some notes that I wrote.
Since entropy and the Kullback-Leibler divergence are concepts strongly associated with information theory, I recommend the following references for those interested [CT06, Yeu02].
7.1. Observed Fisher information
If \(\hat{\theta}\) denotes the mode of the posterior distribution, known as the maximum a posteriori (MAP) estimate, then under the same regularity conditions and for \(n\) sufficiently large,
\[\hat{\theta}\approx\theta_0.\]
On the other hand, we define the observed Fisher information, in the Bayesian framework and under regularity conditions, as
\[I(\hat{\theta})=-\left[\frac{d^2}{d\theta^2}\log p(\theta|\mathbf{Y})\right]_{\theta=\hat{\theta}}.\]
Note that we can write this expression as:
\[\begin{split}
\begin{align*}
I(\hat{\theta}) &= -\left[\frac{d^2}{d\theta^2}\log p(\theta|\mathbf{Y})\right]_{\theta=\hat{\theta}} \\\\
& = -\frac{d^2}{d\theta^2}\Big[\log p(\mathbf{Y}|\theta) + \log p(\theta) - \log p(\mathbf{Y})\Big]_{\theta=\hat{\theta}} \\\\
& = -\left[\frac{d^2}{d\theta^2}\log p(\mathbf{Y}|\theta)\right]_{\theta=\hat{\theta}} - \left[\frac{d^2}{d\theta^2} \log p(\theta)\right]_{\theta=\hat{\theta}},
\end{align*}
\end{split}\]
and, because we assume that \(Y_1,\ldots,Y_n\) are i.i.d., \(\log p(\mathbf{Y}|\theta)=\sum_{i=1}^n\log p(Y_i|\theta)\), so
\[\begin{split}
\begin{align*}
I(\hat{\theta}) &= -\sum_{i=1}^n\left[\frac{d^2}{d\theta^2}\log p(Y_i|\theta)\right]_{\theta=\hat{\theta}} - \left[\frac{d^2}{d\theta^2} \log p(\theta)\right]_{\theta=\hat{\theta}} \\\\
&= -n\left[\frac{1}{n}\sum_{i=1}^n\frac{d^2}{d\theta^2}\log p(Y_i|\theta)\right]_{\theta=\hat{\theta}} - \left[\frac{d^2}{d\theta^2} \log p(\theta)\right]_{\theta=\hat{\theta}}.
\end{align*}
\end{split}\]
By the law of large numbers, under regularity conditions and for \(n\) sufficiently large,
\[-\frac{1}{n}\sum_{i=1}^n\left[\frac{d^2}{d\theta^2}\log p(Y_i|\theta)\right]_{\theta=\hat{\theta}} \approx J(\theta_0),\]
and the second term on the right-hand side, the contribution of the prior, becomes negligible relative to the first, which grows with \(n\).
Thus, under regularity conditions and for \(n\) sufficiently large,
\[I(\hat{\theta})\approx nJ(\theta_0)\]
and
\[\theta|\mathbf{Y}\overset{\cdot}{\sim}\textsf{Normal}\left(\hat{\theta},\left(I(\hat{\theta})\right)^{-1}\right).\]
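In practice the MAP and the observed information can be computed numerically. Below is a minimal sketch, assuming a hypothetical Exponential(\(\theta\)) likelihood with a \(\textsf{Gamma}(2,1)\) prior; it is similar in spirit to, but not taken from, the course notebook, and the function and variable names are illustrative. The MAP is located with scipy.optimize and \(I(\hat{\theta})\) is replaced by a finite-difference second derivative.

```python
import numpy as np
from scipy import optimize, stats

def laplace_approximation(log_posterior, theta_init):
    """Normal approximation to a one-dimensional posterior: locate the MAP
    numerically and use the observed information I(theta_hat) as the precision."""
    # Minimize the negative log-posterior; the positivity bound is only needed
    # because the example below has theta > 0 (adapt it to your model).
    res = optimize.minimize(
        lambda v: -log_posterior(float(v[0])),
        x0=[theta_init],
        method="L-BFGS-B",
        bounds=[(1e-8, None)],
    )
    theta_hat = float(res.x[0])
    # Observed information: minus the second derivative of the log-posterior,
    # approximated here with a central finite difference.
    h = 1e-5 * max(1.0, abs(theta_hat))
    second = (log_posterior(theta_hat + h) - 2.0 * log_posterior(theta_hat)
              + log_posterior(theta_hat - h)) / h**2
    return stats.norm(loc=theta_hat, scale=np.sqrt(-1.0 / second))

# Hypothetical example: Exponential(theta) likelihood with a Gamma(2, 1) prior,
# so the log-posterior (up to an additive constant) is
# (n + 1) log(theta) - theta (sum(y) + 1).
y = np.array([0.8, 1.3, 0.4, 2.1, 0.9])
log_post = lambda t: (len(y) + 1) * np.log(t) - t * (y.sum() + 1.0)
approx = laplace_approximation(log_post, theta_init=1.0)
print(approx.mean(), approx.std())
```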
7.2. Region of approximately \((1-\alpha)\) posterior probability
On the other hand, consider the second-order Taylor expansion of the log-posterior around the MAP \(\hat{\theta}\); the first-order term vanishes because \(\hat{\theta}\) is a mode, so
\[\begin{split}
\begin{align*}
\log p(\theta|\mathbf{Y})-\log p(\hat{\theta}|\mathbf{Y}) &\approx \frac{1}{2}(\theta-\hat{\theta})^T\left[\frac{d^2}{d\theta^2}\log p(\theta|\mathbf{Y})\right]_{\theta=\hat{\theta}}(\theta-\hat{\theta})\\\\
\Rightarrow -2\log \frac{p(\theta|\mathbf{Y})}{p(\hat{\theta}|\mathbf{Y})} &\approx (\theta-\hat{\theta})^T\left[-\frac{d^2}{d\theta^2}\log p(\theta|\mathbf{Y})\right]_{\theta=\hat{\theta}}(\theta-\hat{\theta}).
\end{align*}
\end{split}\]
Thus, since under the normal approximation the quadratic form on the right-hand side follows approximately a chi-squared distribution, if \(\theta\) has dimension \(k\),
\[-2\log \frac{p(\theta|\mathbf{Y})}{p(\hat{\theta}|\mathbf{Y})}\overset{\cdot}{\sim}\chi^2_k.\]
Let \(q_{\chi^2_k}^{1-\alpha}\) be the quantile of probability \(1-\alpha\) of a \(\chi^2_k\) distribution, i.e.
\[\mathbb{P}\left(\chi^2_k\leq q_{\chi^2_k}^{1-\alpha}\right)=1-\alpha.\]
Therefore,
\[\begin{split}
\begin{align*}
\mathbb{P}\left[-2\log \frac{p(\theta|\mathbf{Y})}{p(\hat{\theta}|\mathbf{Y})}\leq q_{\chi^2_k}^{1-\alpha}\right] & \approx 1-\alpha \\\\
\Rightarrow \mathbb{P}\left[\frac{p(\theta|\mathbf{Y})}{p(\hat{\theta}|\mathbf{Y})}\geq \exp\left\{-\frac{q_{\chi^2_k}^{1-\alpha}}{2}\right\}\right] & \approx 1-\alpha.
\end{align*}
\end{split}\]
That is, the region of \(\Theta\) , given by
\[R(\Theta)=\left\{\theta: \frac{p(\theta|\mathbf{Y})}{p(\hat{\theta}|\mathbf{Y})}\geq \exp\left\{-\frac{q_{\chi^2_k}^{1-\alpha}}{2}\right\}\right\}\]
corresponds to a region of approximately \(1-\alpha\) posterior probability.
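A minimal sketch of how to check whether a given \(\theta\) belongs to \(R(\Theta)\), reusing the hypothetical Exponential-Gamma log-posterior from the previous sketch; since only the ratio \(p(\theta|\mathbf{Y})/p(\hat{\theta}|\mathbf{Y})\) is needed, the normalizing constant cancels and the unnormalized log-posterior suffices.

```python
import numpy as np
from scipy import stats

def in_posterior_region(log_post_theta, log_post_map, k, alpha=0.05):
    """Check whether theta lies in the approximate (1 - alpha) posterior region,
    i.e. whether p(theta|Y)/p(theta_hat|Y) >= exp(-q/2), with q the chi^2_k quantile."""
    q = stats.chi2.ppf(1 - alpha, df=k)
    # Work on the log scale: log p(theta|Y) - log p(theta_hat|Y) >= -q/2.
    return (log_post_theta - log_post_map) >= -q / 2

# Hypothetical illustration with the Exponential-Gamma log-posterior used above.
y = np.array([0.8, 1.3, 0.4, 2.1, 0.9])
log_post = lambda t: (len(y) + 1) * np.log(t) - t * (y.sum() + 1.0)
theta_hat = (len(y) + 1) / (y.sum() + 1.0)                     # MAP in closed form
print(in_posterior_region(log_post(0.5), log_post(theta_hat), k=1))   # True
print(in_posterior_region(log_post(3.0), log_post(theta_hat), k=1))   # False
```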
7.3. Normal approximation of the Beta-Binomial model
Let \(Y_1,\ldots,Y_n|\theta\overset{iid}{\sim}\textsf{Bernoulli}(\theta)\) and consider the prior \(\theta\sim\textsf{Beta}(\alpha,\beta)\); then \(\theta|\mathbf{Y}\sim\textsf{Beta}(\alpha_n,\beta_n)\), where \(\alpha_n=\alpha+\sum_{i=1}^n Y_i\) and \(\beta_n=\beta+n-\sum_{i=1}^n Y_i\), so
\[p(\theta|\mathbf{Y})=\text{constant}\times\theta^{\alpha_n-1}(1-\theta)^{\beta_n-1}\]
and
\[\log p(\theta|\mathbf{Y})=\text{constant}+(\alpha_n-1)\log\theta+(\beta_n-1)\log(1-\theta).\]
We now compute the first derivative,
\[\frac{d}{d\theta}\log p(\theta|\mathbf{Y})=\frac{\alpha_n-1}{\theta}-\frac{\beta_n-1}{1-\theta}.\]
Assume that \(\alpha_n,\beta_n>1\) . Then, the MAP of \(\theta\) , \(\hat{\theta}\) , satisfies
\[\begin{split}
\begin{align*}
\frac{1-\hat{\theta}}{\hat{\theta}} & = \frac{\beta_n-1}{\alpha_n-1} \\\\
\Rightarrow \frac{1}{\hat{\theta}} & = \frac{\beta_n-1}{\alpha_n-1}+1 \\\\
\Rightarrow \hat{\theta} & = \frac{\alpha_n-1}{\alpha_n+\beta_n-2},
\end{align*}
\end{split}\]
which, of course, is the mode of a \(\textsf{Beta}(\alpha_n,\beta_n)\) distribution.
Now we calculate the second derivative and evaluate it at the MAP,
\[\left[\frac{d^2}{d\theta^2}\log p(\theta|\mathbf{Y})\right]_{\hat{\theta}}=-\frac{\alpha_n-1}{\hat{\theta}^2}-\frac{\beta_n-1}{(1-\hat{\theta})^2},\]
note that
\[1-\hat{\theta}=\hat{\theta}\left(\frac{\beta_n-1}{\alpha_n-1}\right),\]
then
\[\begin{split}
\begin{align*}
\left[\frac{d^2}{d\theta^2}\log p(\theta|\mathbf{Y})\right]_{\hat{\theta}} & = -\frac{\alpha_n-1}{\hat{\theta}^2}-\frac{(\alpha_n-1)^2}{\hat{\theta}^2(\beta_n-1)} \\\\
& = -\frac{\alpha_n-1}{\hat{\theta}^2}\left(1+\frac{\alpha_n-1}{\beta_n-1}\right) \\\\
& = -\frac{\alpha_n-1}{\hat{\theta}^2}\left(\frac{\alpha_n+\beta_n-2}{\beta_n-1}\right) \\\\
& = -\frac{\alpha_n+\beta_n-2}{\hat{\theta}(1-\hat{\theta})},
\end{align*}
\end{split}\]
where the last equality is obtained by noting that
\[\frac{\alpha_n-1}{\beta_n-1}=\frac{\hat{\theta}}{1-\hat{\theta}}.\]
Thus,
\[\theta|\mathbf{Y}\overset{\cdot}{\sim}\textsf{Normal}\left(\hat{\theta},\frac{\hat{\theta}(1-\hat{\theta})}{\alpha_n+\beta_n-2}\right),\]
where
\[\hat{\theta}=\frac{\alpha_n-1}{\alpha_n+\beta_n-2}.\]
If we set \(\alpha=1\) and \(\beta=1\) , then \(\alpha_n=1+\sum_{i=1}^n Y_i\) , \(\beta_n=1+n-\sum_{i=1}^n Y_i\) . So, \(\hat{\theta}=\bar{Y}\) and
\[\theta|\mathbf{Y}\overset{\cdot}{\sim}\textsf{Normal}\left(\hat{\theta},\frac{\hat{\theta}(1-\hat{\theta})}{n}\right).\]
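As a quick check, the following sketch (with hypothetical data) compares the exact \(\textsf{Beta}(\alpha_n,\beta_n)\) posterior with this normal approximation under the uniform prior \(\alpha=\beta=1\).

```python
import numpy as np
from scipy import stats

# Hypothetical data: n Bernoulli trials with a given number of successes, uniform prior.
n, successes = 50, 18
alpha_n, beta_n = 1 + successes, 1 + n - successes

# Exact posterior and its normal approximation around the MAP.
exact = stats.beta(alpha_n, beta_n)
theta_hat = (alpha_n - 1) / (alpha_n + beta_n - 2)        # here equals ybar = 0.36
approx = stats.norm(loc=theta_hat,
                    scale=np.sqrt(theta_hat * (1 - theta_hat) / (alpha_n + beta_n - 2)))

# Compare a few quantiles of the exact and approximate posterior distributions.
for p in (0.025, 0.5, 0.975):
    print(p, exact.ppf(p), approx.ppf(p))
```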
In the notebook 07_NormalApproximation.ipynb in the course repository, there are two examples of the normal approximation to the posterior distribution, one for the Beta-Binomial model and the other for the Gamma-Exponential model.
7.4. Normal approximation for the normal likelihood considering the conjugate prior
As mentioned in Chapter 4, if \(Y_1,\ldots,Y_n|\mu,\sigma^2\overset{iid}{\sim}\textsf{Normal}(\mu,\sigma^2)\), then the conjugate prior is given by the Normal-Inverse-\(\chi^2\) (NI\(\chi^2\)) distribution:
\[\begin{split}
\begin{align*}
\mu|\sigma^2 &\sim \textsf{Normal}(\mu_0,\sigma^2/\kappa_0) \\
\sigma^2 &\sim \textsf{Inverse}-\chi^2(\nu_0,\sigma_0^2),
\end{align*}
\end{split}\]
whose posterior distributions are given by
\[\begin{split}
\begin{align*}
\mu|\sigma^2,\mathbf{Y} &\sim \textsf{Normal}(\mu_n,\sigma^2/\kappa_n)\\
\sigma^2|\mathbf{Y} &\sim \textsf{Inverse}-\chi^2(\nu_n,\sigma_n^2),
\end{align*}
\end{split}\]
where
\[\begin{split}
\begin{align*}
\mu_n &= \frac{\kappa_0}{\kappa_0+n}\mu_0 + \frac{n}{\kappa_0+n}\bar{y} \\
\kappa_n &= \kappa_0+n \\
\nu_n &= \nu_0+n \\
\nu_n\sigma_n^2 &= \nu_0\sigma_0^2+(n-1)s^2+\frac{\kappa_0 n}{\kappa_0+n}(\bar{y}-\mu_0)^2.
\end{align*}
\end{split}\]
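These updates translate directly into code; the sketch below (with an illustrative function name and made-up data) computes \((\mu_n,\kappa_n,\nu_n,\sigma_n^2)\) from a sample and the prior hyperparameters.

```python
import numpy as np

def normal_inv_chi2_posterior(y, mu0, kappa0, nu0, sigma0_sq):
    """Posterior hyperparameters of the Normal-Inverse-chi^2 conjugate model,
    transcribing the update formulas above (names are illustrative)."""
    y = np.asarray(y, dtype=float)
    n, ybar, s_sq = len(y), y.mean(), y.var(ddof=1)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    nu_n = nu0 + n
    sigma_n_sq = (nu0 * sigma0_sq + (n - 1) * s_sq
                  + kappa0 * n / kappa_n * (ybar - mu0) ** 2) / nu_n
    return mu_n, kappa_n, nu_n, sigma_n_sq

# Hypothetical usage with made-up data and weakly informative hyperparameters.
y = np.array([2.1, 1.7, 2.9, 2.4, 1.9, 2.6])
print(normal_inv_chi2_posterior(y, mu0=0.0, kappa0=1.0, nu0=1.0, sigma0_sq=1.0))
```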
Multiplying the conditional posterior of \(\mu\) and the marginal posterior of \(\sigma^2\), we obtain
\[\begin{split}
\begin{align*}
p(\mu,\sigma^2|\mathbf{Y})= & \text{constant} \times(\sigma^2)^{-1/2}\exp\left\{-\frac{\kappa_n}{2\sigma^2}(\mu-\mu_n)^2\right\} \\\\
& \times(\sigma^2)^{-(\nu_n/2+1)}\exp\left\{-\frac{\nu_n\sigma_n^2}{2\sigma^2}\right\}1_{\mathbb{R}}(\mu)1_{(0,\infty)}(\sigma^2),
\end{align*}
\end{split}\]
so
\[\log p(\mu,\sigma^2|\mathbf{Y})=\text{constant}-\left(\frac{\nu_n+3}{2}\right)\log(\sigma^2)-\frac{\kappa_n}{2\sigma^2}(\mu-\mu_n)^2-\frac{\nu_n\sigma_n^2}{2\sigma^2}.\]
We now calculate the first derivative of the log-posterior with respect to \(\mu\)
\[\frac{\partial}{\partial\mu}\log p(\mu,\sigma^2|\mathbf{Y})=-\frac{\kappa_n}{\sigma^2}(\mu-\mu_n),\]
from which we easily recognize that the MAP of \(\mu\) is given by \(\hat{\mu}=\mu_n\).
The first derivative of the log-posterior with respect to \(\sigma^2\) is given by
\[\frac{\partial}{\partial\sigma^2}\log p(\mu,\sigma^2|\mathbf{Y})=-\left(\frac{\nu_n+3}{2}\right)\frac{1}{\sigma^2}+\frac{\kappa_n}{2\sigma^4}(\mu-\mu_n)^2+\frac{\nu_n\sigma_n^2}{2\sigma^4}.\]
Then the MAP of \(\sigma^2\), \(\hat{\sigma}^2\), satisfies
\[-\left(\frac{\nu_n+3}{2}\right)\frac{1}{\hat\sigma^2}+\frac{\nu_n\sigma_n^2}{2\hat\sigma^4}=0,\]
from which we obtain
\[\hat{\sigma}^2=\frac{\nu_n}{\nu_n+3}\sigma_n^2.\]
We now calculate the second derivatives of the log-posterior:
\[\begin{split}
\begin{align*}
\frac{\partial^2}{\partial\mu^2}\log p(\mu,\sigma^2|\mathbf{Y}) & = -\frac{\kappa_n}{\sigma^2}, \\\\
\frac{\partial^2}{\partial\mu\,\partial\sigma^2}\log p(\mu,\sigma^2|\mathbf{Y}) & = \frac{\kappa_n}{\sigma^4}(\mu-\mu_n), \\\\
\frac{\partial^2}{\partial(\sigma^2)^2}\log p(\mu,\sigma^2|\mathbf{Y}) & = \frac{\nu_n+3}{2\sigma^4}-\frac{\kappa_n}{\sigma^6}(\mu-\mu_n)^2-\frac{\nu_n\sigma_n^2}{\sigma^6}.
\end{align*}
\end{split}\]
Evaluating these expressions at the MAP of the parameters, \((\hat{\mu},\hat{\sigma}^2)\), we get
\[\begin{split}
\begin{align*}
\left[\frac{\partial^2}{\partial\mu^2}\log p(\mu,\sigma^2|\mathbf{Y})\right]_{(\hat{\mu},\hat{\sigma}^2)} & = -\frac{\kappa_n}{\hat\sigma^2}, \\\\
\left[\frac{\partial^2}{\partial\mu\partial\sigma^2}\log p(\mu,\sigma^2|\mathbf{Y})\right]_{(\hat{\mu},\hat{\sigma}^2)} & = 0, \\\\
\left[\frac{\partial^2}{\partial(\sigma^2)^2}\log p(\mu,\sigma^2|\mathbf{Y})\right]_{(\hat{\mu},\hat{\sigma}^2)} & = \frac{\nu_n+3}{2\hat\sigma^4}-\frac{\nu_n\sigma_n^2}{\hat\sigma^6} \\\\
& = \frac{\nu_n+3}{2\hat\sigma^4}-\frac{(\nu_n+3)\hat{\sigma}^2}{\hat\sigma^6} \qquad\text{(since } \nu_n\sigma_n^2=(\nu_n+3)\hat{\sigma}^2\text{)} \\\\
& = -\frac{\nu_n+3}{2\hat{\sigma}^4}.
\end{align*}
\end{split}\]
Then,
\[\begin{split}\mu,\sigma^2|\mathbf{Y}\overset{\cdot}{\sim}\textsf{Normal}\left(
\begin{pmatrix}
\hat{\mu} \\ \hat{\sigma}^2
\end{pmatrix},
\begin{pmatrix}
\frac{\hat{\sigma}^2}{\kappa_n} & 0 \\ 0 & \frac{2}{\nu_n+3}\hat{\sigma}^4
\end{pmatrix}
\right),\end{split}\]
where
\[\hat{\mu}=\mu_n\text{ and }\hat{\sigma}^2=\frac{\nu_n}{\nu_n+3}\sigma_n^2.\]
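Putting the pieces together, the following sketch builds the mean vector and covariance matrix of this bivariate normal approximation and draws a few samples from it; the posterior hyperparameter values and the function name are hypothetical.

```python
import numpy as np
from scipy import stats

def normal_approx_conjugate(mu_n, kappa_n, nu_n, sigma_n_sq):
    """Mean vector and covariance matrix of the bivariate normal approximation
    to p(mu, sigma^2 | Y) derived above (names are illustrative)."""
    sigma_hat_sq = nu_n / (nu_n + 3) * sigma_n_sq           # MAP of sigma^2
    mean = np.array([mu_n, sigma_hat_sq])                   # (mu_hat, sigma_hat^2)
    cov = np.diag([sigma_hat_sq / kappa_n, 2.0 * sigma_hat_sq ** 2 / (nu_n + 3)])
    return mean, cov

# Hypothetical posterior hyperparameters (e.g. as returned by the update
# function sketched earlier in this chapter).
mean, cov = normal_approx_conjugate(mu_n=2.25, kappa_n=7.0, nu_n=7.0, sigma_n_sq=0.7)
approx = stats.multivariate_normal(mean=mean, cov=cov)
print(approx.rvs(size=3, random_state=0))   # a few draws from the approximation
```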
7.4.1. Normal approximation considering the reference prior
As discussed in Chapter 6 , the reference prior is given by
\[p(\mu,\sigma^2)\propto\frac{1}{\sigma^2}1_{\mathbb{R}}(\mu)1_{(0,\infty)}(\sigma^2),\]
which is obtained by setting the hyperparameters to the following values
\[\kappa_0=0,\quad \mu_0\in\mathbb{R},\quad\nu_0=-1,\quad\sigma_0^2=0,\]
then
\[\kappa_n=n,\quad \mu_n=\bar{y},\quad\nu_n=n-1,\quad\sigma_n^2=s^2,\]
with
\(s^2=\frac{1}{n-1}\sum_{i=1}^n(y_i-\bar{y})^2\) .
Therefore, if we take the reference prior, we get that
\[\begin{split}\mu,\sigma^2|\mathbf{Y}\overset{\cdot}{\sim}\textsf{Normal}\left(
\begin{pmatrix}
\bar{y} \\ \hat{\sigma}^2
\end{pmatrix},
\begin{pmatrix}
\frac{\hat{\sigma}^2}{n} & 0 \\ 0 & \frac{2}{n+2}\hat{\sigma}^4
\end{pmatrix}
\right),
\end{split}\]
where
\[\hat{\sigma}^2=\frac{n-1}{n+2}s^2.\]
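Finally, a minimal sketch of the reference-prior approximation built directly from hypothetical data, using \(\hat{\sigma}^2=\frac{n-1}{n+2}s^2\); the data and variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Normal approximation of p(mu, sigma^2 | Y) under the reference prior,
# built directly from (hypothetical) data.
y = np.array([2.1, 1.7, 2.9, 2.4, 1.9, 2.6])
n, ybar, s_sq = len(y), y.mean(), y.var(ddof=1)

sigma_hat_sq = (n - 1) / (n + 2) * s_sq                  # MAP of sigma^2
mean = np.array([ybar, sigma_hat_sq])                    # (ybar, sigma_hat^2)
cov = np.diag([sigma_hat_sq / n, 2.0 * sigma_hat_sq ** 2 / (n + 2)])

approx = stats.multivariate_normal(mean=mean, cov=cov)
print(mean, cov)
```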