The main objectives of this exercise are fulfilled if you
Understand that the likelihood is the carrier of the data summaries;
Understand that prior and likelihood are sufficient to obtain posterior, prior predictive and posterior predictive;
Understand that most of the time the above derivations are not available in closed form (as with Prior 2 below);
Understand that there is no Bayesian MCMC model! There is a model and a prior (therefore a Bayesian!) who might use MCMC (or other MC-based tools) to approximate posterior summaries.
Understand that all models (and priors) are wrong, but some of them are useful!
Understand that all data-driven, model-based scientific inquiries are invariably tackled the same way from the Bayesian viewpoint.
As in previous examples, let us assume that \(x_1,\ldots,x_n\) are i.i.d. Bernoulli trials with \(\theta\) the probability of success, i.e. \[ Pr(x_i=1|\theta)=\theta \ \ \ \mbox{and} \ \ \ Pr(x_i=0|\theta)=1-\theta. \] Therefore, it is easy to see that, given \(\theta\), \(y=x_1+\cdots+x_n\) is Binomial with \(n\) trials and probability of success \(\theta\), denoted by \(y \sim Bin(n,\theta)\). Its probability mass function is given by \[ Pr(y|\theta) = \frac{n!}{y!(n-y)!}\theta^y(1-\theta)^{n-y},\qquad y=0,1,\ldots,n. \]
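As a quick sanity check, the pmf above is exactly what R's dbinom computes; the values of \(n\) and \(\theta\) below are arbitrary illustrations.
n = 6; theta = 0.7; y = 0:n
pmf = factorial(n)/(factorial(y)*factorial(n-y))*theta^y*(1-theta)^(n-y)
all.equal(pmf,dbinom(y,n,theta))    # TRUE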
The likelihood of \(\theta\) in the binomial model above is given by \[ L(\theta|y) \propto \theta^y(1-\theta)^{n-y}, \qquad \mbox{for} \ \theta \in (0,1), \] which resembles the kernel of a Beta distribution for \(\theta\) with parameters \(y+1\) and \(n-y+1\), i.e. \(Beta(y+1,n-y+1)\). It is also relatively easy to see that the maximum likelihood estimator of \(\theta\) is the sample proportion of successes: \[ {\hat \theta}_{mle} = \frac{y}{n} = \frac{x_1+\cdots+x_n}{n} \]
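A minimal numerical sketch (with illustrative values \(n=6\) and \(y=4\)) confirming that maximizing the log-likelihood recovers the sample proportion:
n = 6; y = 4
loglike = function(theta) dbinom(y,n,theta,log=TRUE)
opt = optimize(loglike,interval=c(0,1),maximum=TRUE)
c(numerical=opt$maximum,closed.form=y/n)    # both approximately 0.667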
Recall that \(\theta \sim Beta(\alpha,\beta)\), for \(\alpha,\beta>0\), if and only if its p.d.f. is given by \[ p(\theta|\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha-1}(1-\theta)^{\beta-1}. \] Also, \[ E(\theta|\alpha,\beta)=\frac{\alpha}{\alpha+\beta} \ \ \ \mbox{and} \ \ \ V(\theta|\alpha,\beta)=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}. \] In the following exercise, we assume that \(\theta\) is, a priori, Beta with parameters \(a_0\) and \(b_0\), i.e. \(\theta \sim Beta(a_0,b_0)\).
Since the likelihood function looks like the kernel of a Beta distribution and the prior is a Beta distribution, it follows that the posterior of \(\theta\) is also a Beta distribution. We only have to collect the exponents of \(\theta\) and of \(1-\theta\), i.e. \[ \theta|y \sim Beta(a_0+y,b_0+n-y) \equiv Beta(a_1,b_1), \] such that \[ E(\theta|y,a_0,b_0)=\frac{a_0+y}{a_0+b_0+n} \ \ \ \mbox{and} \ \ \ V(\theta|y,a_0,b_0)=\frac{(a_0+y)(b_0+n-y)}{(a_0+b_0+n)^2(a_0+b_0+n+1)}. \] Notice that the posterior variance goes to zero (therefore the whole posterior concentrates at a point!) as \(n\) goes to infinity.
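The sketch below (assuming a Beta(2,2) prior and a sample proportion held at roughly 2/3 while \(n\) grows) illustrates this shrinkage of the posterior:
a0 = 2; b0 = 2
for (n in c(10,100,1000,10000)){
  y = round(2*n/3)
  a1 = a0+y
  b1 = b0+n-y
  cat("n =",n," E(theta|y) =",round(a1/(a1+b1),4),
      " V(theta|y) =",round(a1*b1/((a1+b1)^2*(a1+b1+1)),6),"\n")
}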
The prior predictive is the expectation of the likelihood with respect to the prior distribution of \(\theta\).
It is easy to show that (you should prove this as an exercise in Bayesian algebra): \[\begin{eqnarray*}
Pr(y) &=& \int_0^1 Pr(y|\theta)p(\theta|a_0,b_0)d\theta\\
&=& \int_0^1 \frac{n!}{y!(n-y)!}\theta^y(1-\theta)^{n-y}
\frac{\Gamma(a_0+b_0)}{\Gamma(a_0)\Gamma(b_0)} \theta^{a_0-1}(1-\theta)^{b_0-1}d\theta\\
&=& \frac{n!}{y!(n-y)!} \times \frac{\Gamma(a_0+b_0)}{\Gamma(a_0)\Gamma(b_0)} \times \frac{\Gamma(a_0+y)\Gamma(b_0+n-y)}{\Gamma(a_0+b_0+n)}, \qquad \mbox{for} \ y=0,1,\ldots,n.
\end{eqnarray*}\] When \(a_0=b_0=1\), i.e. the prior of \(\theta\) is the uniform on \((0,1)\), it follows that \[
Pr(y) = \frac{1}{n+1}, \qquad \mbox{for} \ y=0,1,\ldots,n,
\] or the discrete uniform prior predictive on \(y\).
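This result is easy to check numerically. The sketch below (with the same \(n=6\) and the uniform prior \(a_0=b_0=1\); the number of prior draws is arbitrary) computes the prior predictive both from the closed-form Beta functions and as a Monte Carlo average of the likelihood over prior draws:
n = 6; a0 = 1; b0 = 1
pred.exact = choose(n,0:n)*beta(a0+0:n,b0+n-0:n)/beta(a0,b0)
round(pred.exact,4)                  # all equal to 1/(n+1)
theta.prior = rbeta(100000,a0,b0)    # draws from the uniform prior
pred.mc = sapply(0:n,function(y) mean(dbinom(y,n,theta.prior)))
round(pred.mc,4)                     # close to the exact values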
Similarly, the posterior predictive is the expectation of the likelihood of a new observation, say \(x_{n+1}\), with respect to the posterior distribution of \(\theta\). Again, it is easy to show that (you should prove this too!): \[ Pr(x_{n+1}|y) = \int_0^1 \theta^{x_{n+1}}(1-\theta)^{1-x_{n+1}}p(\theta|y)d\theta = \left\{ \begin{array}{cc} \frac{b_0+n-y}{a_0+b_0+n}=\frac{b_1}{a_1+b_1}=1-E(\theta|y) & \mbox{if} \ x_{n+1}=0\\ \frac{a_0+y}{a_0+b_0+n}=E(\theta|y) & \mbox{if} \ x_{n+1}=1 \end{array} \right. . \] If you want to make this exercise more realistic, instead of one and only one observation \(x_{n+1}\), you could figure out the posterior predictive of \({\tilde y} = x_{n+1}+x_{n+2}+\cdots+x_{n+{\tilde n}}\) for a new set of i.i.d. Bernoullis of size \({\tilde n}\). The pair \(({\tilde n},{\tilde y})\) would be a testing sample, while \((n,y)\) would be the training sample. The posterior predictive would be \[\begin{eqnarray*} Pr({\tilde y}|y,n,{\tilde n}) &=& \int_0^1 p({\tilde y}|{\tilde n},\theta)p(\theta|n,y)d\theta\\ &=& \frac{{\tilde n}!}{{\tilde y}!({\tilde n}-{\tilde y})!} \times \frac{\Gamma(a_0+b_0+n)}{\Gamma(a_0+y)\Gamma(b_0+n-y)} \times \frac{\Gamma(a_0+y+{\tilde y})\Gamma(b_0+n+{\tilde n}-y-{\tilde y})}{\Gamma(a_0+b_0+n+{\tilde n})}, \qquad {\tilde y}=0,1,\ldots,{\tilde n}. \end{eqnarray*}\]
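A short sketch of this posterior predictive (a beta-binomial distribution) in R; the values of \(n\), \(y\), \(a_0\), \(b_0\) and the test-sample size \({\tilde n}\) below are illustrative assumptions:
n = 6; y = 4; a0 = 2; b0 = 2; n.new = 10
a1 = a0+y
b1 = b0+n-y
y.new = 0:n.new
pred = choose(n.new,y.new)*beta(a1+y.new,b1+n.new-y.new)/beta(a1,b1)
round(pred,4)
sum(pred)    # the probabilities add up to one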
n = 6
y = 4
a0 = 2
b0 = 2
abs = matrix(c(2.5,2.5,2,4),2,2,byrow=TRUE)
par(mfrow=c(1,2))
for (i in 1:2){
  a0 = abs[i,1]
  b0 = abs[i,2]
  # Maximum likelihood inference
  th.mle = y/n
  # Prior predictive
  py = choose(n,y)/beta(a0,b0)*beta(a0+y,b0+n-y)
  print(py)
  # Bayesian inference
  a1 = a0+y
  b1 = b0+n-y
  mean = (a0+y)/(a0+b0+n)
  median = (a0+y-1/3)/(a0+b0+n-2/3)
  mode = (a0+y-1)/(a0+b0+n-2)
  print(c(th.mle,mean,median,mode))
  # Posterior predictive
  px = (a0+y)/(a0+b0+n)
  print(px)
  curve(dbeta(x,a1,b1),from=0,to=1,n=1000,xlab=expression(theta),ylab="Density",lwd=2)
  curve(dbeta(x,a0,b0),from=0,to=1,n=1000,col=2,add=TRUE,lwd=2)
  curve(dbeta(x,y+1,n-y+1),from=0,to=1,n=1000,col=3,add=TRUE,lwd=2)
  abline(v=th.mle,col=6)
  abline(v=mean,lty=2,col=3,lwd=2)
  abline(v=median,lty=2,col=4,lwd=2)
  abline(v=mode,lty=2,col=5,lwd=2)
  legend("topright",legend=c("Prior","Likelihood","Posterior"),col=c(2,3,1),lty=1,bty="n",lwd=2)
  title(paste("Data/model: n=",n," - y=",y,"\n Prior: a0=",a0," - b0=",b0,sep=""))
  legend("topleft",legend=c(paste("Pr(y)=",round(py,3),sep=""),paste("Pr(xnew=1|y)=",round(px,3),sep="")),bty="n")
}
Here we compare prior and posterior predictive probabilities based on two priors: \[\begin{eqnarray*} \mbox{Prior 1} &:& Beta(a_0,b_0)\\ \mbox{Prior 2} &:& \pi_1 Beta(a_{01},b_{01}) + \pi_2 Beta(a_{02},b_{02})+ (1-\pi_1-\pi_2) Beta(a_{03},b_{03})\\ \end{eqnarray*}\] Posterior and predictive quantities under Prior 2 are obtained via a SIR-based MC approximation.
n = 6
y = 4
a0 = 2
b0 = 8
a1 = a0+y
b1 = b0+n-y
py = choose(n,y)/beta(a0,b0)*beta(a1,b1)
px = a1/(a1+b1)
prior = function(theta){
  0.3*dbeta(theta,10,30)+0.5*dbeta(theta,5,5)+0.2*dbeta(theta,8,2)
}
like = function(theta){
  dbinom(y,n,theta)
}
thetas = seq(0,1,length=1000)
par(mfrow=c(1,1))
plot(thetas,prior(thetas),xlab=expression(theta),xlim=c(0,1),main="Prior 2 of theta",type="l",lwd=3)
set.seed(1214353)
M = 100000
ind = sample(1:3,size=M,replace=TRUE,prob=c(0.3,0.5,0.2))
nn = c(sum(ind==1),sum(ind==2),sum(ind==3))
theta.draw = rep(0,M)
theta.draw[ind==1] = rbeta(nn[1],10,30)
theta.draw[ind==2] = rbeta(nn[2],5,5)
theta.draw[ind==3] = rbeta(nn[3],8,2)
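# SIR step: weight each prior draw by its likelihood and resample with replacement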
w = like(theta.draw)
theta.post = sample(theta.draw,size=M,replace=TRUE,prob=w)
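# MC approximations: prior predictive Pr(y) and posterior predictive Pr(xnew=1|y)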
py1 = mean(like(theta.draw))
px1 = mean(dbinom(1,1,theta.post))
thetas = seq(0,1,length=1000)
par(mfrow=c(1,1))
hist(theta.post,prob=TRUE,breaks=50,xlab=expression(theta),xlim=c(0,1),main="",ylim=c(0,3.5))
lines(thetas,prior(thetas),lwd=3,lty=3)
curve(dbeta(x,a1,b1),from=0,to=1,n=1000,lwd=3,col=2,add=TRUE)
curve(dbeta(x,a0,b0),from=0,to=1,n=1000,lwd=3,col=2,add=TRUE,lty=3)
points(y/n,0,pch=16,cex=2,col=3)
title(paste("Bayes factor = ",round(py1,3),"/",round(py,3),"=",round(py1/py,3),sep=""))
legend("topright",legend=c(paste("Pr(xnew=1|y,Prior 1)=",round(px,3),sep=""),
paste("Pr(xnew=1|y,Prior 2)=",round(px1,3),sep="")),bty="n")
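As a sanity check (not part of the computation above), the same SIR scheme can be applied to Prior 1, \(Beta(a_0,b_0)\), where the posterior mean and the prior predictive are available in closed form; the MC approximations should agree with the exact values up to simulation error.
theta.draw1 = rbeta(M,a0,b0)
theta.post1 = sample(theta.draw1,size=M,replace=TRUE,prob=like(theta.draw1))
c(mc.post.mean=mean(theta.post1),exact=a1/(a1+b1))
c(mc.prior.pred=mean(like(theta.draw1)),exact=py)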