Objectives

The main objectives of this exercise are fulfilled if you are able to derive the posterior, the prior predictive and the posterior predictive distributions of the Binomial model under a conjugate Beta prior, and to approximate the same quantities under a non-conjugate prior (a finite mixture of Betas) via SIR-based Monte Carlo.

Binomial model

As in previous examples, let us assume that \(x_1,\ldots,x_n\) are i.i.d. Bernoulli trials with \(\theta\) the probability of success, i.e. \[ Pr(x_i=1|\theta)=\theta \ \ \ \mbox{and} \ \ \ Pr(x_i=0|\theta)=1-\theta. \] Therefore, given \(\theta\), \(y=x_1+\cdots+x_n\) is Binomial with \(n\) trials and probability of success \(\theta\), denoted by \(y \sim Bin(n,\theta)\). Its probability mass function is given by \[ Pr(y|\theta) = \frac{n!}{y!(n-y)!}\theta^y(1-\theta)^{n-y},\qquad y=0,1,\ldots,n. \]
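As a quick numerical sanity check, the pmf above can be evaluated directly in R and compared with dbinom; the values \(n=6\) and \(\theta=0.7\) below are arbitrary illustration choices.

# Evaluate the Binomial pmf directly and compare with dbinom (illustrative n=6, theta=0.7)
n     = 6
theta = 0.7
y     = 0:n
pmf   = factorial(n)/(factorial(y)*factorial(n-y))*theta^y*(1-theta)^(n-y)
cbind(y,pmf,dbinom(y,n,theta))   # the two probability columns agree
sum(pmf)                         # probabilities add up to one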

Likelihood function

The likelihood of \(\theta\) in the binomial model above is given by \[ L(\theta|y) \propto \theta^y(1-\theta)^{n-y}, \qquad \mbox{for} \ \theta \in (0,1), \] which, viewed as a function of \(\theta\), is the kernel of a Beta distribution with parameters \(y+1\) and \(n-y+1\), i.e. \(Beta(y+1,n-y+1)\). It is also relatively easy to see that the maximum likelihood estimator of \(\theta\) is the sample proportion of successes: \[ {\hat \theta}_{mle} = \frac{y}{n} = \frac{x_1+\cdots+x_n}{n}. \]
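As an illustration (a minimal sketch, re-using the same data \(n=6\) and \(y=4\) that appear in the R code further below), the unnormalized likelihood can be plotted and the MLE marked:

# Plot the unnormalized likelihood and mark the MLE y/n (illustrative n=6, y=4)
n = 6
y = 4
curve(x^y*(1-x)^(n-y),from=0,to=1,n=1000,xlab=expression(theta),ylab="Likelihood (unnormalized)",lwd=2)
abline(v=y/n,lty=2)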

Prior distribution

Recall that \(\theta \sim Beta(\alpha,\beta)\), for \(\alpha,\beta>0\), if and only if its p.d.f. is given by \[ p(\theta|\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad \theta \in (0,1). \] Moreover, \[ E(\theta|\alpha,\beta)=\frac{\alpha}{\alpha+\beta} \ \ \ \mbox{and} \ \ \ V(\theta|\alpha,\beta)=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}. \] In the following exercise, we assume that \(\theta\) is, a priori, Beta with parameters \(a_0\) and \(b_0\), i.e. \(\theta \sim Beta(a_0,b_0)\).
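The mean and variance formulas can be checked quickly by Monte Carlo; the hyperparameters a0=2 and b0=2 below are arbitrary illustration choices.

# Monte Carlo check of the Beta mean and variance (illustrative a0=2, b0=2)
a0    = 2
b0    = 2
draws = rbeta(100000,a0,b0)
c(mean(draws),a0/(a0+b0))                      # empirical vs exact mean
c(var(draws),a0*b0/((a0+b0)^2*(a0+b0+1)))      # empirical vs exact variance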

Posterior inference

Since the likelihood function has the form of a Beta kernel and the prior is a Beta distribution, it follows that the posterior of \(\theta\) is also a Beta distribution. We only have to collect the exponents of \(\theta\) and of \(1-\theta\), i.e. \[ \theta|y \sim Beta(a_0+y,b_0+n-y) \equiv Beta(a_1,b_1), \] such that \[ E(\theta|y,a_0,b_0)=\frac{a_0+y}{a_0+b_0+n} \ \ \ \mbox{and} \ \ \ V(\theta|y,a_0,b_0)=\frac{(a_0+y)(b_0+n-y)}{(a_0+b_0+n)^2(a_0+b_0+n+1)}. \] Notice that the posterior variance goes to zero as \(n\) goes to infinity, so the posterior concentrates around a single point.
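To see this concentration numerically, the sketch below holds the sample proportion fixed at 2/3 and lets \(n\) grow; the choice a0=b0=2 is again purely illustrative.

# Posterior variance V(theta|y) shrinks to zero as n grows (sample proportion fixed at 2/3)
a0 = 2
b0 = 2
for (n in c(10,100,1000,10000)){
  y  = round(2*n/3)
  a1 = a0+y
  b1 = b0+n-y
  print(c(n,a1*b1/((a1+b1)^2*(a1+b1+1))))
}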

Prior predictive

The prior predictive is the expectation of the likelihood with respect to the prior distribution of \(\theta\).
It is easy to show that (you should prove this as an exercise in Bayesian algebra): \[\begin{eqnarray*} Pr(y) &=& \int_0^1 Pr(y|\theta)p(\theta|a_0,b_0)d\theta\\ &=& \int_0^1 \frac{n!}{y!(n-y)!}\theta^y(1-\theta)^{n-y} \frac{\Gamma(a_0+b_0)}{\Gamma(a_0)\Gamma(b_0)} \theta^{a_0-1}(1-\theta)^{b_0-1}d\theta\\ &=& \frac{n!}{y!(n-y)!} \times \frac{\Gamma(a_0+b_0)}{\Gamma(a_0)\Gamma(b_0)} \times \frac{\Gamma(a_0+y)\Gamma(b_0+n-y)}{\Gamma(a_0+b_0+n)}, \qquad \mbox{for} \ y=0,1,\ldots,n. \end{eqnarray*}\] When \(a_0=b_0=1\), i.e. when the prior of \(\theta\) is the uniform distribution on \((0,1)\), it follows that \[ Pr(y) = \frac{1}{n+1}, \qquad \mbox{for} \ y=0,1,\ldots,n, \] i.e. the prior predictive of \(y\) is discrete uniform on \(\{0,1,\ldots,n\}\).
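Besides proving the identity analytically, you can also verify it by numerical integration; the values n=6, y=4 and a0=b0=2 below are illustrative only.

# Numerical check of the prior predictive Pr(y) (illustrative n=6, y=4, a0=2, b0=2)
n  = 6
y  = 4
a0 = 2
b0 = 2
integrand = function(theta) dbinom(y,n,theta)*dbeta(theta,a0,b0)
c(integrate(integrand,0,1)$value,
  choose(n,y)*beta(a0+y,b0+n-y)/beta(a0,b0))
# Uniform prior (a0=b0=1): Pr(y)=1/(n+1) for every y
sapply(0:n,function(yy) integrate(function(theta) dbinom(yy,n,theta),0,1)$value)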

Posterior predictive

Similarly, the posterior predictive is the expectation of the likelihood of a new observation, say \(x_{n+1}\), with respect to the posterior distribution of \(\theta\). Again, it is easy to show that (you should prove this too!): \[ Pr(x_{n+1}|y) = \int_0^1 \theta^{x_{n+1}}(1-\theta)^{1-x_{n+1}}p(\theta|y)d\theta = \left\{ \begin{array}{cc} \frac{b_0+n-y}{a_0+b_0+n}=\frac{b_1}{a_1+b_1}=1-E(\theta|y) & \mbox{if} \ x_{n+1}=0\\ \frac{a_0+y}{a_0+b_0+n}=E(\theta|y) & \mbox{if} \ x_{n+1}=1 \end{array} \right. . \] To make this exercise more realistic, instead of one single new observation \(x_{n+1}\), you could derive the posterior predictive of \({\tilde y} = x_{n+1}+x_{n+2}+\cdots+x_{n+{\tilde n}}\) for a new set of i.i.d. Bernoullis of size \({\tilde n}\). The pair \(({\tilde n},{\tilde y})\) would be a testing sample, while \((n,y)\) would be the training sample. The posterior predictive would be \[\begin{eqnarray*} Pr({\tilde y}|y,n,{\tilde n}) &=& \int_0^1 p({\tilde y}|{\tilde n},\theta)p(\theta|n,y)d\theta\\ &=& \frac{{\tilde n}!}{{\tilde y}!({\tilde n}-{\tilde y})!} \times \frac{\Gamma(a_0+b_0+n)}{\Gamma(a_0+y)\Gamma(b_0+n-y)} \times \frac{\Gamma(a_0+y+{\tilde y})\Gamma(b_0+n+{\tilde n}-y-{\tilde y})}{\Gamma(a_0+b_0+n+{\tilde n})}, \qquad {\tilde y}=0,1,\ldots,{\tilde n}. \end{eqnarray*}\]
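A small sketch of this beta-binomial posterior predictive of \({\tilde y}\), assuming illustrative values n=6, y=4, a0=b0=2 and a test sample of size \({\tilde n}=10\):

# Beta-binomial posterior predictive of ytilde (illustrative n=6, y=4, a0=2, b0=2, ntilde=10)
n      = 6
y      = 4
a0     = 2
b0     = 2
ntilde = 10
a1     = a0+y
b1     = b0+n-y
ytilde = 0:ntilde
pred   = choose(ntilde,ytilde)*beta(a1+ytilde,b1+ntilde-ytilde)/beta(a1,b1)
sum(pred)                               # probabilities sum to one
c(sum(ytilde*pred),ntilde*a1/(a1+b1))   # predictive mean equals ntilde*E(theta|y)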

n  = 6    # number of Bernoulli trials
y  = 4    # number of successes
a0 = 2
b0 = 2
abs = matrix(c(2.5,2.5,2,4),2,2,byrow=TRUE)   # two prior settings: each row holds (a0,b0)
par(mfrow=c(1,2))
for (i in 1:2){
  a0 = abs[i,1]
  b0 = abs[i,2]
  # Maximum likelihood inference
  th.mle = y/n

  # Prior predictive
  py = choose(n,y)/beta(a0,b0)*beta(a0+y,b0+n-y)   # prior predictive Pr(y)
  print(py)

  # Bayesian inference
  a1     = a0+y
  b1     = b0+n-y
  mean   = (a0+y)/(a0+b0+n)            # posterior mean
  median = (a0+y-1/3)/(a0+b0+n-2/3)    # approximate posterior median
  mode   = (a0+y-1)/(a0+b0+n-2)        # posterior mode
  print(c(th.mle,mean,median,mode))

  # Posterior predictive
  px = (a0+y)/(a0+b0+n)                # posterior predictive Pr(x_{n+1}=1|y) = E(theta|y)
  print(px)

  curve(dbeta(x,a1,b1),from=0,to=1,n=1000,xlab=expression(theta),ylab="Density",lwd=2)
  curve(dbeta(x,a0,b0),from=0,to=1,n=1000,col=2,add=TRUE,lwd=2)
  curve(dbeta(x,y+1,n-y+1),from=0,to=1,n=1000,col=3,add=TRUE,lwd=2)
  abline(v=th.mle,col=6)
  abline(v=mean,lty=2,col=3,lwd=2)
  abline(v=median,lty=2,col=4,lwd=2)
  abline(v=mode,lty=2,col=5,lwd=2)
  legend("topright",legend=c("Prior","Likelihood","Posterior"),col=c(2,3,1),lty=1,bty="n",lwd=2)
  title(paste("Data/model: n=",n," - y=",y,"\n Prior: a0=",a0," - b0=",b0,sep=""))
  legend("topleft",legend=c(paste("Pr(y)=",round(py,3),sep=""),paste("Pr(xnew=1|y)=",round(px,3),sep="")),bty="n")
}

Comparing two alternative prior specifications

Here we compare prior and posterior predictive probabilities based on two priors: \[\begin{eqnarray*} \mbox{Prior 1} &:& Beta(a_0,b_0)\\ \mbox{Prior 2} &:& \pi_1 Beta(a_{01},b_{01}) + \pi_2 Beta(a_{02},b_{02})+ (1-\pi_1-\pi_2) Beta(a_{03},b_{03})\\ \end{eqnarray*}\] Posterior and predictive quantities under Prior 2 are obtained via Monte Carlo approximation based on sampling importance resampling (SIR): draws from the prior are weighted by the likelihood and resampled to approximate the posterior, while the average of the likelihood over the prior draws approximates the prior predictive \(Pr(y)\).

Prior 1 - closed form solution

n  = 6
y  = 4
a0 = 2
b0 = 8
a1 = a0+y
b1 = b0+n-y
py = choose(n,y)/beta(a0,b0)*beta(a1,b1)   # prior predictive Pr(y) under Prior 1
px = a1/(a1+b1)                            # posterior predictive Pr(xnew=1|y) under Prior 1

Prior 2 - Finite mixture of Beta distributions

# Prior 2: three-component mixture of Betas with weights 0.3, 0.5 and 0.2
prior = function(theta){
  0.3*dbeta(theta,10,30)+0.5*dbeta(theta,5,5)+0.2*dbeta(theta,8,2)
}
# Binomial likelihood of theta for the observed (n,y)
like = function(theta){
  dbinom(y,n,theta)
}

thetas = seq(0,1,length=1000)
par(mfrow=c(1,1))
plot(thetas,prior(thetas),xlab=expression(theta),ylab="Density",xlim=c(0,1),main="Prior 2 of theta",type="l",lwd=3)

SIR scheme

set.seed(1214353)
M     = 100000
# Draw M thetas from the mixture prior: first the component indicator, then the Beta draw
ind   = sample(1:3,size=M,replace=TRUE,prob=c(0.3,0.5,0.2))
nn    = c(sum(ind==1),sum(ind==2),sum(ind==3))
theta.draw = rep(0,M)
theta.draw[ind==1] = rbeta(nn[1],10,30)
theta.draw[ind==2] = rbeta(nn[2],5,5)
theta.draw[ind==3] = rbeta(nn[3],8,2)
# SIR: weight the prior draws by the likelihood and resample to approximate the posterior
w = like(theta.draw)
theta.post = sample(theta.draw,size=M,replace=TRUE,prob=w)

py1 = mean(like(theta.draw))   # MC estimate of the prior predictive Pr(y) under Prior 2

px1 = mean(dbinom(1,1,theta.post))   # MC estimate of Pr(xnew=1|y) under Prior 2 (mean of theta.post)
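One way to gauge the Monte Carlo error of this scheme is to apply the same SIR steps to Prior 1, whose prior and posterior predictive probabilities were computed in closed form above; the sketch below re-uses M, like, a0, b0, py and px from the previous chunks.

# Sanity check: SIR applied to Prior 1 should reproduce the closed-form py and px
theta.draw1 = rbeta(M,a0,b0)
w1          = like(theta.draw1)
theta.post1 = sample(theta.draw1,size=M,replace=TRUE,prob=w1)
c(mean(w1),py)               # MC vs exact prior predictive Pr(y)
c(mean(theta.post1),px)      # MC vs exact posterior predictive Pr(xnew=1|y)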

Graphical summary

thetas = seq(0,1,length=1000)
par(mfrow=c(1,1))
# Posterior under Prior 2 (SIR draws) as a histogram
hist(theta.post,prob=TRUE,breaks=50,xlab=expression(theta),xlim=c(0,1),main="",ylim=c(0,3.5))
# Prior 2 density (dotted black) and Prior 1 posterior/prior (solid/dotted red)
lines(thetas,prior(thetas),lwd=3,lty=3)
curve(dbeta(x,a1,b1),from=0,to=1,n=1000,lwd=3,col=2,add=TRUE)
curve(dbeta(x,a0,b0),from=0,to=1,n=1000,lwd=3,col=2,add=TRUE,lty=3)
points(y/n,0,pch=16,cex=2,col=3)   # MLE y/n
# Bayes factor of Prior 2 relative to Prior 1: ratio of prior predictives Pr(y)
title(paste("Bayes factor = ",round(py1,3),"/",round(py,3),"=",round(py1/py,3),sep=""))
legend("topright",legend=c(paste("Pr(xnew=1|y,Prior 1)=",round(px,3),sep=""),
paste("Pr(xnew=1|y,Prior 2)=",round(px1,3),sep="")),bty="n")