Bernoulli model

In class we talked about the Bernoulli model where, conditionally on \(\theta\), \[ x_1,\ldots,x_n \ \ \ \mbox{are i.i.d. Bernoulli}(\theta), \] for \(0 \leq \theta \leq 1\), such that \[ Pr(x_i|\theta)=\theta^{x_i}(1-\theta)^{1-x_i} \qquad \mbox{for} \ \ x_i=0,1, \] and \(i=1,\ldots,n\). Also, let \(s_n=\sum_{i=1}^n x_i\) be the number of successes out of \(n\) trials.
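Before turning to the likelihood, here is a minimal simulation sketch of this model (our own illustration; the seed and the value \(\theta=0.7\) are arbitrary):

# Simulate ten i.i.d. Bernoulli(0.7) trials and count the successes
set.seed(42)                          # arbitrary seed
x = rbinom(10, size = 1, prob = 0.7)  # Bernoulli draws x_1,...,x_10
sum(x)                                # the number of successes s_n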

Binomial model and likelihood function

It is relatively easy to show that \(s_n|\theta \sim Binomial(n,\theta)\), i.e. \[ Pr(s_n| \theta) = \frac{n!}{s_n!(n-s_n)!}\theta^{s_n}(1-\theta)^{n-s_n}, \ \mbox{for} \ s_n=0,1,2,\ldots,n, \] and that \(s_n=\sum_{i=1}^n x_i\) is a sufficient statistic for \(\theta\). The likelihood function is therefore \[ L(\theta|s_n) \propto \theta^{s_n}(1-\theta)^{n-s_n}, \] which resembles the kernel of a Beta distribution with parameters \(s_n+1\) and \(n-s_n+1\).
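As a quick sanity check (our own addition), multiplying the binomial likelihood by \(n+1\) yields exactly the \(Beta(s_n+1,n-s_n+1)\) density, since \((n+1)\frac{n!}{s_n!(n-s_n)!} = \frac{\Gamma(n+2)}{\Gamma(s_n+1)\Gamma(n-s_n+1)}\):

# (n+1) times the Binomial(n,theta) likelihood equals the Beta(sn+1,n-sn+1) density
th = c(0.2, 0.5, 0.7)                                         # arbitrary test points
all.equal((10+1)*dbinom(7, 10, th), dbeta(th, 7+1, 10-7+1))   # should be TRUE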

Data summaries and prior hyperparameters

Let us first introduce some constants:

n   = 10      # number of Bernoulli trials
sn  =  7      # observed number of successes
a   =  4      # Beta prior hyperparameter (Prior B)
b   =  2      # Beta prior hyperparameter (Prior B)
mu  =  0      # logit-normal prior mean (Prior C)
sig =  3      # logit-normal prior standard deviation (Prior C), so sigma^2 = 9
M   = 100000  # number of Monte Carlo draws

Now let us consider three prior specifications:

Prior specification

Let us first introduce the Beta distribution. If \(\theta \sim Beta(\alpha,\beta)\) then \[ p(\theta|\alpha,\beta)= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad \alpha,\beta>0, \] with moments \[ E(\theta) = \frac{\alpha}{\alpha+\beta} \ \ \ \mbox{and} \ \ \ V(\theta) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}. \] We are now ready to introduce three of the four prior specifications studied in this exercise:

  • Prior A: \(\theta \sim Uniform(0,1)\) (this is actually Prior B with \(a=b=1\)),

  • Prior B: \(\theta \sim Beta(a,b)\) (below we will assume the values \(a=4\) and \(b=2\)),

  • Prior C: \(\log(\theta/(1-\theta) )\sim Normal(\mu,\sigma^2)\) (below we will use \(\mu=0\) and \(\sigma^2=9\)).

It is easy to see that, for \(\theta \in (0,1)\), \[\begin{eqnarray*} p(\theta|A) &=& 1\\ p(\theta|B) &=& \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1}\\ p(\theta|C) &=& (2\pi\sigma^2)^{-1/2}\exp\left\{-\frac{1}{2\sigma^2}\left[\log\left(\frac{\theta}{1-\theta}\right)-\mu\right]^2\right\} \frac{1}{\theta(1-\theta)} \end{eqnarray*}\]
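The \(1/[\theta(1-\theta)]\) factor in \(p(\theta|C)\) is the Jacobian of the logit transformation, so Prior C is a proper density on \((0,1)\). A quick numerical sanity check (our own addition, not part of the exercise):

# Check that the logit-normal Prior C integrates to one over (0,1)
integrate(function(th) dnorm(log(th/(1-th)), mu, sig)/(th*(1-th)), 0, 1)
# the reported value should be very close to 1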

Predictive and posterior densities

When working on the following questions, you should assume that \(n=10\) Bernoulli trials lead to \(s_{10}=7\) successes.

a)+b) Derive \(p(\theta |s_n,A)\) and \(p(\theta |s_n,B)\)

Combining \(L(\theta|s_n)\) with \(p(\theta|A)\) or \(p(\theta|B)\): \[ p(\theta |s_n,A) \propto\theta^{s_n}(1-\theta)^{n-s_n} \ \ \ \mbox{and} \ \ \ p(\theta |s_n,B) \propto\theta^{(a+s_n)-1}(1-\theta)^{(b+n-s_n)-1}, \] which are the kernels of the \(Beta(s_n+1,n-s_n+1)\) and \(Beta(a+s_n,b+n-s_n)\) densities, respectively.

Let us compute these quantities:

a.A = sn+1
b.A = n-sn+1 
a.B = a+sn
b.B = b+n-sn

# Posterior means, variances and standard deviations
E.A  = a.A/(a.A+b.A)
V.A  = a.A*b.A/((a.A+b.A)^2*(a.A+b.A+1))
SD.A = sqrt(V.A)
E.B  = a.B/(a.B+b.B)
V.B  = a.B*b.B/((a.B+b.B)^2*(a.B+b.B+1))
SD.B = sqrt(V.B)

Therefore, \[ (\theta|s_{10}=7,A) \sim Beta(8,4) \ \ \ \mbox{and} \ \ \ (\theta|s_{10}=7,B) \sim Beta(11,5). \]

c) Derive \(Pr(s_n|A)\) and \(Pr(s_n|B)\)

These quantities are known scalars that represent the predictive densities of the observed data under the Bernoulli model and Prior A or Prior B, respectively. Since both Prior A and Prior B are conjugate to the likelihood function, the easiest way to compute both predictive densities is as follows (shown for Prior B): \[ Pr(s_n) = \frac{Pr(s_n|\theta)p(\theta)}{p(\theta|s_n)} = \frac{\left(\frac{n!}{s_n!(n-s_n)!}\theta^{s_n}(1-\theta)^{n-s_n}\right) \left(\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1}\right)} {\frac{\Gamma(a+b+n)}{\Gamma(a+s_n)\Gamma(b+n-s_n)}\theta^{(a+s_n)-1}(1-\theta)^{(b+n-s_n)-1}}= \frac{\frac{n!}{s_n!(n-s_n)!} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}} {\frac{\Gamma(a+b+n)}{\Gamma(a+s_n)\Gamma(b+n-s_n)}} \] Note that the identity holds for every \(\theta \in (0,1)\): all \(\theta\)-dependent terms cancel, leaving the constant on the right.

# Predictive density Pr(sn | n, a, b) under a Beta(a,b) prior
pred = function(sn,n,a,b){
  num = choose(n,sn)*gamma(a+b)/gamma(a)/gamma(b)
  den = gamma(a+b+n)/gamma(a+sn)/gamma(b+n-sn)
  return(num/den)
}
pred.A = pred(sn,n,1,1)
pred.B = pred(sn,n,a,b)
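For small \(n\) this works fine, but gamma() overflows in double precision once its argument exceeds 171, so for larger data sets a log-scale variant is safer. A sketch of a drop-in alternative (our own addition), using \(Pr(s_n)=\frac{n!}{s_n!(n-s_n)!}\,\frac{B(a+s_n,b+n-s_n)}{B(a,b)}\), where \(B(\cdot,\cdot)\) is the Beta function:

# Log-scale version of pred(), stable for large n (gamma(172) is Inf in R)
pred.log = function(sn, n, a, b){
  exp(lchoose(n, sn) + lbeta(a+sn, b+n-sn) - lbeta(a, b))
}
pred.log(sn, n, a, b)   # matches pred(sn,n,a,b) = 0.1598402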

The predictive densities based on Priors A and B are \[ Pr(s_{10}=7|A) = 0.0909091 \ \ \ \mbox{and} \ \ \ Pr(s_{10}=7|B) = 0.1598402. \] Note 1: Since Prior A is \(U(0,1)\), it is easy to see that \(Pr(s_n=k|A)=1/(n+1)\) for \(k=0,\ldots,n\), i.e. constant over all \(n+1\) possible values of \(s_n\). Therefore, \(s_n|A\) follows a discrete uniform distribution on the set \(\{0,1,\ldots,n\}\).

Note 2: \(Pr(s_{10}=7|A)<Pr(s_{10}=7|B)\) indicates that, when combined with the binomial model, Prior B assigns higher predictive value, or higher marginal likelihood (marginalized over \(\theta\)), to the observed data \(s_{10}=7\). These quantities are useful for model comparison: the Bayes factor is the ratio of marginal likelihoods, \[ \frac{Pr(s_{10}=7|A)}{Pr(s_{10}=7|B)}=\frac{0.0909091}{0.1598402} = 0.56875, \] which does not show overwhelming evidence in favor of Prior B over Prior A.

Bayes factors

For any Beta prior, \(Beta(a,b)\), and Binomial model, \(Binomial(n,\theta)\), we can write down the whole predictive density before observing any data: \[ Pr(s_n|n,a,b) = \frac{\frac{n!}{s_n!(n-s_n)!}\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}}{\frac{\Gamma(a+b+n)}{\Gamma(a+s_n)\Gamma(b+n-s_n)}}, \] for \(s_n=0,1,\ldots,n\). Here is a bit of R code:

sns     = 0:n
preds.A = pred(sns,n,1,1)
preds.B = pred(sns,n,4,2)
BF.AB   = preds.A/preds.B

par(mfrow=c(1,2))
plot(sns,preds.B,type="h",xlab="Successes (sn)",ylab="Predictive",lwd=2)
lines(sns+0.25,preds.A,col=2,lwd=2,type="h")
points(sn,0,col=3,pch=16,cex=2)
legend("topleft",legend=c("Predictive A","Predictive B"),
       col=2:1,lty=1,lwd=2,bty="n")
plot(sns,BF.AB,xlab="Successes (sn)",ylab="Bayes factor of A vs B",type="h",lwd=2)
abline(h=1,lty=2)

From these Bayes factors, we can argue that had we observed \(s_{10} \leq 2\) we might have been inclined to use Prior A instead of Prior B, while for \(s_{10}>2\) there is little evidence in either direction.
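To make the cutoff concrete, we can apply the common rule of thumb (an assumption on our part, not stated in the exercise) that a Bayes factor above 3 counts as positive evidence:

# Values of sn for which the Bayes factor favors Prior A by more than 3
sns[BF.AB > 3]   # returns 0 1 2, consistent with the cutoff above
round(BF.AB, 2)  # full table of Bayes factors A vs B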

d) Derive \(Pr(s_n|C)\) first and then \(p(\theta |s_n,C)\)

These are the complicated ones, as far as computation is concerned, since the prior density is not conjugate to the likelihood function. Therefore, you will need to approximate the denominator of \[ p(\theta |s_n,C) = \frac{p(s_n|\theta)p(\theta|C)}{\int_0^1 p(s_n|\theta)p(\theta|C)d\theta} = \frac{p(s_n|\theta)p(\theta|C)}{Pr(s_n|C)}. \] Approximate \(Pr(s_n|C)\) by simple Monte Carlo integration.

\[\begin{eqnarray*} I = Pr(s_n|C) &=& \int_0^1 p(s_n|\theta)p(\theta|C)d\theta\\ &=& \int_0^1 \frac{n!}{s_n!(n-s_n)!}\theta^{s_n}(1-\theta)^{n-s_n} (2\pi\sigma^2)^{-1/2}\exp\left\{-\frac{1}{2\sigma^2}\left[\log\left(\frac{\theta}{1-\theta}\right)-\mu\right]^2\right\} \frac{1}{\theta(1-\theta)} d\theta, \end{eqnarray*}\] which can be approximated by Monte Carlo integration by sampling \(\theta^{(1)},\ldots,\theta^{(M)}\) for large \(M\), say \(M=10^{5}\), from a proposal or candidate distribution. The natural choice in this context is the Uniform\((0,1)\); since its density is identically one on \((0,1)\), the importance-sampling estimator reduces to the simple average \[ \hat{I}_{MC} = \frac{1}{M} \sum_{i=1}^M p(s_n|\theta^{(i)})p(\theta^{(i)}|C). \] The law of large numbers guarantees that \(\hat{I}_{MC} \longrightarrow I\) as \(M \rightarrow \infty\).

priorC = function(theta,mu,sig){  # logit-normal (Prior C) density on (0,1)
  dnorm(log(theta/(1-theta)),mu,sig)/(theta*(1-theta))
}
set.seed(2468)
thetas   = runif(M)                                   # proposal draws from U(0,1)
eval     = dbinom(sn,n,thetas)*priorC(thetas,mu,sig)  # integrand evaluations
preds.C  = cumsum(eval)/1:M                           # running MC estimates
pred.C   = preds.C[M]                                 # final estimate of Pr(sn|C)
MC.error = sqrt(mean((eval-pred.C)^2)/M)              # Monte Carlo standard error
lim      = c(pred.C-2*MC.error,pred.C+2*MC.error)     # approximate 95% error band
c(pred.C,MC.error,lim)
## [1] 0.0586217835 0.0001925212 0.0582367411 0.0590068259
par(mfrow=c(1,1))
ts.plot(preds.C,xlab="Number of Monte Carlo simulations",ylab="Predictive C",ylim=range(preds.C[100:M],lim))
abline(h=pred.C,col=2)
abline(h=lim[1],col=2,lty=2)
abline(h=lim[2],col=2,lty=2)
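As an independent check on the Monte Carlo estimate (our own addition), one-dimensional adaptive quadrature gives essentially the same value:

# Deterministic quadrature check of Pr(sn|C)
integrate(function(th) dbinom(sn, n, th)*priorC(th, mu, sig), 0, 1)
# should agree with pred.C = 0.0586 to within the MC error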

e) Graphically compare the three priors and posteriors

theta = seq(0.001,0.999,length=1000)  # avoid 0 and 1, where priorC returns NaN
plot(theta,dbeta(theta,a.B,b.B),type="l",xlab=expression(theta),
     ylab="Density",lwd=2,col=2)
lines(theta,dbeta(theta,a.A,b.A),col=1,lwd=2)
lines(theta,dbeta(theta,a,b),col=2,lwd=2,lty=2)
lines(theta,dbeta(theta,1,1),col=1,lwd=2,lty=2)
lines(theta,priorC(theta,mu,sig),col=4,lwd=2,lty=2)
lines(theta,dbinom(sn,n,theta)*priorC(theta,mu,sig)/pred.C,col=4,lwd=2)
points(sn/n,0,pch=16,col=3,cex=2)
legend("top",legend=c("Prior A","Prior B","Prior C:","Posterior A","Posterior B",
       "Posterior C"),col=c(1,2,4,1,2,4),lwd=2,lty=c(2,2,2,1,1,1),bty="n")

f) Posterior moments

Compute \(E(\theta|s_n,A)\) and \(V(\theta|s_n,A)\). Repeat for Priors B and C.

For Priors A and B, it follows directly that the posteriors are Beta\((1+s_n,1+n-s_n)\) and Beta\((a+s_n,b+n-s_n)\), so \[ E(\theta|s_n,A) = \frac{1+s_n}{2+n}=0.6667 \ \ \mbox{and} \ \ E(\theta|s_n,B) = \frac{a+s_n}{a+b+n}=0.6875 \] and \[ V(\theta|s_n,A) = \frac{(1+s_n)(1+n-s_n)}{(2+n)^2(3+n)} = 0.017094 \ \ \ \mbox{and} \ \ \ V(\theta|s_n,B) = \frac{(a+s_n)(b+n-s_n)}{(a+b+n)^2(a+b+n+1)} = 0.0126379 \] such that \(\sqrt{V(\theta|s_n,A)}=0.1307441\) and \(\sqrt{V(\theta|s_n,B)}=0.1124183\).

To compute \(E(\theta|s_n,C)\) and \(V(\theta|s_n,C)\), we first notice that \[ V(\theta|s_n,C) = E(\theta^2|s_n,C)-\{E(\theta|s_n,C)\}^2, \] so we need to approximate by Monte Carlo the following two integrals: \[\begin{eqnarray*} I_1 &=& \int_0^1 \theta p(s_n|\theta)p(\theta|C)d\theta\\ I_2 &=& \int_0^1 \theta^2 p(s_n|\theta)p(\theta|C)d\theta \end{eqnarray*}\] from which \(E(\theta|s_n,C)=I_1/Pr(s_n|C)\) and \(E(\theta^2|s_n,C)=I_2/Pr(s_n|C)\). We repeat the MC scheme used to compute \(Pr(s_n|C)\) and use the same \(M\) draws from the \(U(0,1)\) as follows:

I1.mc = mean(thetas*dbinom(sn,n,thetas)*priorC(thetas,mu,sig))    # MC estimate of I1
I2.mc = mean(thetas^2*dbinom(sn,n,thetas)*priorC(thetas,mu,sig))  # MC estimate of I2
c(I1.mc,I2.mc)
## [1] 0.04038329 0.02890162
E.C  = I1.mc/pred.C
V.C  = I2.mc/pred.C - E.C^2
SD.C = sqrt(V.C)
c(E.C,V.C,SD.C)
## [1] 0.68887851 0.01846488 0.13588554
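The same quadrature idea provides an independent check on these moment estimates (our own addition):

# Quadrature check of I1, I2 and the implied posterior mean and variance
I1.q = integrate(function(th) th  *dbinom(sn,n,th)*priorC(th,mu,sig),0,1)$value
I2.q = integrate(function(th) th^2*dbinom(sn,n,th)*priorC(th,mu,sig),0,1)$value
c(I1.q/pred.C, I2.q/pred.C-(I1.q/pred.C)^2)  # should be close to E.C and V.C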

To sum up, we have the following posterior means and standard deviations for \(\theta\) under Priors A, B and C: \[\begin{eqnarray*} E(\theta|s_n,A) &=& 0.6666667\ \ \ \mbox{and} \ \ V^{1/2}(\theta|s_n,A) = 0.1307441\\ E(\theta|s_n,B) &=& 0.6875\ \ \ \mbox{and} \ \ V^{1/2}(\theta|s_n,B) = 0.1124183\\ E(\theta|s_n,C) &=& 0.6888785\ \ \ \mbox{and} \ \ V^{1/2}(\theta|s_n,C) = 0.1358855 \end{eqnarray*}\]

g) Sampling importance resampling (SIR)

Using SIR, obtain \(M\) draws (here \(M=10^5\), the constant defined above) from the posterior \(p(\theta |s_n,C)\).
Then, use these \(M\) draws to compute MC approximations to \(E(\theta|s_n,C)\) and \(V(\theta|s_n,C)\).
They should match the values you found in f) under Prior C.

# recall that thetas are draws from U(0,1)
# SIR weights
w = dbinom(sn,n,thetas)*priorC(thetas,mu,sig)
# resampling
set.seed(3142)
thetas1 = sample(thetas,size=M,replace=TRUE,prob=w)
E1 = mean(thetas1)
V1 = var(thetas1)
SD1 = sqrt(V1)
mat = cbind(c(E.C,V.C,SD.C),c(E1,V1,SD1))
rownames(mat) = c("Mean","Var","StDev")
colnames(mat) = c("MCI","SIR")
mat
##              MCI        SIR
## Mean  0.68887851 0.68852030
## Var   0.01846488 0.01844278
## StDev 0.13588554 0.13580422
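A standard diagnostic for SIR (our own addition, not part of the exercise) is the effective sample size of the importance weights, \(ESS=(\sum_{i=1}^M w_i)^2/\sum_{i=1}^M w_i^2\); an \(ESS\) far below \(M\) would warn that the resampled draws contain many duplicates:

# Effective sample size of the (unnormalized) SIR weights w
ESS = sum(w)^2/sum(w^2)
c(M, ESS)   # compare ESS with the nominal number of draws M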

h) Compute \(Pr(x_{11}=1|s_{10}=7)\), for Priors \(A\), \(B\) and \(C\)

First, let us derive \(Pr(x_{n+1}=1|s_n)\) for a generic posterior \(p(\theta|s_n)\): \[ Pr(x_{n+1}=1|s_n)=\int_0^1 Pr(x_{n+1}=1|\theta,s_n)p(\theta|s_n)d\theta. \] However, \(x_{n+1}\) and \(s_n\) are independent conditionally on \(\theta\) (this is one of the assumptions of the Bernoulli model we started with). Therefore, \[ Pr(x_{n+1}=1|s_n)=\int_0^1 Pr(x_{n+1}=1|\theta)p(\theta|s_n)d\theta =\int_0^1 \theta p(\theta|s_n)d\theta=E(\theta|s_n). \] Hence, \(Pr(x_{11}=1|s_{10}=7,A)=0.6666667\), \(Pr(x_{11}=1|s_{10}=7,B)=0.6875\) and \(Pr(x_{11}=1|s_{10}=7,C)=0.6888785\). Prior C leads to the (modestly) highest predictive probability of success, followed closely by Prior B.
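In R these are just the posterior means already computed above:

# Pr(x11 = 1 | s10 = 7) under Priors A, B and C
c(E.A, E.B, E.C)   # 0.6667, 0.6875 and 0.6889, as reported above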

i) 2-component mixture of Beta densities

Repeat your derivations and computations from a) to h), assuming now the following 2-component mixture of Beta prior densities for \(\theta\): \[ p(\theta|D) = 0.8\left\{\frac{\Gamma(a_1+b_1)}{\Gamma(a_1)\Gamma(b_1)}\theta^{a_1-1}(1-\theta)^{b_1-1}\right\}+ 0.2\left\{\frac{\Gamma(a_2+b_2)}{\Gamma(a_2)\Gamma(b_2)}\theta^{a_2-1}(1-\theta)^{b_2-1}\right\}, \] for \(a_1=3\), \(b_1=7\), \(a_2=7\) and \(b_2=3\).

Let us go straight to computing the MC approximations via the SIR method.

priorD = function(theta){  # 2-component Beta mixture prior (Prior D)
  0.8*dbeta(theta,3,7)+0.2*dbeta(theta,7,3)
}
w = dbinom(sn,n,thetas)*priorD(thetas)               # SIR weights under Prior D
pred.D = mean(w)                                     # MC estimate of Pr(sn|D)
thetas2 = sample(thetas,replace=TRUE,size=M,prob=w)  # resample posterior draws
hist(thetas2,prob=TRUE,xlim=c(0,1),main="",xlab=expression(theta))
lines(theta,priorD(theta),col=2,lwd=2)
points(sn/n,0,pch=16,col=3,cex=2)

E.D = mean(thetas2)
V.D = var(thetas2)
SD.D = sqrt(V.D)
c(E.D,V.D,SD.D)
## [1] 0.61670088 0.02030052 0.14247988
c(pred.A,pred.B,pred.C,pred.D)
## [1] 0.09090909 0.15984016 0.05862178 0.06350793
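Because each component of Prior D is conjugate, these SIR estimates can be checked in closed form (our own addition): the posterior is again a two-component mixture, \[ p(\theta|s_n,D) = w_1\,Beta(a_1+s_n,b_1+n-s_n) + w_2\,Beta(a_2+s_n,b_2+n-s_n), \] with updated weights \(w_j \propto \pi_j\,Pr(s_n|a_j,b_j)\), where \(\pi_1=0.8\), \(\pi_2=0.2\) and \(Pr(s_n|a_j,b_j)\) is the Beta-Binomial predictive computed by pred() in part c):

# Exact posterior under the mixture Prior D, via component-wise conjugacy
pred1 = pred(sn,n,3,7)                         # Pr(sn) under component 1
pred2 = pred(sn,n,7,3)                         # Pr(sn) under component 2
pred.D.exact = 0.8*pred1 + 0.2*pred2           # exact Pr(sn|D)
w1 = 0.8*pred1/pred.D.exact                    # updated mixture weights
w2 = 1 - w1
E.D.exact = w1*(3+sn)/(3+7+n) + w2*(7+sn)/(7+3+n)  # exact posterior mean
c(pred.D.exact, E.D.exact)                     # close to 0.0635 and 0.6167 above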

Therefore, \[\begin{eqnarray*} E(\theta|s_n,A) &=& 0.6666667\ \ \ \mbox{and} \ \ V^{1/2}(\theta|s_n,A) = 0.1307441\\ E(\theta|s_n,B) &=& 0.6875\ \ \ \mbox{and} \ \ V^{1/2}(\theta|s_n,B) = 0.1124183\\ E(\theta|s_n,C) &=& 0.6888785\ \ \ \mbox{and} \ \ V^{1/2}(\theta|s_n,C) = 0.1358855\\ E(\theta|s_n,D) &=& 0.6167009\ \ \ \mbox{and} \ \ V^{1/2}(\theta|s_n,D) = 0.1424799, \end{eqnarray*}\] and \[\begin{eqnarray*} Pr(s_n|A) &=& 0.0909091\\ Pr(s_n|B) &=& 0.1598402\\ Pr(s_n|C) &=& 0.0586218\\ Pr(s_n|D) &=& 0.0635079 \end{eqnarray*}\]

par(mfrow=c(2,2))
plot(theta,dbeta(theta,1,1),type="l",lwd=2,ylim=c(0,3.5),
     ylab="Density",xlab=expression(theta),main="Prior A")
lines(theta,dbeta(theta,1+sn,1+n-sn),col=2,lwd=2)
legend("topleft",legend=c("Prior","Posterior"),col=1:2,lwd=2,bty="n")

plot(theta,dbeta(theta,4,2),type="l",lwd=2,ylim=c(0,3.5),
     ylab="Density",xlab=expression(theta),main="Prior B")
lines(theta,dbeta(theta,4+sn,2+n-sn),col=2,lwd=2)

hist(thetas1,prob=TRUE,xlim=c(0,1),ylim=c(0,3.5),col=2,
     xlab=expression(theta),main="Prior C")
lines(theta,priorC(theta,mu,sig),lwd=2)
box()

hist(thetas2,prob=TRUE,xlim=c(0,1),ylim=c(0,3.5),col=2,
     xlab=expression(theta),main="Prior D")
lines(theta,priorD(theta),lwd=2)
box()