In class we talked about the Bernoulli model where, conditionally on \(\theta\), \[ x_1,\ldots,x_n \ \ \ \mbox{are i.i.d. Bernoulli}(\theta), \] for \(0 \leq \theta \leq 1\), such that \[ Pr(x_i|\theta)=\theta^{x_i}(1-\theta)^{1-x_i} \qquad \mbox{for} \ \ x_i=0,1, \] and \(i=1,\ldots,n\). Also, let \(s_n=\sum_{i=1}^n x_i\) be the number of successes out of \(n\) trials.
It is relatively easy to show that \(s_n|\theta \sim Binomial(n,\theta)\), i.e. \[ Pr(s_n| \theta) = \frac{n!}{s_n!(n-s_n)!}\theta^{s_n}(1-\theta)^{n-s_n}, \ \mbox{for} \ s_n=0,1,2,\ldots,n, \] and that \(s_n=\sum_{i=1}^n x_i\) is a sufficient statistic for \(\theta\). The likelihood function is therefore \[ L(\theta|s_n) \propto \theta^{s_n}(1-\theta)^{n-s_n}, \] which resembles the kernel of a Beta distribution with parameters \(s_n+1\) and \(n-s_n+1\).
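As a quick numerical sanity check, the joint Bernoulli likelihood of any particular ordering, multiplied by the number of orderings \(\binom{n}{s_n}\), recovers the Binomial pmf. The sequence of five trials and the value of \(\theta\) below are hypothetical, chosen purely for illustration:

```r
# Hypothetical data: a sequence of 5 Bernoulli trials and a value of theta,
# both chosen purely for illustration
theta = 0.3
x = c(1, 0, 1, 1, 0)
n = length(x)
sn = sum(x)
# Joint probability of this particular ordering under i.i.d. Bernoulli(theta)
bern = prod(theta^x * (1 - theta)^(1 - x))
# Multiplying by the number of orderings recovers the Binomial(n, theta) pmf
all.equal(choose(n, sn) * bern, dbinom(sn, n, theta))
```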
Let us first define some constants:
n = 10
sn = 7
a = 4
b = 2
mu = 0
sig = 3
M = 100000
Now, let us consider three prior specifications:
Let us first introduce the Beta distribution. If \(\theta \sim Beta(\alpha,\beta)\), then \[ p(\theta|\alpha,\beta)= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad \alpha,\beta>0, \] with \[ E(\theta) = \frac{\alpha}{\alpha+\beta} \ \ \ \mbox{and} \ \ \ V(\theta) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}. \] We are now ready to introduce three of the four prior specifications studied in this exercise:
Prior A: \(\theta \sim Uniform(0,1)\) (this is actually Prior B with \(a=b=1\)),
Prior B: \(\theta \sim Beta(a,b)\) (below we will assume the values \(a=4\) and \(b=2\)),
Prior C: \(\log(\theta/(1-\theta) )\sim Normal(\mu,\sigma^2)\) (below we will use \(\mu=0\) and \(\sigma^2=9\)).
It is easy to see that, for \(\theta \in (0,1)\), \[\begin{eqnarray*} p(\theta|A) &=& 1\\ p(\theta|B) &=& \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1}\\ p(\theta|C) &=& (2\pi\sigma^2)^{-1/2}\exp\left\{-\frac{1}{2\sigma^2}\left[\log\left(\frac{\theta}{1-\theta}\right)-\mu\right]^2\right\} \frac{1}{\theta(1-\theta)} \end{eqnarray*}\]
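The density \(p(\theta|C)\) follows from the change of variables \(\eta=\log(\theta/(1-\theta))\) with \(\eta \sim N(\mu,\sigma^2)\); the factor \(1/[\theta(1-\theta)]\) is the Jacobian. A quick numerical check (with the assumed values \(\mu=0\) and \(\sigma=3\)) confirms that it integrates to one over \((0,1)\):

```r
# Sanity check of p(theta|C): the logit-normal density obtained by the change
# of variables eta = log(theta/(1-theta)), eta ~ N(mu, sig^2); it should
# integrate to 1 over (0,1). Here mu = 0 and sig = 3 are the assumed values.
priorC = function(theta, mu, sig) {
  dnorm(log(theta/(1 - theta)), mu, sig) / (theta * (1 - theta))
}
integrate(priorC, lower = 0, upper = 1, mu = 0, sig = 3)$value
```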
When working on the following questions, you should assume that \(n=10\) Bernoulli trials lead to \(s_{10}=7\) successes.
Combining \(L(\theta|s_n)\) with \(p(\theta|A)\) or \(p(\theta|B)\): \[ p(\theta |s_n,A) \propto\theta^{s_n}(1-\theta)^{n-s_n} \ \ \ \mbox{and} \ \ \ p(\theta |s_n,B) \propto\theta^{(a+s_n)-1}(1-\theta)^{(b+n-s_n)-1}, \] which are the kernels of the \(Beta(s_n+1,n-s_n+1)\) and \(Beta(a+s_n,b+n-s_n)\) distributions, respectively.
Let us compute these quantities:
a.A = sn+1
b.A = n-sn+1
a.B = a+sn
b.B = b+n-sn
# Posterior means, variances and standard deviations
E.A = a.A/(a.A+b.A)
V.A = a.A*b.A/((a.A+b.A)^2*(a.A+b.A+1))
SD.A = sqrt(V.A)
E.B = a.B/(a.B+b.B)
V.B = a.B*b.B/((a.B+b.B)^2*(a.B+b.B+1))
SD.B = sqrt(V.B)
Therefore, \[ (\theta|s_{10}=7,A) \sim Beta(8,4) \ \ \ \mbox{and} \ \ \ (\theta|s_{10}=7,B) \sim Beta(11,5). \]
These quantities are known scalars that represent the predictive densities of the observed data under the Bernoulli model and Prior A or Prior B, respectively. Since both Prior A and Prior B are conjugate to the likelihood function, the easiest way to compute both predictive densities is as follows (shown for Prior B): \[ Pr(s_n) = \frac{Pr(s_n|\theta)p(\theta)}{p(\theta|s_n)} = \frac{\left(\frac{n!}{s_n!(n-s_n)!}\theta^{s_n}(1-\theta)^{n-s_n}\right) \left(\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1}\right)} {\frac{\Gamma(a+b+n)}{\Gamma(a+s_n)\Gamma(b+n-s_n)}\theta^{(a+s_n)-1}(1-\theta)^{(b+n-s_n)-1}}= \frac{\frac{n!}{s_n!(n-s_n)!} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}} {\frac{\Gamma(a+b+n)}{\Gamma(a+s_n)\Gamma(b+n-s_n)}} \]
pred = function(sn,n,a,b){
num = choose(n,sn)*gamma(a+b)/gamma(a)/gamma(b)
den = gamma(a+b+n)/gamma(a+sn)/gamma(b+n-sn)
return(num/den)
}
pred.A = pred(sn,n,1,1)
pred.B = pred(sn,n,a,b)
The predictive densities based on Priors A and B are \[ Pr(s_{10}=7|A) = 0.0909091 \ \ \ \mbox{and} \ \ \ Pr(s_{10}=7|B) = 0.1598402. \] Note 1: Since Prior A is \(U(0,1)\), it is easy to see that \(Pr(s_n=k|A)=1/(n+1)\) for \(k=0,\ldots,n\), i.e. constant across all \(n+1\) possible values of \(s_n\). Therefore, \(s_n|A\) follows a discrete uniform distribution on the set \(\{0,1,\ldots,n\}\).
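Note 1 can be verified in one line, using the predictive density formula derived above (the `pred` function is repeated here so the snippet is self-contained):

```r
# Predictive density of sn under a Beta(a, b) prior and Binomial(n, theta) model
pred = function(sn, n, a, b){
  num = choose(n, sn) * gamma(a + b) / gamma(a) / gamma(b)
  den = gamma(a + b + n) / gamma(a + sn) / gamma(b + n - sn)
  num / den
}
n = 10
# Under Prior A (Beta(1,1)), every value of sn gets probability 1/(n+1)
all.equal(pred(0:n, n, 1, 1), rep(1/(n + 1), n + 1))
```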
Note 2: \(Pr(s_{10}=7|A)<Pr(s_{10}=7|B)\) indicates that, under the binomial model, the observed data \(s_{10}=7\) have higher predictive value, or higher marginal likelihood (marginalized over \(\theta\)), under Prior B. These quantities are useful for model comparison: the Bayes factor is the ratio of marginal likelihoods, \[ \frac{Pr(s_{10}=7|A)}{Pr(s_{10}=7|B)}=\frac{0.0909091}{0.1598402} = 0.56875, \] which does not show overwhelming evidence in favor of Prior B over Prior A.
For any Beta prior with parameters \(a\) and \(b\), \(Beta(a,b)\), combined with the Binomial model, \(Binomial(n,\theta)\), we can write down the whole predictive density before observing any data: \[ Pr(s_n|n,a,b) = \frac{\frac{n!}{s_n!(n-s_n)!}\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}}{\frac{\Gamma(a+b+n)}{\Gamma(a+s_n)\Gamma(b+n-s_n)}}, \] for \(s_n=0,1,\ldots,n\). Here is a bit of R code:
sns = 0:n
preds.A = pred(sns,n,1,1)
preds.B = pred(sns,n,4,2)
BF.AB = preds.A/preds.B
par(mfrow=c(1,2))
plot(sns,preds.B,type="h",xlab="Successes (sn)",ylab="Predictive",lwd=2)
lines(sns+0.25,preds.A,col=2,lwd=2,type="h")
points(sn,0,col=3,pch=16,cex=2)
legend("topleft",legend=c("Predictive A","Predictive B"),
col=2:1,lty=1,lwd=2,bty="n")
plot(sns,BF.AB,xlab="Successes (sn)",ylab="Bayes factor of A vs B",type="h",lwd=2)
abline(h=1,lty=2)
From these Bayes factors, we can argue that, had we observed \(s_{10} \leq 2\), one might be inclined to use Prior A instead of Prior B, while for \(s_{10}>2\) there is little evidence in either direction.
These are the complicated ones, as far as computation is concerned, since the prior density is not conjugate to the likelihood function. Therefore, you will need to approximate the denominator of \[ p(\theta |s_n,C) = \frac{p(s_n|\theta)p(\theta|C)}{\int_0^1 p(s_n|\theta)p(\theta|C)d\theta} = \frac{p(s_n|\theta)p(\theta|C)}{Pr(s_n|C)}. \] Approximate \(Pr(s_n|C)\) by simple Monte Carlo integration.
\[\begin{eqnarray*} I = Pr(s_n|C) &=& \int_0^1 p(s_n|\theta)p(\theta|C)d\theta\\ &=& \int_0^1 \frac{n!}{s_n!(n-s_n)!}\theta^{s_n}(1-\theta)^{n-s_n} (2\pi\sigma^2)^{-1/2}\exp\left\{-\frac{1}{2\sigma^2}\left[\log\left(\frac{\theta}{1-\theta}\right)-\mu\right]^2\right\} \frac{1}{\theta(1-\theta)} d\theta, \end{eqnarray*}\] which can be approximated by Monte Carlo integration: sample \(\theta^{(1)},\ldots,\theta^{(M)}\), for large \(M\), say \(M=10^{5}\), from a proposal or candidate distribution. The natural choice in this context is the Uniform\((0,1)\): \[ \hat{I}_{MC} = \frac{1}{M} \sum_{i=1}^M p(s_n|\theta^{(i)})p(\theta^{(i)}|C). \] Large sample theory guarantees that \(\hat{I}_{MC} \longrightarrow I\) as \(M \rightarrow \infty\).
priorC = function(theta,mu,sig){
dnorm(log(theta/(1-theta)),mu,sig)/(theta*(1-theta))
}
set.seed(2468)
thetas = runif(M)
eval = dbinom(sn,n,thetas)*priorC(thetas,mu,sig)
preds.C = cumsum(eval)/1:M
pred.C = preds.C[M]
MC.error = sqrt(mean((eval-pred.C)^2)/M)
lim = c(pred.C-2*MC.error,pred.C+2*MC.error)
c(pred.C,MC.error,lim)
## [1] 0.0586217835 0.0001925212 0.0582367411 0.0590068259
par(mfrow=c(1,1))
ts.plot(preds.C,xlab="Number of Monte Carlo simulations",ylab="Predictive C",ylim=range(preds.C[100:M],lim))
abline(h=pred.C,col=2)
abline(h=lim[1],col=2,lty=2)
abline(h=lim[2],col=2,lty=2)
theta = seq(0,1,length=1000)
plot(theta,dbeta(theta,a.B,b.B),type="l",xlab=expression(theta),
ylab="Density",lwd=2,col=2)
lines(theta,dbeta(theta,a.A,b.A),col=1,lwd=2)
lines(theta,dbeta(theta,a,b),col=2,lwd=2,lty=2)
lines(theta,dbeta(theta,1,1),col=1,lwd=2,lty=2)
lines(theta,priorC(theta,mu,sig),col=4,lwd=2,lty=2)
lines(theta,dbinom(sn,n,theta)*priorC(theta,mu,sig)/pred.C,col=4,lwd=2)
points(sn/n,0,pch=16,col=3,cex=2)
legend("top",legend=c("Prior A","Prior B","Prior C","Posterior A","Posterior B",
"Posterior C"),col=c(1,2,4,1,2,4),lwd=2,lty=c(2,2,2,1,1,1),bty="n")
Compute \(E(\theta|s_n,A)\) and \(V(\theta|s_n,A)\). Repeat for Priors B and C.
For Priors A and B, it follows directly that the posteriors are Beta\((1+s_n,1+n-s_n)\) and Beta\((a+s_n,b+n-s_n)\), so \[ E(\theta|s_n,A) = \frac{1+s_n}{2+n}=0.6667 \ \ \mbox{and} \ \ E(\theta|s_n,B) = \frac{a+s_n}{a+b+n}=0.6875 \] and \[ V(\theta|s_n,A) = \frac{(1+s_n)(1+n-s_n)}{(2+n)^2(3+n)} = 0.017094 \ \ \ \mbox{and} \ \ \ V(\theta|s_n,B) = \frac{(a+s_n)(b+n-s_n)}{(a+b+n)^2(a+b+n+1)} = 0.0126379 \] such that \(\sqrt{V(\theta|s_n,A)}=0.1307441\) and \(\sqrt{V(\theta|s_n,B)}=0.1124183\).
To compute \(E(\theta|s_n,C)\) and \(V(\theta|s_n,C)\), we first note that \[ V(\theta|s_n,C) = E(\theta^2|s_n,C)-\{E(\theta|s_n,C)\}^2, \] so we need to approximate by Monte Carlo the following two integrals: \[\begin{eqnarray*} I_1 &=& \int_0^1 \theta p(s_n|\theta)p(\theta|C)d\theta\\ I_2 &=& \int_0^1 \theta^2 p(s_n|\theta)p(\theta|C)d\theta. \end{eqnarray*}\] We repeat the MC scheme used to compute \(Pr(s_n|C)\) and use the same \(M\) draws from the \(U(0,1)\) as follows:
I1.mc = mean(thetas*dbinom(sn,n,thetas)*priorC(thetas,mu,sig))
I2.mc = mean(thetas^2*dbinom(sn,n,thetas)*priorC(thetas,mu,sig))
c(I1.mc,I2.mc)
## [1] 0.04038329 0.02890162
E.C = I1.mc/pred.C
V.C = I2.mc/pred.C - E.C^2
SD.C = sqrt(V.C)
c(E.C,V.C,SD.C)
## [1] 0.68887851 0.01846488 0.13588554
To sum up, we have the following posterior means and standard deviations for \(\theta\) under Priors A, B and C: \[\begin{eqnarray*} E(\theta|s_n,A) &=& 0.6666667\ \ \ \mbox{and} \ \ V^{1/2}(\theta|s_n,A) = 0.1307441\\ E(\theta|s_n,B) &=& 0.6875\ \ \ \mbox{and} \ \ V^{1/2}(\theta|s_n,B) = 0.1124183\\ E(\theta|s_n,C) &=& 0.6888785\ \ \ \mbox{and} \ \ V^{1/2}(\theta|s_n,C) = 0.1358855 \end{eqnarray*}\]
Using SIR, obtain \(M=10^5\) draws from the posterior \(p(\theta |s_n,C)\).
Then, use these \(M\) draws to compute MC approximations to \(E(\theta|s_n,C)\) and \(V(\theta|s_n,C)\).
They should match the ones you found in f) under prior C.
# recall that thetas are draws from the U(0,1)
# SIR weights
w = dbinom(sn,n,thetas)*priorC(thetas,mu,sig)
# resampling
set.seed(3142)
thetas1 = sample(thetas,size=M,replace=TRUE,prob=w)
E1 = mean(thetas1)
V1 = var(thetas1)
SD1 = sqrt(V1)
mat = cbind(c(E.C,V.C,SD.C),c(E1,V1,SD1))
rownames(mat) = c("Mean","Var","StDev")
colnames(mat) = c("MCI","SIR")
mat
## MCI SIR
## Mean 0.68887851 0.68852030
## Var 0.01846488 0.01844278
## StDev 0.13588554 0.13580422
First, let us derive \(Pr(x_{n+1}=1|s_n)\) for a generic posterior \(p(\theta|s_n)\): \[ Pr(x_{n+1}=1|s_n)=\int_0^1 Pr(x_{n+1}=1|\theta,s_n)p(\theta|s_n)d\theta. \] However, \(x_{n+1}\) and \(s_n\) are independent conditionally on \(\theta\) (this is one of the assumptions of the Bernoulli model we started with). Therefore, \[ Pr(x_{n+1}=1|s_n)=\int_0^1 Pr(x_{n+1}=1|\theta)p(\theta|s_n)d\theta =\int_0^1 \theta p(\theta|s_n)d\theta=E(\theta|s_n). \] Hence, \(Pr(x_{11}=1|s_{10}=7,A)=0.6666667\), \(Pr(x_{11}=1|s_{10}=7,B)=0.6875\) and \(Pr(x_{11}=1|s_{10}=7,C)=0.6888785\). Priors B and C lead to (modestly) higher predictive probabilities of success on the next trial than Prior A.
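The identity \(Pr(x_{n+1}=1|s_n)=E(\theta|s_n)\) can also be checked numerically; the sketch below uses the Beta\((11,5)\) posterior obtained under Prior B with \(n=10\) and \(s_{10}=7\):

```r
# Numerical check that Pr(x_{n+1} = 1 | s_n) = E(theta | s_n), using the
# Beta(11, 5) posterior under Prior B (n = 10, sn = 7)
a.post = 11
b.post = 5
post.mean = a.post / (a.post + b.post)   # closed-form posterior mean, 0.6875
# Integrate theta against the posterior density
pred.next = integrate(function(t) t * dbeta(t, a.post, b.post), 0, 1)$value
all.equal(pred.next, post.mean, tolerance = 1e-6)
```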
Repeat your derivations and computations from a) to h) assuming now the following 2-component mixture of Beta prior densities for \(\theta\), i.e. \[ p(\theta|D) = 0.8\left\{\frac{\Gamma(a_1+b_1)}{\Gamma(a_1)\Gamma(b_1)}\theta^{a_1-1}(1-\theta)^{b_1-1}\right\}+ 0.2\left\{\frac{\Gamma(a_2+b_2)}{\Gamma(a_2)\Gamma(b_2)}\theta^{a_2-1}(1-\theta)^{b_2-1}\right\}, \] for \(a_1=3\), \(b_1=7\), \(a_2=7\) and \(b_2=3\).
Let us go straight to computing the MC approximations via the SIR method.
priorD = function(theta){
0.8*dbeta(theta,3,7)+0.2*dbeta(theta,7,3)
}
w = dbinom(sn,n,thetas)*priorD(thetas)
pred.D = mean(w)
thetas2 = sample(thetas,replace=TRUE,size=M,prob=w)
hist(thetas2,prob=TRUE,xlim=c(0,1),main="",xlab=expression(theta))
lines(theta,priorD(theta),col=2,lwd=2)
points(sn/n,0,pch=16,col=3,cex=2)
E.D = mean(thetas2)
V.D = var(thetas2)
SD.D = sqrt(V.D)
c(E.D,V.D,SD.D)
## [1] 0.61670088 0.02030052 0.14247988
c(pred.A,pred.B,pred.C,pred.D)
## [1] 0.09090909 0.15984016 0.05862178 0.06350793
Therefore, \[\begin{eqnarray*} E(\theta|s_n,A) &=& 0.6666667\ \ \ \mbox{and} \ \ V^{1/2}(\theta|s_n,A) = 0.1307441\\ E(\theta|s_n,B) &=& 0.6875\ \ \ \mbox{and} \ \ V^{1/2}(\theta|s_n,B) = 0.1124183\\ E(\theta|s_n,C) &=& 0.6888785\ \ \ \mbox{and} \ \ V^{1/2}(\theta|s_n,C) = 0.1358855\\ E(\theta|s_n,D) &=& 0.6167009\ \ \ \mbox{and} \ \ V^{1/2}(\theta|s_n,D) = 0.1424799, \end{eqnarray*}\] and \[\begin{eqnarray*} Pr(s_n|A) &=& 0.0909091\\ Pr(s_n|B) &=& 0.1598402\\ Pr(s_n|C) &=& 0.0586218\\ Pr(s_n|D) &=& 0.0635079 \end{eqnarray*}\]
par(mfrow=c(2,2))
plot(theta,dbeta(theta,1,1),type="l",lwd=2,ylim=c(0,3.5),
ylab="Density",xlab=expression(theta),main="Prior A")
lines(theta,dbeta(theta,1+sn,1+n-sn),col=2,lwd=2)
legend("topleft",legend=c("Prior","Posterior"),col=1:2,lwd=2,bty="n")
plot(theta,dbeta(theta,4,2),type="l",lwd=2,ylim=c(0,3.5),
ylab="Density",xlab=expression(theta),main="Prior B")
lines(theta,dbeta(theta,4+sn,2+n-sn),col=2,lwd=2)
hist(thetas1,prob=TRUE,xlim=c(0,1),ylim=c(0,3.5),col=2,
xlab=expression(theta),main="Prior C")
lines(theta,priorC(theta,mu,sig),lwd=2)
box()
hist(thetas2,prob=TRUE,xlim=c(0,1),ylim=c(0,3.5),col=2,
xlab=expression(theta),main="Prior D")
lines(theta,priorD(theta),lwd=2)
box()