1 iid Bernoulli model

Here we assume that the data \(y_{1:n}=\{y_1,\ldots,y_n\}\) is modeled as a random sample from \(Bernoulli(\theta)\). In our classroom example, the binary random variable \(y_i=1\) if individual \(i\) (attending the class) has three or more siblings, including himself or herself, and \(y_i=0\) otherwise. One of the main goals is to assess the uncertainty around \(\theta\) via the posterior \(p(\theta|y_{1:n})\), and then obtain \(Pr(y_{n+1}=1|y_{1:n})\), the probability that the \((n+1)\)st individual has three or more siblings, including himself or herself.

1.1 Maximum likelihood estimation

From basic likelihood-based statistical inference, it is easy to see that \[ {\widehat \theta}_{mle} \equiv \mbox{arg}\max_{\theta \in (0,1)} L(\theta;y_{1:n}), \] where the likelihood function \(L(\theta;\mbox{data})\) is given by \[ L(\theta;\mbox{data}) = \prod_{i=1}^n \theta^{y_i}(1-\theta)^{1-y_i} = \theta^{s_n}(1-\theta)^{n-s_n}, \] where \(s_n = \sum_{i=1}^n y_i\) is the number of successes out of \(n\) (Bernoulli) trials. We compute the MLE of \(\theta\) based on \(n=10\) observations and \(y_{1:10}=\{0,0,0,1,0,1,0,1,0,1\}\), so \(s_{10}=4\) and \({\widehat \theta}_{mle}=4/10=0.4\).

n = 10
y = c(0,0,0,1,0,1,0,1,0,1)
s = sum(y==1)

mle = s/n
c(n,s,mle)
## [1] 10.0  4.0  0.4
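As a quick sanity check, the closed-form MLE \(s_n/n\) can be recovered numerically by maximizing the log-likelihood with R's optimize (a sketch; the function name loglik is just illustrative):

```r
# Bernoulli log-likelihood as a function of theta
loglik = function(theta, s, n) s*log(theta) + (n-s)*log(1-theta)

# Numerical maximization over (0,1); should agree with s/n = 4/10 = 0.4
opt = optimize(loglik, interval=c(0.001,0.999), maximum=TRUE, s=4, n=10)
opt$maximum
```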

1.2 Prior, likelihood and posterior for \(\theta\)

We have elicited a prior for \(\theta\) in class (with the help of Vitória) and arrived at the following Beta distribution: \(\theta \sim Beta(a,b)\) where \(a=2\) and \(b=5\), such that \[\begin{eqnarray*} E(\theta) &=& \frac{a}{a+b}\\ Mode(\theta) &=& \frac{a-1}{a+b-2}, \quad \qquad a,b>1\\ Median(\theta) &=& \frac{a-1/3}{a+b-2/3}, \qquad a,b>1 \ \ \mbox{(approximation)} \end{eqnarray*}\]

a = 2
b = 5
E = a/(a+b)
Mo = (a-1)/(a+b-2)
Me = (a-1/3)/(a+b-2/3)
c(E,Mo,Me)
## [1] 0.2857143 0.2000000 0.2631579
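Since the median formula above is only an approximation, it can be compared against the exact Beta quantile (a quick check using R's qbeta):

```r
a = 2; b = 5
Me.approx = (a-1/3)/(a+b-2/3)   # approximate median formula
Me.exact  = qbeta(0.5, a, b)    # exact median via the Beta quantile function
c(Me.approx, Me.exact)          # approx 0.263 vs 0.264
```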

The Beta prior is conjugate to the Bernoulli likelihood, which means that the posterior for \(\theta\) also follows a Beta distribution. The conjugacy property makes the Bayesian updating straightforward: \[\begin{eqnarray*} p(\theta|y_{1:n},a,b) &\equiv& p(\theta|n,s_n,a,b) = \frac{\left\{\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1}\right\} \theta^{s_n}(1-\theta)^{n-s_n}}{p(y_{1:n}|a,b)}\\ &\propto& \theta^{a+s_n-1}(1-\theta)^{b+n-s_n-1} \equiv \theta^{a_n-1}(1-\theta)^{b_n-1}, \end{eqnarray*}\] where \[ p(y_{1:n}|a,b)=\int_0^1 \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \theta^{a-1}(1-\theta)^{b-1} \theta^{s_n}(1-\theta)^{n-s_n}d\theta = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \frac{\Gamma(a+s_n)\Gamma(b+n-s_n)}{\Gamma(a+b+n)}, \] is the normalizing constant, also known as the marginal likelihood or prior predictive, or even simply the predictive.

It is easy to see that the posterior distribution of \(\theta\) is Beta with hyperparameters \[ a_n = a + s_n \ \ \ \mbox{and} \ \ \ b_n = b + n - s_n. \] Therefore, \[\begin{eqnarray*} E(\theta|y_{1:n},a,b) &=& \frac{a+s_n}{a+b+n},\\ Mode(\theta|y_{1:n},a,b) &=& \frac{a+s_n-1}{a+b+n-2},\\ Median(\theta|y_{1:n},a,b) &=& \frac{a+s_n-1/3}{a+b+n-2/3}. \end{eqnarray*}\]

a1 = a+s
b1 = b+n-s
E1 = a1/(a1+b1)
Mo1 = (a1-1)/(a1+b1-2)
Me1 = (a1-1/3)/(a1+b1-2/3)
tab = rbind(c(E,Mo,Me),c(E1,Mo1,Me1))
colnames(tab) = c("Mean","Mode","Median")
rownames(tab) = c("Prior","Posterior")
round(tab,3)
##            Mean  Mode Median
## Prior     0.286 0.200  0.263
## Posterior 0.353 0.333  0.347
predictive1 = (gamma(a+b)/(gamma(a)*gamma(b)))*gamma(a1)*gamma(b1)/gamma(a1+b1)
predictive1
## [1] 0.0006243756
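The closed-form marginal likelihood above can be double-checked by integrating prior times likelihood numerically (a sketch using R's integrate):

```r
a = 2; b = 5; n = 10; s = 4

# Integrand: Beta(a,b) prior density times the Bernoulli likelihood
integrand = function(theta) dbeta(theta, a, b) * theta^s * (1-theta)^(n-s)
num = integrate(integrand, 0, 1)$value

# Closed-form expression from the ratio of Gamma functions
exact = (gamma(a+b)/(gamma(a)*gamma(b))) * gamma(a+s)*gamma(b+n-s)/gamma(a+b+n)
c(num, exact)   # both approximately 0.000624
```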

1.3 Output: Prior and posterior densities

theta = seq(0,1,length=1000)
par(mfrow=c(1,1))
plot(theta,dbeta(theta,a,b),type="l",lwd=2,ylim=c(0,4),xlab=expression(theta),ylab="Density")
lines(theta,dbeta(theta,a1,b1),col=2,lwd=2)
abline(v=E)
abline(v=Mo,lty=2)
abline(v=Me,lty=3)
abline(v=E1,col=2)
abline(v=Mo1,lty=2,col=2)
abline(v=Me1,lty=3,col=2)
abline(v=mle,col=3,lwd=2)
legend("topright",legend=c("Prior density","Posterior density"),col=1:2,lty=1,bty="n",lwd=2)
legend("right",legend=c("Mean","Mode","Median","MLE"),lty=c(1:3,1),col=c(1,1,1,3),bty="n",lwd=2)

2 Logit Bernoulli model

We now bring into our modeling exercise the covariate \(x_i\), which is also binary: \(x_i=0\) if the age of respondent \(i\) is less than 30, while \(x_i=1\) if the age is greater than or equal to 30. This more complex model is given by \[ y_i|x_i,\theta \sim Bernoulli(\theta_i), \] where \[ \log\left(\frac{\theta_i}{1-\theta_i}\right) = \alpha + \beta x_i, \] such that \[\begin{eqnarray*} \theta_i &=& \frac{1}{1+\exp\{-\alpha\}} = \pi_0, \qquad \mbox{when} \ \ x_i=0\\ \theta_i &=& \frac{1}{1+\exp\{-(\alpha+\beta)\}} = \pi_1, \qquad \mbox{when} \ \ x_i=1. \end{eqnarray*}\]

In words, \(\theta_i\) can take only two possible values, the probabilities \(\pi_0\) and \(\pi_1\), which play the role of the single probability \(\theta\) from the iid Bernoulli model.
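A minimal illustration of this collapse, using R's built-in inverse logit plogis and illustrative values for \(\alpha\) and \(\beta\):

```r
alpha = -0.7; beta = 1.4              # illustrative coefficient values
x = c(1,0,0,1,0,0,0,1,0,1)            # binary covariate from the classroom data
theta = plogis(alpha + beta*x)        # inverse logit: 1/(1+exp(-(alpha+beta*x)))
unique(theta)                         # only two distinct values: pi1 and pi0
```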

n  = 10
y  = c(0,0,0,1,0,1,0,1,0,1)
x  = c(1,0,0,1,0,0,0,1,0,1)
n0 = sum(x==0)
n1 = sum(x==1)
s0 = sum(y[x==0])
s1 = sum(y[x==1])
c(n0,n1,s0,s1)
## [1] 6 4 1 3

2.1 Posterior inference

Posterior inference for \(\pi_0\) and \(\pi_1\) (or \(\alpha,\beta\)) is unavailable in closed form. We will use a Monte Carlo method (here, sampling importance resampling, SIR) to approximate the posterior distribution and compute posterior means, modes, medians and other posterior summaries as needed.

2.2 Prior on \((\alpha,\beta)\)

We will assume \(\alpha\) and \(\beta\) are, a priori, independent Gaussians: \[ p(\alpha,\beta)=p(\alpha|\mu_\alpha,\sigma_\alpha^2)p(\beta|\mu_\beta,\sigma_\beta^2), \] with hyperparameters \(\mu_\alpha=-0.7\), \(\mu_\beta=1.4\) and \(\sigma_\mu=\sigma_\beta=1\). The intuition is that these marginal priors induce prior means for \(\pi_0\) and \(\pi_1\) around the following values: \[ \frac{1}{1+\exp\{0.7\}} \approx 1/3 \qquad \mbox{and} \qquad \frac{1}{1+\exp\{0.7-1.4\}} \approx 2/3. \] Below we show draws from the induced prior on \((\pi_0,\pi_1)\).
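The two stated approximations are quick to verify numerically (plogis is the inverse logit):

```r
# plogis(-0.7) = 1/(1+exp(0.7)); plogis(0.7) = 1/(1+exp(0.7-1.4))
c(plogis(-0.7), 1/3)          # approx 0.332 vs 0.333
c(plogis(-0.7 + 1.4), 2/3)    # approx 0.668 vs 0.667
```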

M     = 10000
alpha = rnorm(M,-0.7,1)
beta  = rnorm(M,1.4,1)

# Induced prior on (pi0,pi1)
pi0 = 1/(1+exp(-alpha))
pi1 = 1/(1+exp(-alpha-beta))

par(mfrow=c(2,2))
plot(alpha,beta,xlab=expression(alpha),ylab=expression(beta))
boxplot(alpha,alpha+beta,outline=FALSE,names=c(expression(alpha),expression(alpha+beta)))
plot(pi0,pi1,xlim=c(0,1),ylim=c(0,1),xlab=expression(pi[0]),ylab=expression(pi[1]))
abline(0,1,col=2,lwd=2)
boxplot(1/(1+exp(-alpha)),1/(1+exp(-alpha-beta)),names=c("Age<30 (i=0)","Age>=30 (i=1)"),ylab=expression(pi[i]))

2.3 Monte Carlo-based posterior inference

The posterior distribution of \((\alpha,\beta)\) has no known closed form and is analytically intractable: \[ p(\alpha,\beta|y_{1:n},n,\mu_\alpha,\sigma_\alpha,\mu_\beta,\sigma_\beta) \propto p(\alpha,\beta|\mu_\alpha,\sigma_\alpha,\mu_\beta,\sigma_\beta)\prod_{i=1}^n \theta_i^{y_i}(1-\theta_i)^{1-y_i}. \] Recall that \(\theta_i = 1/(1+\exp\{-(\alpha+\beta x_i)\})\), for \(i=1,\ldots,n\), so the likelihood is a function of \((\alpha,\beta)\).

As we will see throughout this course, Monte Carlo-based posterior inference has become the most common scheme for accurately approximating highly complex posterior distributions. When the dimension of the parameter is relatively low (here we have \(\alpha\) and \(\beta\), so it is 2-dimensional), a simple strategy is to sample draws from the prior and resample those draws with weights proportional to the likelihood, which is the well-known sampling importance resampling (SIR) scheme. The course notes are full of examples illustrating the implementation, and drawbacks, of SIR-based posterior inference. For now, let us just assume the following piece of code does the magic:

like = rep(0,M)
for (i in 1:M)
  like[i] = pi0[i]^s0*(1-pi0[i])^(n0-s0)*pi1[i]^s1*(1-pi1[i])^(n1-s1)
predictive2 = mean(like)
ind = sample(1:M,size=M,replace=TRUE,prob=like)
pi0.post = pi0[ind]
pi1.post = pi1[ind]
alpha.post = alpha[ind]
beta.post = beta[ind]
predictive2
## [1] 0.002290277

The variables alpha.post and beta.post are samples (draws) from the joint posterior distribution \(p(\alpha,\beta|y_{1:n},n,\mu_\alpha,\sigma_\alpha,\mu_\beta,\sigma_\beta)\), while pi0.post and pi1.post are the corresponding draws from the induced posterior of \((\pi_0,\pi_1)\).
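Since predictive2 is a plain Monte Carlo average of the likelihood over prior draws, its accuracy can be gauged by its Monte Carlo standard error. The sketch below recomputes it from scratch with a fixed seed (so the numbers differ slightly from those above) and evaluates the likelihood on the log scale, which is numerically safer than the product form for larger \(n\):

```r
set.seed(123)
M = 10000
alpha = rnorm(M, -0.7, 1); beta = rnorm(M, 1.4, 1)
pi0 = plogis(alpha); pi1 = plogis(alpha + beta)
n0 = 6; s0 = 1; n1 = 4; s1 = 3            # sufficient statistics from the data

# Log-likelihood of each prior draw, then exponentiate
loglike = s0*log(pi0) + (n0-s0)*log(1-pi0) + s1*log(pi1) + (n1-s1)*log(1-pi1)
like = exp(loglike)

pred  = mean(like)            # Monte Carlo estimate of the prior predictive
mc.se = sd(like)/sqrt(M)      # its Monte Carlo standard error
c(pred, mc.se)                # pred is around 0.0023, with a small standard error
```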

2.4 Graphical output

par(mfrow=c(2,2))
plot(pi0,pi1,xlim=c(0,1),ylim=c(0,1),xlab=expression(pi[0]),ylab=expression(pi[1]))
points(pi0.post,pi1.post,col=2)
legend("bottomright",legend=c("Prior draws","Posterior draws"),col=1:2,pch=16,bty="n")

plot(alpha,beta,xlab=expression(alpha),ylab=expression(beta))
points(alpha.post,beta.post,col=2)
title(paste("Correlation=",round(cor(alpha.post,beta.post),2),sep=""))

plot(density(pi0),ylim=c(0,4),xlim=c(0,1),main="",xlab=expression(pi[0]))
lines(density(pi0.post),col=2)
legend("topright",legend=c(expression(x[i]==0)))

plot(density(pi1),ylim=c(0,3),xlab=expression(pi[1]),xlim=c(0,1),main="")
lines(density(pi1.post),col=2)
legend("topright",legend=c(expression(x[i]==1)))

3 Comparing the two Bernoulli models

par(mfrow=c(1,1))
plot(theta,dbeta(theta,a1,b1),type="l",lwd=2,ylim=c(0,4),
     xlab="Probability of success",ylab="Posterior density")
lines(density(pi0[ind]),col=2,lwd=2)
lines(density(pi1[ind]),col=3,lwd=2)
legend("topright",legend=c(expression(theta),expression(pi[0]),expression(pi[1])),lty=1,lwd=2,col=1:3,bty="n")

A few commonly used posterior summaries are as follows:

tab = rbind(
qbeta(c(0.05,0.5,0.95),a1,b1),
quantile(pi0[ind],c(0.05,0.5,0.95)),
quantile(pi1[ind],c(0.05,0.5,0.95)))
rownames(tab) = c("theta","pi0","pi1")
round(tab,2)
##         5%  50%  95%
## theta 0.18 0.35 0.55
## pi0   0.11 0.27 0.50
## pi1   0.36 0.68 0.90

3.1 Bayes factor

Let us call the iid Bernoulli model \(M_1\) and the logit Bernoulli model \(M_2\). Then, the Bayes factor is defined as \[ BF_{12} = \frac{p(y_{1:n}|a,b)}{p(y_{1:n}|x_{1:n},\mu_\alpha,\sigma_\alpha,\mu_\beta,\sigma_\beta)} \approx \frac{p(y_{1:n}|a,b)}{p_{SIR}(y_{1:n}|x_{1:n},\mu_\alpha,\sigma_\alpha,\mu_\beta,\sigma_\beta)}, \] where \(p_{SIR}(y_{1:n}|x_{1:n},\mu_\alpha,\sigma_\alpha,\mu_\beta,\sigma_\beta)\) is the SIR-based approximation to the prior predictive under the logit Bernoulli model. We will explain why that is correct later!

c(predictive1,predictive2)
## [1] 0.0006243756 0.0022902770
BF12 = predictive1/predictive2
BF21 = 1/BF12
BF21
## [1] 3.668108

In a review paper entitled Bayes Factors, published in JASA in 1995 (Volume 90, Issue 430), Robert E. Kass and Adrian E. Raftery suggested the following rule of thumb to quantify the evidence against \(M_1\):

  • \(BF_{21} \in (1,3)\): Not worth more than a bare mention
  • \(BF_{21} \in (3,20)\): Positive
  • \(BF_{21} \in (20,150)\): Strong
  • \(BF_{21} \in (150,\infty)\): Very strong.

Therefore, in our case, with \(BF_{21} \approx 3.67\), the data provide positive evidence against \(M_1\), i.e. in favor of the logit Bernoulli model.
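The scale above can be encoded in a small helper (the function name and the use of cut are just one illustrative way to map a Bayes factor greater than 1 to its verbal category):

```r
# Map a Bayes factor BF21 > 1 to the Kass-Raftery verbal scale
kass.raftery = function(bf) {
  cut(bf, breaks = c(1, 3, 20, 150, Inf),
      labels = c("Not worth more than a bare mention",
                 "Positive", "Strong", "Very strong"))
}
kass.raftery(3.67)   # Positive, matching BF21 above
```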

---
title: "Two Bernoulli models"
author: "Hedibert Freitas Lopes"
date: "8/27/2024"
output:
  html_document:
    toc: true
    toc_depth: 3
    toc_collapsed: true
    code_download: yes
    number_sections: true
---
