Simulating the data

In this exercise, we start by simulating \(n=20\) pairs of observations \((y_i,x_i)\) from the following simple linear regression: \[ y_i \sim N(0.5x_i,1) \qquad \mbox{for} \ i=1,\ldots,n \] with \(x_i \sim N(0,1)\).

set.seed(12345)
n = 20
x = rnorm(n)                # covariates
y = rnorm(n,0.5*x,1)        # responses (true alpha=0, true beta=0.5)
sumx = sum(x)
sumy = sum(y)
par(mfrow=c(1,1))
plot(x,y)
abline(0,0.5,lwd=2)         # true regression line

OLS estimates

Ordinary least squares estimates \({\hat \alpha}\) and \({\hat \beta}\) are obtained via the R function lm:

ols = lm(y~x)
summary(ols)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2195 -0.5951  0.1786  0.8269  2.1814 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.3723     0.2675   1.392   0.1809  
## x             0.9118     0.3276   2.783   0.0123 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.191 on 18 degrees of freedom
## Multiple R-squared:  0.3008, Adjusted R-squared:  0.262 
## F-statistic: 7.745 on 1 and 18 DF,  p-value: 0.01227
par(mfrow=c(1,1))
plot(x,y)
abline(0,0.5,lwd=2)
abline(ols$coef,col=2,lwd=2)
legend("topleft",legend=c("TRUE","OLS"),col=1:2,lty=1,lwd=2)

Bayesian inference

We assume that, a priori, \(\alpha\) and \(\beta\) are independent with \(\alpha \sim N(a_0,A_0)\) and \(\beta \sim U(0,1)\). We will use hyperparameters \(a_0=0\) and \(A_0=10\), so the prior \(p(\alpha,\beta)\) is fairly noninformative.

a0 = 0                  # prior mean of alpha
A0 = 10                 # prior variance of alpha
sdA0 = sqrt(A0)
niA0 = (n+1/A0)^(-1)    # A1: conditional posterior variance of alpha given beta
a0iA0 = a0/A0           # prior precision times prior mean
# Likelihood function
like = function(alpha,beta){
  prod(dnorm(y,alpha+beta*x,1))
}
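
As a quick sanity check (our addition, not part of the original exercise), the likelihood can be evaluated at the true coefficients \((\alpha,\beta)=(0,0.5)\) and at the OLS estimates; the values are tiny since the likelihood is a product of \(n=20\) Gaussian densities.

like(0,0.5)                     # likelihood at the true values
like(ols$coef[1],ols$coef[2])   # likelihood at the OLS estimates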

Computing \(E(\alpha|\mbox{data})\)

Let data\(=\{y_1,\ldots,y_n,x_1,\ldots,x_n\}\), so the goal is to compute \[ I_1 = E(\alpha|\mbox{data}) = \frac{ \int_{-\infty}^\infty\int_0^1 \alpha p(\alpha,\beta)L(\alpha,\beta;\mbox{data})d\beta d\alpha}{\int_{-\infty}^\infty\int_0^1 p(\alpha,\beta)L(\alpha,\beta;\mbox{data})d\beta d\alpha}, \] where the likelihood and prior functions are given by \[ L(\alpha,\beta;\mbox{data}) = \prod_{i=1}^n f_N(y_i;\alpha+\beta x_i,1) \ \ \ \mbox{and} \ \ \ p(\alpha,\beta) = f_N(\alpha;a_0,A_0), \] for \(\alpha\) on the real line and \(\beta\) in \((0,1)\) (the \(U(0,1)\) prior density for \(\beta\) equals 1 on this interval), with \(f_N(y;\mu,\sigma^2)\) the density of a Gaussian distribution with mean \(\mu\) and variance \(\sigma^2\) evaluated at the point \(y\). The above expectation (ratio of integrals) can also be written as \[ I_2 = \int_{-\infty}^{\infty} \alpha p(\alpha|\mbox{data})d\alpha \] or \[ I_3 = E\{E(\alpha|\beta,\mbox{data})|\mbox{data}\} = E\{g(\beta)|\mbox{data}\} = \int_0^1 g(\beta)p(\beta|\mbox{data})d\beta, \] where \(g(\beta)=E(\alpha|\beta,\mbox{data})\). Of course \(I_1\), \(I_2\) and \(I_3\) are theoretically (mathematically) the same. We will approximate \(E(\alpha|\mbox{data})\) via Monte Carlo integration (\(I_1\)), via sampling importance resampling (\(I_2\)) and via Rao-Blackwellization (\(I_3\)).
We will now explain how each of these approximations is obtained.
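
For concreteness, the integrand shared by the numerator and denominator of \(I_1\), the unnormalized posterior \(p(\alpha,\beta)L(\alpha,\beta;\mbox{data})\), can be coded as below. This helper is a sketch of our own (it is not used in the computations that follow) and assumes a0, sdA0 and like from the previous chunks.

# Unnormalized posterior kernel: prior density times likelihood
# (the U(0,1) density for beta equals 1 on the unit interval)
post.kernel = function(alpha,beta){
  dnorm(alpha,a0,sdA0)*dunif(beta,0,1)*like(alpha,beta)
}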

Monte Carlo integration

If \(\{(\alpha,\beta)^{(i)}\}_{i=1}^M\) are iid draws from the joint prior \(p(\alpha,\beta)\), then \(I_1\) can be approximated by \[ \widehat{I}_{mci} = \frac{\frac1M \sum_{i=1}^M \alpha^{(i)}L(\alpha^{(i)},\beta^{(i)};\mbox{data})}{\frac1M \sum_{i=1}^M L(\alpha^{(i)},\beta^{(i)};\mbox{data})}. \]

Sampling importance resampling

If, instead, \(\{(\alpha,\beta)^{(i)}\}_{i=1}^M\) are resampled with weights \(\{w^{(i)}\}_{i=1}^M\), where \(w^{(i)} \propto L(\alpha^{(i)},\beta^{(i)};\mbox{data})\), so that the new set of draws is \(\{(\tilde{\alpha},\tilde{\beta})^{(i)}\}_{i=1}^N\) (with \(N \leq M\)), then \(I_2\) can be approximated by \[ \widehat{I}_{sir} = \frac{1}{N} \sum_{i=1}^N \tilde{\alpha}^{(i)}. \] We have already discussed in class that the variance of \(\widehat{I}_{mci}\) is smaller than or equal to the variance of \(\widehat{I}_{sir}\). We can improve the approximation even further by observing that, in \(I_3\), \(g(\beta)=E(\alpha|\beta,\mbox{data})\) is available in closed form.
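
Before moving to Rao-Blackwellization, here is a minimal sketch of both approximations; it assumes the objects x, y, a0, sdA0 and like defined above, and the value of M is purely illustrative (the full comparison below uses \(M=100\) and \(M=1000\) with replications).

M = 2000
alphas = rnorm(M,a0,sdA0)    # iid draws from the prior of alpha
betas = runif(M)             # iid draws from the prior of beta
likes = sapply(1:M,function(i) like(alphas[i],betas[i]))
I.mci = sum(alphas*likes)/sum(likes)              # Monte Carlo integration
ind = sample(1:M,size=M,prob=likes,replace=TRUE)  # resampling step (here N=M)
I.sir = mean(alphas[ind])                         # SIR approximation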

Rao-Blackwellization

Assuming, momentarily, that \(\beta\) is fixed, let us compute \(E(\alpha|\beta,\mbox{data})\). In this particular case, the regression reduces to the simple problem of learning the mean of the \(z_i=y_i-\beta x_i\), which are iid \(N(\alpha,1)\). Combining this new likelihood with the existing prior for \(\alpha\) leads to the following conditional posterior (as an exercise, prove this result): \[ \alpha|\beta,\mbox{data} \sim N(a_1(\beta),A_1), \] where \[ A_1^{-1}=A_0^{-1}+n \ \ \ \mbox{and} \ \ \ a_1(\beta) = \left(A_1 A_0^{-1}a_0 + A_1 n {\bar y}\right)-\left(A_1 n {\bar x}\right)\beta, \] so \[ E(\alpha|\beta,\mbox{data}) = c - d \beta, \] where \(c=A_1 A_0^{-1}a_0 + A_1 n{\bar y}\) and \(d=A_1 n {\bar x}\). Now, if \({\hat E}(\beta|\mbox{data})\) is an approximation of \(E(\beta|\mbox{data})\) (via MC integration or SIR), then a Rao-Blackwellized approximation to \(I_3\) is \[ \widehat{I}_{rb} = c - d {\hat E}(\beta|\mbox{data}). \]
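
In code, the Rao-Blackwellized estimate needs only the moments above and an approximation of \(E(\beta|\mbox{data})\). A minimal sketch (the names A1, c1 and d1 are ours; c1 is used instead of c to avoid masking R's c()):

A1 = 1/(1/A0+n)           # conditional posterior variance of alpha given beta
c1 = A1*(a0/A0+sum(y))    # c = A1*(a0/A0 + n*ybar)
d1 = A1*sum(x)            # d = A1*n*xbar
# I.rb = c1 - d1*Ebeta.hat, for any approximation Ebeta.hat of E(beta|data)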

Brute force MC integration

Below we run plain Monte Carlo integration based on a prior sample of size \(M\) equal to one million draws. We will treat these approximations as (essentially) exact values for \(E(\alpha|\mbox{data})\) and \(E(\beta|\mbox{data})\).

set.seed(54321)
M = 1000000
alphas = rnorm(M,a0,sdA0)   # draws from the prior of alpha
betas = runif(M)            # draws from the prior of beta
likes = rep(0,M)
for (i in 1:M)
  likes[i] = like(alphas[i],betas[i])
Ealpha.mci0 = mean(alphas*likes)/mean(likes)
Ebeta.mci0 = mean(betas*likes)/mean(likes)

Ealpha.mci0
## [1] 0.3828912
Ebeta.mci0
## [1] 0.7455089

Monte Carlo exercise

In what follows we compare the Monte Carlo error (variances) of four approximations of \(E(\alpha|\mbox{data})\):

  1. \(\widehat{I}_{mci}\): Monte Carlo integration;

  2. \(\widehat{I}_{sir}\): Sampling importance resampling (SIR);

  3. \(\widehat{I}_{rb.mci}\): Rao-Blackwellization with \(E(\beta|\mbox{data})\) via MC integration;

  4. \(\widehat{I}_{rb.sir}\): Rao-Blackwellization with \(E(\beta|\mbox{data})\) via SIR.

As the results below reveal, based on \(100\) replications of the computations, both Rao-Blackwellized approximations, \(\widehat{I}_{rb.mci}\) and \(\widehat{I}_{rb.sir}\), have substantially smaller Monte Carlo variance than the plain MCI-based and SIR-based approximations. The main lesson, therefore, is to always carry out as much analytical derivation as possible before resorting to Monte Carlo approximation.

set.seed(13579)
replication = 100
estimators = array(0,c(2,replication,4))
Ms = c(100,1000)
for (k in 1:2){
  M = Ms[k]
  for (j in 1:replication){
    
    # Computing E(theta|x,y) via Monte Carlo integration based on prior draws
    alphas = rnorm(M,a0,sdA0)
    betas = runif(M)
    likes= rep(0,M)
    for (i in 1:M)
      likes[i] = like(alphas[i],betas[i])

    Ealpha.mci = mean(alphas*likes)/mean(likes)
    Ebeta.mci = mean(betas*likes)/mean(likes)

    # Computing E(theta|x,y) via SIR based on prior draws
    M1 = M
    ind = sample(1:M,size=M1,prob=likes,replace=TRUE)
    alphas1 = alphas[ind]
    betas1 = betas[ind]

    Ealpha.sir = mean(alphas1)
    Ebeta.sir = mean(betas1)

    # Computing E(alpha|x,y) via Rao-Blackwellization: closed-form E(alpha|beta,x,y)
    # evaluated at the approximation of E(beta|x,y)
    Ealpha.rb.mci = niA0*(sumy+a0iA0-sumx*Ebeta.mci)
    Ealpha.rb.sir = niA0*(sumy+a0iA0-sumx*Ebeta.sir)

    estimators[k,j,] = c(Ealpha.mci,Ealpha.sir,Ealpha.rb.mci,Ealpha.rb.sir)
  }
}
par(mfrow=c(1,4))
boxplot.matrix(t(estimators[,,1]),names=Ms,xlab="M",main="MCI",ylab="E(alpha|x,y)",
               outline=FALSE)
abline(h=Ealpha.mci0,col=2)
boxplot.matrix(t(estimators[,,2]),names=Ms,xlab="M",main="SIR",ylab="E(alpha|x,y)",
               outline=FALSE)
abline(h=Ealpha.mci0,col=2)
boxplot.matrix(t(estimators[,,3]),names=Ms,xlab="M",main="RB/MCI",ylab="E(alpha|x,y)",
               outline=FALSE)
abline(h=Ealpha.mci0,col=2)
boxplot.matrix(t(estimators[,,4]),names=Ms,xlab="M",main="RB/SIR",ylab="E(alpha|x,y)",
               outline=FALSE)
abline(h=Ealpha.mci0,col=2)

par(mfrow=c(1,2))
for (k in 1:2){
  boxplot.matrix(estimators[k,,],names=c("MCI","SIR","RB/MCI","RB/SIR"),ylab="E(alpha|x,y)",
  outline=FALSE)
  abline(h=Ealpha.mci0,col=2)
  title(paste("M=",Ms[k],sep=""))
}