1 Boston housing data

Housing data for 506 census tracts of Boston from the 1970 census. The data frame BostonHousing contains the original data by Harrison and Rubinfeld (1978), while BostonHousing2 contains the corrected version with additional spatial information (see references below). The original data are 506 observations on 14 variables, with medv as the target variable.

The \(y\) variable is medv, the median value of owner-occupied homes in USD 1000's, while the \(x\) variables are

  1. rm: average number of rooms per dwelling
  2. lstat: percentage of lower status of the population
  3. crim: per capita crime rate by town
  4. age: proportion of owner-occupied units built prior to 1940
  5. tax: full-value property-tax rate per USD 10,000
  6. ptratio: pupil-teacher ratio by town

We will fit the standard Gaussian linear regression model, where \[ y_i|x_i,\beta,\sigma^2 \sim N(x_i'\beta,\sigma^2), \quad i=1,\ldots,n, \] or, in matrix notation, \[ y = X \beta + \varepsilon, \qquad \varepsilon \sim N(0,\sigma^2 I_n). \]

#install.packages("mlbench")
library("mlbench")
data(BostonHousing2)
attach(BostonHousing2)

y = medv                                   # response: median home value (USD 1000's)
X = cbind(1,rm,lstat,crim,age,tax,ptratio) # design matrix with intercept column
p = ncol(X)                                # number of regression coefficients
n = length(y)                              # number of observations
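
A quick check that the design matrix and the response line up as expected (506 tracts, an intercept plus six covariates):

dim(X)       # should be 506 x 7
summary(y)   # medv, in USD 1000's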

1.1 ML inference from scratch

The likelihood function is \[ L(\beta,\sigma^2|y,X) = (2\pi\sigma^2)^{-n/2}\exp\left\{-0.5(y-X\beta)'(y-X\beta)/\sigma^2\right\}, \] and the maximum likelihood estimates \({\hat \beta}_{MLE}\) and \({\hat \sigma}_{MLE}\) are obtained as \[ ({\hat \beta}_{MLE},{\hat \sigma}_{MLE}) = \mbox{arg}\max_{\beta,\sigma} L(\beta,\sigma^2|y,X), \] and are equal to \[ {\hat \beta}_{MLE} = (X'X)^{-1}X'y \ \ \mbox{and} \ \ {\hat \sigma}^2_{MLE} = (y-X{\hat \beta})'(y-X{\hat \beta})/n. \] These quantities are also computed by the R function lm; note that lm reports the residual standard error \(\sqrt{(y-X{\hat \beta})'(y-X{\hat \beta})/(n-p)}\), where \(p\) is the number of columns of \(X\), so it must be multiplied by \(\sqrt{(n-p)/n}\) to recover \({\hat \sigma}_{MLE}\).

beta.mle = solve(t(X)%*%X)%*%t(X)%*%y # (X'X)^{-1}X'y
y.hat    = X%*%beta.mle               # fitted values
e.hat    = y-y.hat                    # residuals
sig.mle  = sqrt(mean(e.hat^2))        # MLE of sigma (divides RSS by n)
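
As an aside, forming and inverting \(X'X\) explicitly can be numerically fragile when the columns of \(X\) are nearly collinear; a QR-based solve returns the same least-squares estimate more stably (a minimal sketch):

beta.qr = qr.solve(X,y)     # least squares via the QR decomposition of X
max(abs(beta.qr-beta.mle))  # should be numerically negligible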

table = round(rbind(cbind(beta.mle,lm(y~X-1)$coef),
              c(sig.mle,sqrt((n-p)/n)*summary(lm(y~X-1))$sigma)),4)
colnames(table)      = c("Our function","lm function")
rownames(table)[1]   = "constant"
rownames(table)[p+1] = "sigma"
table
##          Our function lm function
## constant      17.2841     17.2841
## rm             4.4755      4.4755
## lstat         -0.5764     -0.5764
## crim          -0.0578     -0.0578
## age            0.0207      0.0207
## tax           -0.0017     -0.0017
## ptratio       -0.8724     -0.8724
## sigma          5.1659      5.1659
summary(lm(y~X-1))
## 
## Call:
## lm(formula = y ~ X - 1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3896  -3.2349  -0.7797   1.8441  29.3208 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## X        17.284103   3.999730   4.321 1.87e-05 ***
## Xrm       4.475544   0.436292  10.258  < 2e-16 ***
## Xlstat   -0.576396   0.054278 -10.619  < 2e-16 ***
## Xcrim    -0.057824   0.033905  -1.705   0.0887 .  
## Xage      0.020686   0.010905   1.897   0.0584 .  
## Xtax     -0.001699   0.001983  -0.857   0.3920    
## Xptratio -0.872446   0.124820  -6.990 8.87e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.202 on 499 degrees of freedom
## Multiple R-squared:  0.9549, Adjusted R-squared:  0.9543 
## F-statistic:  1510 on 7 and 499 DF,  p-value: < 2.2e-16

1.2 Bayesian inference from scratch

We need to specify the prior for \((\beta,\sigma^2)\): \[ \beta \sim N(b,V) \ \ \mbox{and} \ \ \sigma^2 \sim IG(c,d). \] The full posterior distribution \[ p(\beta,\sigma^2|y,X) \propto L(\beta,\sigma^2|y,X)p(\beta)p(\sigma^2), \] is not available in closed form, so below we devise a simple Gibbs sampler that alternates between the two full conditional distributions. Completing the square in \(\beta\) gives \[ (\beta|\sigma^2,X,y) \sim N(b_1,V_1), \] where \(V_1^{-1}=V^{-1}+X'X/\sigma^2\) and \(V_1^{-1}b_1=V^{-1}b+X'y/\sigma^2\) are both functions of \(\sigma^2\), while conjugacy of the inverse-gamma prior gives \[ (\sigma^2|\beta,X,y) \sim IG(c_1,d_1), \] where \(c_1=c+n/2\) and \(d_1=d+(y-X\beta)'(y-X\beta)/2\), which is a function of \(\beta\).

# Prior hyperparameters
b = rep(0,p)    # prior mean of beta
V = diag(100,p) # prior covariance of beta
c = 5/2         # IG shape
d = 5/2         # IG rate

# Gibbs sampler
sigma2 = 1          # initial value of sigma^2
M0     = 1000       # burn-in draws
M      = 1000       # draws to keep
thin   = 1          # thinning interval
niter  = M0+M*thin
betas  = matrix(0,niter,p)
sigs   = rep(0,niter)
for (iter in 1:niter){
  # draw beta | sigma2, y ~ N(b1,V1)
  V1 = solve(t(X)%*%X/sigma2+solve(V))
  b1 = V1%*%(t(X)%*%y/sigma2+solve(V)%*%b)
  beta = b1 + t(chol(V1))%*%rnorm(p)
  # draw sigma2 | beta, y ~ IG(c1,d1)
  c1 = c+n/2
  d1 = d + sum((y-X%*%beta)^2)/2
  sigma2 = 1/rgamma(1,c1,d1)
  # store draws (sigs holds sigma, not sigma^2)
  betas[iter,] = beta
  sigs[iter] = sqrt(sigma2)
}
ind   = seq(M0+1,niter,by=thin) # discard burn-in, then thin
betas = betas[ind,]
sigs  = sigs[ind]

mean.beta = apply(betas,2,mean)
table = round(cbind(beta.mle,mean.beta),4)
colnames(table) = c("ML","Bayes")
rownames(table)[1] = "Constant"
table
##               ML   Bayes
## Constant 17.2841 14.8783
## rm        4.4755  4.7000
## lstat    -0.5764 -0.5644
## crim     -0.0578 -0.0596
## age       0.0207  0.0202
## tax      -0.0017 -0.0018
## ptratio  -0.8724 -0.8238
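
Posterior means alone hide the uncertainty carried by the draws; the same output yields equal-tail credible intervals directly. A minimal sketch:

ci = t(apply(betas,2,quantile,probs=c(0.025,0.5,0.975))) # 95% intervals and medians
rownames(ci) = rownames(table)
round(ci,4)
quantile(sigs,c(0.025,0.5,0.975))                        # interval for sigma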

1.2.1 Graphical output

Trace plots are commonly used to informally check the mixing of the draws in an MCMC scheme.

par(mfrow=c(2,4))
for (i in 1:p)
ts.plot(betas[,i],ylab="",main=paste("beta(",i-1,")",sep=""))
ts.plot(sigs,ylab="",main="sigma")

The autocorrelation functions are used to assess how correlated successive draws of each chain are; high autocorrelation means fewer effectively independent draws.

par(mfrow=c(2,4))
for (i in 1:p)
acf(betas[,i],main=paste("beta(",i-1,")",sep=""))
acf(sigs,ylab="",main="sigma")
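
A numerical companion to the ACF plots is the effective sample size of each chain, computed here with the coda package (assuming it is installed):

library("coda")
effectiveSize(as.mcmc(betas)) # one value per regression coefficient
effectiveSize(as.mcmc(sigs))  # effective sample size for sigma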

Below are Monte Carlo approximations to the marginal posterior distributions.

par(mfrow=c(2,4))
for (i in 1:p){
  hist(betas[,i],xlab="",prob=TRUE,main=paste("beta(",i-1,")",sep=""))
  abline(v=0,col=2,lwd=2)
}
hist(sigs,xlab="",prob=TRUE,main="sigma")
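
The draws also make posterior predictive simulation immediate: for each draw \((\beta,\sigma)\), simulate \(y_{new} \sim N(x_{new}'\beta,\sigma^2)\). A minimal sketch, where the covariate values in x.new are made up purely for illustration:

x.new  = c(1,6.5,10,0.2,50,300,18)                # hypothetical tract (intercept first)
y.pred = betas%*%x.new + sigs*rnorm(length(sigs)) # one predictive draw per posterior draw
quantile(y.pred,c(0.025,0.5,0.975))               # 95% predictive interval for medv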

1.3 Inference via R package bayeslm

Our package implements an efficient sampler for Gaussian Bayesian linear regression. Instead of a regular Gibbs sampler, it uses an elliptical slice sampler. The function has several built-in priors, and the user can also supply their own prior (written as an R function). It is based on our paper Efficient sampling for Gaussian linear regression with arbitrary priors, Journal of Computational and Graphical Statistics, 2019, 28, 142-154, written with Richard Hahn and Jingyu He. You can download the paper here.

#install.packages("bayeslm")
library("bayeslm")
X = cbind(rm,lstat,crim,age,tax,ptratio) # no intercept column; icept=TRUE adds it
bayesfit = bayeslm(y,X,icept=TRUE,prior="ridge")
## ridge prior 
## fixed running time 0.00212838
## sampling time 0.476943
summary(bayesfit)
## Average number of rejections before one acceptance : 
## 0.75195 
## Summary of beta draws 
##    based on 16000 valid draws (number of burn in is 4000) 
## Summary of Posterior draws 
##  Moments 
##              mean std dev eff sample size    
## intercept 17.2779   4.032            4938 ***
## rm         4.4592   0.442            3825 ***
## lstat     -0.5610   0.057            1696 ***
## crim      -0.0590   0.034            6832   *
## age        0.0183   0.011            2814   .
## tax       -0.0018   0.002            5604    
## ptratio   -0.8647   0.127            7787 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Summary of sigma draws 
## Mean of standard deviation is  5.325903 
## S.d. of standard deviation samples is  0.1680139 
## Effective sample size of s.d. is  14283.68
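
The sigma draws from the two samplers can be compared as well; this assumes bayeslm stores them as bayesfit$sigma, in line with the summary above:

round(c(gibbs=mean(sigs),bayeslm=mean(bayesfit$sigma)),4)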
mean.beta1 = apply(bayesfit$beta,2,mean)
table = round(cbind(mean.beta,mean.beta1),4)
colnames(table) = c("Bayes (ours)","bayesml")
rownames(table)[1] = "Constant"
table
##          Bayes (ours) bayeslm
## Constant      14.8783 17.1918
## rm             4.7000  4.4687
## lstat         -0.5644 -0.5609
## crim          -0.0596 -0.0589
## age            0.0202  0.0184
## tax           -0.0018 -0.0019
## ptratio       -0.8238 -0.8635
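
To illustrate the built-in priors mentioned above, the same model can be refit under the horseshoe prior (the package default) as a quick sensitivity check; results will vary slightly across runs:

bayesfit.hs = bayeslm(y,X,icept=TRUE,prior="horseshoe")
mean.beta2  = apply(bayesfit.hs$beta,2,mean)
round(cbind(ridge=mean.beta1,horseshoe=mean.beta2),4)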