Introduction

In this exercise, we perform approximate Bayesian model selection (prior predictive comparisons) via Monte Carlo integration from an auxiliary/proposal distribution. The models we entertain for the observations, \(\mbox{data}=\{y_1,\ldots,y_n\}\), are the iid Gaussian model (\(M_1\)), the iid Student's \(t_\nu\) model with known degrees of freedom \(\nu\) (\(M_2\)), and a model (\(M_3\)) that splits the data into two parts: \(M_1\) for observations \(t=1,\ldots,k\) and \(M_2\) for observations \(t=k+1,\ldots,n\). Both \(\nu\) and \(k\) are treated as indexing parameters, and the prior predictive evaluations are conditional on one or both of these quantities.

Before we proceed, let us take a quick overview of the Student’s \(t\) distribution.

A little bit about the Student’s t distribution

A random variable \(y\) that follows a Student's \(t\) distribution with location \(\mu\), scale \(\sigma^2\) and \(\nu\) degrees of freedom, denoted by \(t_\nu(\mu,\sigma^2)\), has mean and variance given by \[\begin{eqnarray*} E(y|\nu) &=& \mu, \qquad \mbox{for} \ \nu>1,\\ V(y|\nu) &=& \frac{\nu}{\nu-2}\sigma^2, \qquad \mbox{for} \ \nu>2. \end{eqnarray*}\] In what follows we fix \(\mu=0\) and let \(\sigma^2=(\nu-2)/\nu\), so the Student's \(t\) models we consider have mean zero and variance one for \(\nu>2\).
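As a quick numerical check of this choice of scale (a small sketch; the helper name dst is ours, purely for illustration), we can verify that the scaled \(t_\nu\) density has variance one for a few values of \(\nu\):

# Illustrative helper: density of the scaled t_nu(0,(nu-2)/nu), which has variance 1 for nu > 2
dst = function(y, nu){
  s = sqrt((nu-2)/nu)
  dt(y/s, df=nu)/s
}
# Numerical check that the variance equals 1
for (nu in c(3,5,30))
  print(c(nu = nu, var = integrate(function(y) y^2*dst(y,nu), -Inf, Inf)$value))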

library("mvtnorm")

mus = seq(-5,5,length=1000)
par(mfrow=c(1,2))
# Standardized t densities (variance 1) for nu = 3,...,8
plot(mus,dt(mus/sqrt(1/3),df=3)/sqrt(1/3),type="l",ylim=c(0,0.7),lwd=2,
     xlab=expression(mu),ylab="Density")
for (i in 4:8) 
  lines(mus,dt(mus/sqrt((i-2)/i),df=i)/sqrt((i-2)/i),col=i,lwd=2)

# Zoom in on the left tail of the same densities
plot(mus,dt(mus/sqrt(1/3),df=3)/sqrt(1/3),type="l",ylim=c(0,0.01),xlim=c(-5,-3),lwd=2,
     xlab=expression(mu),ylab="Density")
for (i in 4:8) 
  lines(mus,dt(mus/sqrt((i-2)/i),df=i)/sqrt((i-2)/i),col=i,lwd=2)
legend("topleft",legend=3:8,col=c(1,4:8),lwd=2,bty="n")

Tail probabilities

Notice that for \(\nu=100\) the standardized \(t_{100}\) and the \(N(0,1)\) are virtually indistinguishable. If iid observations were collected daily, it would take, on average, about two years to observe an event as extreme as \(-5\) under the \(t_3\) (heavy tail!), while under the virtually Gaussian \(t_{100}\) it would take, on average, almost three millennia! What is the lesson? Ignoring heavy-tail behavior can be extremely costly.

nu   = c(3:10,30,100)
# P(y < -5) under the standardized t_nu, and the implied average waiting time
# (in days and years) for one such event when observations arrive daily
prob = pt(-5/sqrt((nu-2)/nu),df=nu)
tab  = cbind(100*round(prob,6),trunc(1/prob),round(1/prob/365))
colnames(tab) = c("%<(-5)","days","years")
rownames(tab) = nu
tab
##     %<(-5)    days years
## 3   0.1620     617     2
## 4   0.1055     947     3
## 5   0.0664    1506     4
## 6   0.0433    2308     6
## 7   0.0295    3390     9
## 8   0.0209    4789    13
## 9   0.0153    6537    18
## 10  0.0115    8664    24
## 30  0.0007  140925   386
## 100 0.0001 1008085  2762

Simulating some data

set.seed(13579)
n    = 500
k    = 200
mu   = 0
sig  = 1
nu   = 3
tau  = sqrt((nu-2)/nu)*sig
# First k observations are N(mu,sig^2); the remaining n-k are scaled t_nu with the same variance
y    = c(mu+sig*rnorm(k),mu+tau*rt(n-k,df=nu))

par(mfrow=c(1,1))
ts.plot(y)
abline(v=k,col=6)
abline(h=mu+sig,lty=2,col=2)
abline(h=mu+2*sig,lty=2,col=3)
abline(h=mu+3*sig,lty=2,col=4)
abline(h=mu,col=2)
abline(h=mu-sig,lty=2,col=2)
abline(h=mu-2*sig,lty=2,col=3)
abline(h=mu-3*sig,lty=2,col=4)
legend("topleft",legend=
c(paste("% out 1sd =",mean(abs(y[1:k])>(mu+sig))),
paste("% out 2sd =",mean(abs(y[1:k])>(mu+2*sig))),
paste("% out 3sd =",mean(abs(y[1:k])>(mu+3*sig)))))

legend("topright",legend=
c(paste("% out 1sd =",round(mean(abs(y[(k+1):n])>(mu+sig)),3)),
paste("% out 2sd =",round(mean(abs(y[(k+1):n])>(mu+2*sig)),3)),
paste("% out 3sd =",round(mean(abs(y[(k+1):n])>(mu+3*sig)),3))))

MC proposal density

In models \(M_i\), for \(i=1,2,3\), the only unknown quantity is the location, which we denote by \(\theta\). The maximum likelihood estimator of \(\theta\) under the Gaussian model is simply the sample mean, i.e. \({\hat \theta}_{MLE}={\bar y}=\sum_{i=1}^n y_i/n\), which, under \(M_1\) and conditional on \(\theta\), follows a Gaussian distribution centered at \(\theta\) with variance \(\sigma^2/n\).

Therefore, for all models considered below, we take the proposal (or importance) density \(q(\theta)\) to be the \(N({\bar y},\sigma^2/n)\) density.
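As a quick sanity check of this sampling-distribution argument (a small sketch, simulating at \(\theta=0\)), the standard deviation of simulated sample means should be close to \(\sigma/\sqrt{n}\):

# Sanity check (illustration only): under M1, ybar | theta ~ N(theta, sig^2/n)
set.seed(1)
ybars = replicate(2000, mean(rnorm(n, 0, sig)))   # theta = 0 for illustration
c(sd.simulated = sd(ybars), sd.theory = sig/sqrt(n))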

set.seed(54321)
ybar    = mean(y)
se      = sqrt(sig^2/n)
M       = 20000
theta.d = rnorm(M,ybar,se)
hist(theta.d,prob=T,xlab=expression(theta),ylab="Density",main="Draws from importance density")

Model 1: \(N(\theta,\sigma^2)\)

We will assume here that the \(y_i\) are iid \(N(\theta,\sigma^2)\) for known \(\sigma^2\). The prior for \(\theta\) is also Gaussian, i.e. \(\theta \sim N(b_0,B_0)\) for known hyperparameters \(b_0\) and \(B_0\). Below we compute the exact value of the prior predictive, for \(y_{1:n}=(y_1,\ldots,y_n)'\), \[ p(y_{1:n}|M_1) = \int \prod_{i=1}^n p_N(y_i|\theta,\sigma^2)p_N(\theta|b_0,B_0)d\theta, \] as well as its MC approximation, based on independent draws \(\{\theta^{(1)},\ldots,\theta^{(M)}\}\) from \(q(\theta)\): \[ {\hat p}_{MC}(y_{1:n}|M_1) = \frac{1}{M}\sum_{j=1}^M \frac{\prod_{i=1}^n p_N(y_i|\theta^{(j)},\sigma^2)p_N(\theta^{(j)}|b_0,B_0)}{q(\theta^{(j)})}, \] which is based on the identity \[ p(y_{1:n}|M_1) = \int \frac{\prod_{i=1}^n p_N(y_i|\theta,\sigma^2)p_N(\theta|b_0,B_0)}{q(\theta)} q(\theta)d\theta. \]

It is easy to verify that \(y_{1:n}|M_1 \sim N_n(b_0 1_n,B_01_n1_n' + \sigma^2 I_n)\), an \(n\)-variate Gaussian distribution. Here \(1_n\) is an \(n\)-dimensional column vector of ones and \(I_n\) is the identity matrix of order \(n\).

b0 = 0
B0 = 9
# Importance-sampling weights: likelihood x prior / proposal, evaluated at each draw
w  = rep(0,M)
for (i in 1:M)
  w[i] = prod(dnorm(y,theta.d[i],sig))*dnorm(theta.d[i],b0,sqrt(B0))/dnorm(theta.d[i],ybar,se)
logpred.M1.sir = log(mean(w))

# Exact and MC-approximation of the prior predictive under model 1
V0 = matrix(B0,n,n)+diag(sig^2,n)
logpred.M1 = dmvnorm(y,rep(b0,n),V0,log=TRUE)
c(logpred.M1,logpred.M1.sir)
## [1] -660.5006 -660.5006
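The exact value and the MC approximation agree closely. As a rough measure of Monte Carlo accuracy (a sketch using the weights w and the sample size M already computed), the numerical standard error of \({\hat p}_{MC}\) is \(\mbox{sd}(w)/\sqrt{M}\), and by the delta method the standard error of \(\log {\hat p}_{MC}\) is approximately \(\mbox{sd}(w)/(\sqrt{M}\,\mbox{mean}(w))\):

# Approximate numerical (Monte Carlo) standard error of the log prior predictive,
# via the delta method: se(log phat) ~= sd(w)/(sqrt(M)*mean(w))
se.logpred.M1 = sd(w)/(sqrt(M)*mean(w))
se.logpred.M1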

Model 2: \(t_\nu(\theta,\frac{\nu-2}{\nu}\sigma^2)\)
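Analogously to model \(M_1\), model \(M_2\) assumes the \(y_i\) are iid \(t_\nu(\theta,\frac{\nu-2}{\nu}\sigma^2)\), with the same prior \(\theta \sim N(b_0,B_0)\). Conditional on \(\nu\), the prior predictive \[ p(y_{1:n}|M_2,\nu) = \int \prod_{i=1}^n p_t\left(y_i\left|\theta,\tfrac{\nu-2}{\nu}\sigma^2,\nu\right.\right)p_N(\theta|b_0,B_0)d\theta \] has no closed form, so below it is approximated by the same importance-sampling scheme, for each value of \(\nu\) on a grid.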

# Grid of degrees-of-freedom values
nus = c(3:30,seq(40,100,by=10),seq(200,1000,by=100))
nnu = length(nus)
logpred.M2.sir = rep(0,nnu)
for (j in 1:nnu){
  nu   = nus[j]
  sig1 = sqrt((nu-2)/nu*sig^2)   # scale so that the t_nu likelihood has variance sig^2
  for (i in 1:M)
    w[i] = prod(dt((y-theta.d[i])/sig1,df=nu)/sig1)*
           dnorm(theta.d[i],b0,sqrt(B0))/dnorm(theta.d[i],ybar,se)
  logpred.M2.sir[j] = log(mean(w))
}
cbind(nus,logpred.M2.sir)
##        nus logpred.M2.sir
##  [1,]    3      -659.4134
##  [2,]    4      -648.6910
##  [3,]    5      -647.6866
##  [4,]    6      -648.1098
##  [5,]    7      -648.7964
##  [6,]    8      -649.4940
##  [7,]    9      -650.1410
##  [8,]   10      -650.7257
##  [9,]   11      -651.2507
## [10,]   12      -651.7224
## [11,]   13      -652.1474
## [12,]   14      -652.5321
## [13,]   15      -652.8817
## [14,]   16      -653.2008
## [15,]   17      -653.4933
## [16,]   18      -653.7624
## [17,]   19      -654.0109
## [18,]   20      -654.2411
## [19,]   21      -654.4550
## [20,]   22      -654.6542
## [21,]   23      -654.8404
## [22,]   24      -655.0147
## [23,]   25      -655.1783
## [24,]   26      -655.3322
## [25,]   27      -655.4772
## [26,]   28      -655.6141
## [27,]   29      -655.7436
## [28,]   30      -655.8662
## [29,]   40      -656.8120
## [30,]   50      -657.4340
## [31,]   60      -657.8752
## [32,]   70      -658.2048
## [33,]   80      -658.4605
## [34,]   90      -658.6648
## [35,]  100      -658.8318
## [36,]  200      -659.6260
## [37,]  300      -659.9078
## [38,]  400      -660.0523
## [39,]  500      -660.1401
## [40,]  600      -660.1992
## [41,]  700      -660.2416
## [42,]  800      -660.2735
## [43,]  900      -660.2985
## [44,] 1000      -660.3185

Graphically:

par(mfrow=c(1,1))
plot(log10(nus),logpred.M2.sir,xlab=expression(log10(nu)),
     ylab="Log predictive",type="b",pch=16)
abline(h=logpred.M1.sir,col=2,lwd=2)
legend("topright",legend=c("Gaussian","Student's t(nu)"),col=2:1,pch=16,bty="n")
title(paste("'Best' nu =",nus[logpred.M2.sir==max(logpred.M2.sir)],sep=""))
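As a concrete summary (a small sketch using the quantities already computed above), the log Bayes factor of the best-fitting Student's \(t\) model against the Gaussian model \(M_1\) follows directly from the log predictives:

# Log Bayes factor of the best t_nu model (M2) versus the Gaussian model (M1)
best  = which.max(logpred.M2.sir)
logBF = logpred.M2.sir[best] - logpred.M1
c(best.nu = nus[best], logBF = logBF, BF = exp(logBF))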

Model 3: Gaussian/Student’s \(t\)

In this model we assume that, independently and with a common location \(\theta\), \[ y_1,\ldots,y_k \sim N(\theta,\sigma^2) \ \ \ \mbox{and} \ \ \ y_{k+1},\ldots,y_n \sim t_\nu(\theta,\frac{\nu-2}{\nu}\sigma^2), \] where \(\sigma^2\), \(\nu\) and \(k\) are known.
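Conditional on \((k,\nu)\), the prior predictive of \(M_3\) is again approximated by importance sampling from \(q(\theta)\). Because the product of \(n=500\) densities can underflow in double precision, the weights below are computed on the log scale and recentered by their maximum before exponentiating (the log-sum-exp trick): \[ \log\left(\frac{1}{M}\sum_{j=1}^M w_j\right) = A + \log\left(\frac{1}{M}\sum_{j=1}^M e^{\log w_j - A}\right), \qquad A=\max_j \log w_j. \]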

nus = 3:10
ks  = seq(180,215,by=5)
nnu = length(nus)
nk  = length(ks)
logpred.M3.sir = matrix(0,nk,nnu)
for (l in 1:nk){
  k = ks[l]
  for (j in 1:nnu){
    nu   = nus[j]
    sig1 = sqrt((nu-2)/nu*sig^2)
    # Log importance-sampling weights: Gaussian log-likelihood (obs 1:k) +
    # Student's t log-likelihood (obs (k+1):n) + log prior - log proposal
    for (i in 1:M)
      w[i] = sum(dnorm(y[1:k],theta.d[i],sig,log=TRUE))+
             sum(dt((y[(k+1):n]-theta.d[i])/sig1,df=nu,log=TRUE))-(n-k)*log(sig1)+
             dnorm(theta.d[i],b0,sqrt(B0),log=TRUE)-
             dnorm(theta.d[i],ybar,se,log=TRUE)
    # Log-sum-exp: recenter by the maximum to avoid underflow, then average
    A = max(w)
    logpred.M3.sir[l,j] = log(mean(exp(w-A)))+A
  }
}

Here are the MC-based approximations of the log prior predictives of model \(M_3\) for various combinations of \((k,\nu)\).

tab = logpred.M3.sir
rownames(tab)=ks
colnames(tab)=nus
round(tab,1)
##          3      4      5      6      7      8      9     10
## 180 -379.6 -380.2 -382.9 -385.1 -386.9 -388.2 -389.4 -390.3
## 185 -373.4 -374.2 -377.0 -379.2 -380.9 -382.3 -383.4 -384.4
## 190 -365.2 -366.7 -369.6 -372.0 -373.8 -375.2 -376.4 -377.4
## 195 -360.7 -361.6 -364.3 -366.5 -368.2 -369.5 -370.6 -371.5
## 200 -355.5 -356.8 -359.6 -361.8 -363.6 -364.9 -366.1 -367.0
## 205 -351.6 -352.4 -355.0 -357.2 -358.8 -360.1 -361.2 -362.1
## 210 -344.1 -344.8 -347.3 -349.4 -351.0 -352.3 -353.4 -354.2
## 215 -336.9 -337.5 -339.9 -341.9 -343.5 -344.8 -345.8 -346.7

The normalized prior predictives, i.e. the posterior probabilities of each pair \((k,\nu)\) under a discrete uniform prior over the pairs, are shown below.

# Normalize on a recentered scale to avoid numerical underflow
p   = exp(tab - max(tab))
tab = round(p/sum(p),4)
rownames(tab)=ks
colnames(tab)=nus
tab
##          3      4      5      6     7     8     9 10
## 180 0.0000 0.0000 0.0000 0.0000 0e+00 0e+00 0e+00  0
## 185 0.0000 0.0000 0.0000 0.0000 0e+00 0e+00 0e+00  0
## 190 0.0000 0.0000 0.0000 0.0000 0e+00 0e+00 0e+00  0
## 195 0.0000 0.0000 0.0000 0.0000 0e+00 0e+00 0e+00  0
## 200 0.0000 0.0000 0.0000 0.0000 0e+00 0e+00 0e+00  0
## 205 0.0000 0.0000 0.0000 0.0000 0e+00 0e+00 0e+00  0
## 210 0.0004 0.0002 0.0000 0.0000 0e+00 0e+00 0e+00  0
## 215 0.6152 0.3479 0.0308 0.0042 9e-04 2e-04 1e-04  0
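
A quick way to summarize this table (a small sketch using the normalized matrix tab above) is to compute the implied marginal posterior probabilities of \(k\) and of \(\nu\):

# Marginal posterior probabilities of the break point k (rows) and of nu (columns)
round(rowSums(tab),4)
round(colSums(tab),4)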