In this exercise, we will perform approximate Bayesian model selection (prior predictive comparisons) via Monte Carlo integration from an auxiliary/proposal distribution. The models we will entertain for the observations, \(\mbox{data}=\{y_1,\ldots,y_n\}\), are the iid Gaussian model (\(M_1\)), the iid Student's \(t_\nu\) model with known degrees of freedom \(\nu\) (\(M_2\)), and a model (\(M_3\)) that splits the data into two submodels: \(M_1\) for observations \(t=1,\ldots,k\) and \(M_2\) for observations \(t=k+1,\ldots,n\). Both \(\nu\) and \(k\) will be treated as indexing parameters, and the prior predictive evaluations are conditional on either or both of these quantities.
Before we proceed, let us take a quick overview of the Student’s \(t\) distribution.
A random variable \(y\) that follows a Student's \(t\) distribution with location \(\mu\), scale \(\sigma^2\) and \(\nu\) degrees of freedom, denoted by \(t_\nu(\mu,\sigma^2)\), has mean and variance given by \[\begin{eqnarray*} E(y|\nu) &=& \mu, \qquad \mbox{for} \ \nu>1,\\ V(y|\nu) &=& \frac{\nu}{\nu-2}\sigma^2, \qquad \mbox{for} \ \nu>2. \end{eqnarray*}\] In what follows we will fix \(\mu=0\) and let \(\sigma^2=(\nu-2)/\nu\), so the Student's \(t\) models we consider have mean and variance equal to zero and one, respectively, for \(\nu>2\).
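As a quick sanity check of this standardization (a small sketch; the seed, the choice \(\nu=5\) and the sample size below are arbitrary), we can simulate from the scaled Student's \(t\) and verify that its first two moments are close to zero and one:
# Hedged check: draws from sqrt((nu-2)/nu)*t_nu should have mean ~ 0 and variance ~ 1
set.seed(1)
nu.check = 5
x.check = sqrt((nu.check-2)/nu.check)*rt(100000,df=nu.check)
round(c(mean(x.check),var(x.check)),3)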
library("mvtnorm")
mus = seq(-5,5,length=1000)
par(mfrow=c(1,2))
plot(mus,dt(mus/sqrt(1/3),df=3)/sqrt(1/3),type="l",ylim=c(0,0.7),lwd=2,
xlab=expression(mu),ylab="Density")
for (i in 4:8)
lines(mus,dt(mus/sqrt((i-2)/i),df=i)/sqrt((i-2)/i),col=i,lwd=2)
plot(mus,dt(mus/sqrt(1/3),df=3)/sqrt(1/3),type="l",ylim=c(0,0.01),xlim=c(-5,-3),lwd=2,
xlab=expression(mu),ylab="Log density")
for (i in 4:8)
lines(mus,dt(mus/sqrt((i-2)/i),df=i)/sqrt((i-2)/i),col=i,lwd=2)
legend("topleft",legend=3:8,col=c(2,4:8),lwd=2,bty="n")
Notice that when \(\nu=100\), \(t_{100}(0,1)\) and the \(N(0,1)\) are virtually indistinguishable. If iid observations were collected daily, it would take, on average, about two years to observe an event as extreme as \(-5\) under the \(t_3\) (heavy tail!), while under the essentially Gaussian \(t_{100}\) it would take, on average, almost three millennia. What is the lesson? Well, ignoring heavy-tailed behavior can be extremely costly.
# Probability that a standardized t_nu observation falls below -5, and the implied
# expected waiting time (in days and years) when one iid observation arrives per day
nu = c(3:10,30,100)
prob = pt(-5/sqrt((nu-2)/nu),df=nu)
tab = cbind(100*round(prob,6),trunc(1/prob),round(1/prob/365))
colnames(tab) = c("%<(-5)","days","years")
rownames(tab) = nu
tab
## %<(-5) days years
## 3 0.1620 617 2
## 4 0.1055 947 3
## 5 0.0664 1506 4
## 6 0.0433 2308 6
## 7 0.0295 3390 9
## 8 0.0209 4789 13
## 9 0.0153 6537 18
## 10 0.0115 8664 24
## 30 0.0007 140925 386
## 100 0.0001 1008085 2762
To illustrate, we simulate \(n=500\) observations from model \(M_3\): the first \(k=200\) come from the standard Gaussian and the remaining \(300\) from the standardized Student's \(t_3\) (both with mean zero and variance one).
set.seed(13579)
# Simulate n observations: the first k are N(mu, sig^2), the remaining n-k are
# Student's t_nu rescaled to have the same mean and variance
n = 500
k = 200
mu = 0
sig = 1
nu = 3
tau = sqrt((nu-2)/nu)*sig
y = c(mu+sig*rnorm(k),mu+tau*rt(n-k,df=nu))
par(mfrow=c(1,1))
# Time series plot of the simulated data, with the breakpoint k (vertical line)
# and horizontal reference lines at mu and mu +/- 1, 2, 3 standard deviations
ts.plot(y)
abline(v=k,col=6)
abline(h=mu+sig,lty=2,col=2)
abline(h=mu+2*sig,lty=2,col=3)
abline(h=mu+3*sig,lty=2,col=4)
abline(h=mu,col=2)
abline(h=mu-sig,lty=2,col=2)
abline(h=mu-2*sig,lty=2,col=3)
abline(h=mu-3*sig,lty=2,col=4)
# Fraction of observations beyond 1, 2 and 3 sd in the Gaussian segment (top left)
# and in the Student's t segment (top right)
legend("topleft",legend=
  c(paste("% out 1sd =",round(mean(abs(y[1:k])>(mu+sig)),3)),
    paste("% out 2sd =",round(mean(abs(y[1:k])>(mu+2*sig)),3)),
    paste("% out 3sd =",round(mean(abs(y[1:k])>(mu+3*sig)),3))))
legend("topright",legend=
  c(paste("% out 1sd =",round(mean(abs(y[(k+1):n])>(mu+sig)),3)),
    paste("% out 2sd =",round(mean(abs(y[(k+1):n])>(mu+2*sig)),3)),
    paste("% out 3sd =",round(mean(abs(y[(k+1):n])>(mu+3*sig)),3))))
In models \(M_i\), for \(i=1,2,3\), the only unknown quantity is the location, which we denote here by \(\theta\). The maximum likelihood estimator of \(\theta\) under the Gaussian model is simply the sample mean, i.e. \({\hat \theta}_{MLE}={\bar y}=\sum_{i=1}^n y_i/n\), which, under \(M_1\) and conditional on \(\theta\), follows a Gaussian distribution centered at \(\theta\) with variance \(\sigma^2/n\).
Therefore, we take as our proposal (or importance) density \(q(\theta)\) the \(N({\bar y},\sigma^2/n)\) density, for all models considered below.
set.seed(54321)
ybar = mean(y)
se = sqrt(sig^2/n)
# M independent draws from the importance density q(theta) = N(ybar, se^2)
M = 20000
theta.d = rnorm(M,ybar,se)
hist(theta.d,prob=T,xlab=expression(theta),ylab="Density",main="Draws from importance density")
We will assume here that the \(y_i\) are iid \(N(\theta,\sigma^2)\) for known \(\sigma^2\). The prior for \(\theta\) will also be Gaussian, i.e. \(\theta \sim N(b_0,B_0)\) for known hyperparameters \(b_0\) and \(B_0\). Below we compute the exact value of the prior predictive, for \(y_{1:n}=(y_1,\ldots,y_n)'\), \[ p(y_{1:n}|M_1) = \int \prod_{i=1}^n p_N(y_i|\theta,\sigma^2)p_N(\theta|b_0,B_0)d\theta \] as well as its MC approximation, based on independent draws \(\{\theta^{(1)},\ldots,\theta^{(M)}\}\) from \(q(\theta)\): \[ {\hat p}_{MC}(y_{1:n}|M_1) = \frac{1}{M}\sum_{j=1}^M \frac{\prod_{i=1}^n p_N(y_i|\theta^{(j)},\sigma^2)p_N(\theta^{(j)}|b_0,B_0)}{q(\theta^{(j)})}, \] which is based on the identity \[ p(y_{1:n}|M_1) = \int \frac{\prod_{i=1}^n p_N(y_i|\theta,\sigma^2)p_N(\theta|b_0,B_0)}{q(\theta)} q(\theta)d\theta. \]
It is easy to verify that \(y_{1:n}|M_1 \sim N_n(b_0 1_n,B_01_n1_n' + \sigma^2 I_n)\), an \(n\)-variate Gaussian distribution: writing \(y_i = \theta + \varepsilon_i\) with \(\theta \sim N(b_0,B_0)\) and \(\varepsilon_i \stackrel{iid}{\sim} N(0,\sigma^2)\) independent of \(\theta\) gives \(E(y_i)=b_0\), \(V(y_i)=B_0+\sigma^2\) and \(Cov(y_i,y_j)=B_0\) for \(i\neq j\). Here \(1_n\) is an \(n\)-dimensional column vector of ones and \(I_n\) is the identity matrix of order \(n\).
b0 = 0
B0 = 9
# Importance-sampling weights: likelihood x prior / proposal, one per draw of theta
w = rep(0,M)
for (i in 1:M)
  w[i] = prod(dnorm(y,theta.d[i],sig))*dnorm(theta.d[i],b0,sqrt(B0))/dnorm(theta.d[i],ybar,se)
logpred.M1.sir = log(mean(w))
# Exact and MC-approximation of the prior predictive under model 1
V0 = matrix(B0,n,n)+diag(sig^2,n)
logpred.M1 = dmvnorm(y,rep(b0,n),V0,log=TRUE)
c(logpred.M1,logpred.M1.sir)
## [1] -660.5006 -660.5006
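As an aside, the same exact value can be recovered without building the \(n \times n\) covariance matrix, via the candidate's formula \(p(y_{1:n}) = p(y_{1:n}|\theta)p(\theta)/p(\theta|y_{1:n})\), which holds for any \(\theta\); under the conjugate Gaussian model the posterior is \(\theta|y_{1:n} \sim N(b_1,B_1)\) with \(B_1=(1/B_0+n/\sigma^2)^{-1}\) and \(b_1=B_1(b_0/B_0+n{\bar y}/\sigma^2)\). A minimal sketch, reusing the quantities defined above (the names B1, b1 and logpred.M1.exact are illustrative):
# Candidate's formula check of the exact log prior predictive under M1
B1 = 1/(1/B0+n/sig^2)
b1 = B1*(b0/B0+n*ybar/sig^2)
logpred.M1.exact = sum(dnorm(y,b1,sig,log=TRUE))+
  dnorm(b1,b0,sqrt(B0),log=TRUE)-
  dnorm(b1,b1,sqrt(B1),log=TRUE)
logpred.M1.exact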
Under model \(M_2\), the observations are iid \(t_\nu(\theta,\frac{\nu-2}{\nu}\sigma^2)\), with the same Gaussian prior \(\theta \sim N(b_0,B_0)\). Below we approximate the log prior predictive, conditional on \(\nu\), over a grid of degrees of freedom, reusing the same importance draws:
nus = c(3:30,seq(40,100,by=10),seq(200,1000,by=100))
nnu = length(nus)
logpred.M2.sir = rep(0,nnu)
for (j in 1:nnu){
  nu = nus[j]
  sig1 = sqrt((nu-2)/nu*sig^2)
  # Importance-sampling weights under the Student's t likelihood
  for (i in 1:M)
    w[i] = prod(dt((y-theta.d[i])/sig1,df=nu)/sig1)*
      dnorm(theta.d[i],b0,sqrt(B0))/dnorm(theta.d[i],ybar,se)
  logpred.M2.sir[j] = log(mean(w))
}
cbind(nus,logpred.M2.sir)
## nus logpred.M2.sir
## [1,] 3 -659.4134
## [2,] 4 -648.6910
## [3,] 5 -647.6866
## [4,] 6 -648.1098
## [5,] 7 -648.7964
## [6,] 8 -649.4940
## [7,] 9 -650.1410
## [8,] 10 -650.7257
## [9,] 11 -651.2507
## [10,] 12 -651.7224
## [11,] 13 -652.1474
## [12,] 14 -652.5321
## [13,] 15 -652.8817
## [14,] 16 -653.2008
## [15,] 17 -653.4933
## [16,] 18 -653.7624
## [17,] 19 -654.0109
## [18,] 20 -654.2411
## [19,] 21 -654.4550
## [20,] 22 -654.6542
## [21,] 23 -654.8404
## [22,] 24 -655.0147
## [23,] 25 -655.1783
## [24,] 26 -655.3322
## [25,] 27 -655.4772
## [26,] 28 -655.6141
## [27,] 29 -655.7436
## [28,] 30 -655.8662
## [29,] 40 -656.8120
## [30,] 50 -657.4340
## [31,] 60 -657.8752
## [32,] 70 -658.2048
## [33,] 80 -658.4605
## [34,] 90 -658.6648
## [35,] 100 -658.8318
## [36,] 200 -659.6260
## [37,] 300 -659.9078
## [38,] 400 -660.0523
## [39,] 500 -660.1401
## [40,] 600 -660.1992
## [41,] 700 -660.2416
## [42,] 800 -660.2735
## [43,] 900 -660.2985
## [44,] 1000 -660.3185
Graphically:
par(mfrow=c(1,1))
# Log prior predictive of M2 as a function of nu (on the log10 scale), with the
# Gaussian model M1 as a horizontal reference line
plot(log10(nus),logpred.M2.sir,xlab=expression(log10(nu)),
     ylab="Log predictive",type="b",pch=16)
abline(h=logpred.M1.sir,col=2,lwd=2)
legend("topright",legend=c("Gaussian","Student's t(nu)"),col=2:1,lty=1,pch=c(NA,16),bty="n")
title(paste("'Best' nu =",nus[which.max(logpred.M2.sir)]))
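A note on numerical stability: the \(M_2\) weights above are accumulated on the raw probability scale, which can underflow for longer series. A sketch of the same computation using log-weights and the log-sum-exp trick (the name logpred.M2.lse is illustrative; it should match logpred.M2.sir up to floating-point error):
# Same M2 approximation with log-weights, averaged via the log-sum-exp trick
logpred.M2.lse = rep(0,nnu)
logw = rep(0,M)
for (j in 1:nnu){
  nu = nus[j]
  sig1 = sqrt((nu-2)/nu*sig^2)
  for (i in 1:M)
    logw[i] = sum(dt((y-theta.d[i])/sig1,df=nu,log=TRUE))-n*log(sig1)+
      dnorm(theta.d[i],b0,sqrt(B0),log=TRUE)-
      dnorm(theta.d[i],ybar,se,log=TRUE)
  A = max(logw)
  logpred.M2.lse[j] = log(mean(exp(logw-A)))+A
}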
Under model \(M_3\), we assume that \[ y_1,\ldots,y_k \sim N(\theta,\sigma^2) \ \ \ \mbox{and} \ \ \ y_{k+1},\ldots,y_n \sim t_\nu\left(\theta,\frac{\nu-2}{\nu}\sigma^2\right), \] where \(\sigma^2\), \(\nu\) and \(k\) are known.
nus = 3:10
ks = seq(180,215,by=5)
nnu = length(nus)
nk = length(ks)
logpred.M3.sir = matrix(0,nk,nnu)
for (l in 1:nk){
  k = ks[l]
  for (j in 1:nnu){
    nu = nus[j]
    sig1 = sqrt((nu-2)/nu*sig^2)
    # Log importance-sampling weights: Gaussian log-likelihood for y[1:k],
    # Student's t log-likelihood for y[(k+1):n], plus log prior minus log proposal
    for (i in 1:M)
      w[i] = sum(dnorm(y[1:k],theta.d[i],sig,log=TRUE))+
        sum(dt((y[(k+1):n]-theta.d[i])/sig1,df=nu,log=TRUE))-(n-k)*log(sig1)+
        dnorm(theta.d[i],b0,sqrt(B0),log=TRUE)-
        dnorm(theta.d[i],ybar,se,log=TRUE)
    # Average the weights on the log scale (log-sum-exp trick)
    A = max(w)
    w1 = exp(w-A)
    logpred.M3.sir[l,j] = log(mean(w1))+A
  }
}
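Before tabulating the results, a quick diagnostic sketch: the effective sample size of the normalized importance weights, computed here from the log-weights w left over from the last \((k,\nu)\) pair visited by the loop; values much smaller than \(M\) would signal an unreliable approximation.
# Effective sample size of the normalized importance weights (last (k, nu) pair only)
w.norm = exp(w-max(w))
w.norm = w.norm/sum(w.norm)
1/sum(w.norm^2)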
Here are the MC-based approximations of the log prior predictives under model \(M_3\) for various combinations of \((k,\nu)\).
tab = logpred.M3.sir
rownames(tab)=ks
colnames(tab)=nus
round(tab,1)
## 3 4 5 6 7 8 9 10
## 180 -379.6 -380.2 -382.9 -385.1 -386.9 -388.2 -389.4 -390.3
## 185 -373.4 -374.2 -377.0 -379.2 -380.9 -382.3 -383.4 -384.4
## 190 -365.2 -366.7 -369.6 -372.0 -373.8 -375.2 -376.4 -377.4
## 195 -360.7 -361.6 -364.3 -366.5 -368.2 -369.5 -370.6 -371.5
## 200 -355.5 -356.8 -359.6 -361.8 -363.6 -364.9 -366.1 -367.0
## 205 -351.6 -352.4 -355.0 -357.2 -358.8 -360.1 -361.2 -362.1
## 210 -344.1 -344.8 -347.3 -349.4 -351.0 -352.3 -353.4 -354.2
## 215 -336.9 -337.5 -339.9 -341.9 -343.5 -344.8 -345.8 -346.7
Below are the normalized prior predictives, i.e. the posterior model probabilities of each pair \((k,\nu)\) under a discrete uniform prior over the pairs.
# Normalize on the log scale: subtract the maximum before exponentiating for stability
p = exp(tab-max(tab))
tab = round(p/sum(p),4)
rownames(tab)=ks
colnames(tab)=nus
tab
## 3 4 5 6 7 8 9 10
## 180 0.0000 0.0000 0.0000 0.0000 0e+00 0e+00 0e+00 0
## 185 0.0000 0.0000 0.0000 0.0000 0e+00 0e+00 0e+00 0
## 190 0.0000 0.0000 0.0000 0.0000 0e+00 0e+00 0e+00 0
## 195 0.0000 0.0000 0.0000 0.0000 0e+00 0e+00 0e+00 0
## 200 0.0000 0.0000 0.0000 0.0000 0e+00 0e+00 0e+00 0
## 205 0.0000 0.0000 0.0000 0.0000 0e+00 0e+00 0e+00 0
## 210 0.0004 0.0002 0.0000 0.0000 0e+00 0e+00 0e+00 0
## 215 0.6152 0.3479 0.0308 0.0042 9e-04 2e-04 1e-04 0
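Finally, as an additional illustrative sketch, the joint table can be collapsed into marginal posterior probabilities for the breakpoint \(k\) and for \(\nu\), still under the discrete uniform prior over the pairs:
# Marginal posterior probabilities of k (row sums) and of nu (column sums)
round(rowSums(tab),4)
round(colSums(tab),4)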