---
title: "Normal Dynamic Linear regression"
subtitle: "Bayesian inference"
author: "Hedibert Freitas Lopes"
date: "3/31/2022"
output:
  html_document:
    toc: true
    toc_depth: 2
    code_download: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction
In this note, we perform Bayesian inference in a simple dynamic linear regression model.  More specifically, for $t=1,\ldots,n$, the real-valued dependent variable $y_t$ is linearly related to the regressor/predictor $x_t$ as
$$
y_t|x_t,\sigma^2 \sim N(x_t \beta_t, \sigma^2),
$$
where $\sigma \in \Re^+$ is the observational standard deviation and $\beta_t \in \Re$ the time-varying slope.  The slope follows a simple random-walk dynamic:
$$
\beta_t | \beta_{t-1},\tau^2 \sim N(\beta_{t-1},\tau^2),
$$
with $\tau \in \Re^+$ the evolution standard deviation.  The specification is completed with the prior for the initial state $\beta_0$,
$$
\beta_0|D_0 \sim N(m_0,C_0),
$$
where $D_0$ is the accumulated knowledge at time $t=0$ and $D_t=\{D_{t-1},y_t\}$, for $t=1,\ldots,n$.

# Simulating the data
The data we simulate below come from three static linear regressions, one for each third of the sample, i.e.
$$
y_t|x_t \sim 
\left\{
\begin{array}{cc}
N(\gamma_1 x_t,\sigma^2) & t=1,\ldots,n/3, \\
N(\gamma_2 x_t,\sigma^2) & t=n/3+1,\ldots,2n/3, \\
N(\gamma_3 x_t,\sigma^2) & t=2n/3+1,\ldots,n,
\end{array}
\right.
$$
```{r fig.width=12, fig.height=10}
set.seed(12345)
n      = 300
gamma1 = 4
gamma2 = 1
gamma3 = -1
sig2   = 4
x      = rnorm(n)
y      = rep(0,n)
sig    = sqrt(sig2)
y[1:(n/3)]         = gamma1*x[1:(n/3)]         + rnorm(n/3,0,sig)
y[(n/3+1):(2*n/3)] = gamma2*x[(n/3+1):(2*n/3)] + rnorm(n/3,0,sig)
y[(2*n/3+1):n]     = gamma3*x[(2*n/3+1):n]     + rnorm(n/3,0,sig)

par(mfrow=c(2,2))
plot(x,y,col=c(rep(1,n/3),rep(2,n/3),rep(3,n/3)),pch=16)
plot(x,y,pch=16)
ts.plot(y)
ts.plot(x)
```

# Computing marginal likelihood for $(\sigma^2,\tau^2)$
As we learned in class, the Kalman filter recursions allow the evaluation of
$$
p(y_1,\ldots,y_n|\sigma^2,\tau^2) = \prod_{t=1}^n p(y_t|D_{t-1},\sigma^2,\tau^2),
$$
where $(y_t|D_{t-1},\sigma^2,\tau^2) \sim N(f_t,Q_t)$, with $f_t$ and $Q_t$ from the following recursion.  For $t=1,\ldots,n$,
\begin{eqnarray*}
\beta_{t-1}|D_{t-1},\sigma^2,\tau^2 &\sim& N(m_{t-1},C_{t-1})\\
\beta_t|D_{t-1},\sigma^2,\tau^2 &\sim& N(a_t,R_t)\\
y_t|D_{t-1},\sigma^2,\tau^2 &\sim& N(f_t,Q_t)\\
\beta_t|D_t,\sigma^2,\tau^2 &\sim& N(m_t,C_t),
\end{eqnarray*}
where $a_t=m_{t-1}$, $R_t=C_{t-1}+\tau^2$, $f_t=x_t a_t$, $Q_t=x_t^2 R_t+\sigma^2$, and 
$m_t = a_t + A_t(y_t-f_t)$ and $C_t=R_t-A_t^2Q_t$, for $A_t=R_tx_t/Q_t$.  

Therefore, the likelihood function of $(\sigma^2,\tau^2)$, with $\beta_1,\ldots,\beta_n$ integrated out, is easily obtained as
$$
L(\sigma^2,\tau^2|D_n) = \prod_{t=1}^n p_N(y_t|f_t,Q_t).
$$
and the maximum likelihood estimator is
$$
({\hat \sigma}^2,{\hat \tau}^2) = \arg\max_{(\sigma^2,\tau^2) \in \Re^{+2}} L(\sigma^2,\tau^2|D_n).
$$

```{r fig.width=12, fig.height=8}
m0    = 0
C0    = 1
N     = 50
sig2s = seq(3,5,length=N)
tau2s = seq(0.01,0.2,length=N)
loglikelihood = matrix(0,N,N)
for (i in 1:N){
  sig2 = sig2s[i]
  for (j in 1:N){
    m    = m0
    C    = C0
    tau2 = tau2s[j]
    loglike = 0
    for (t in 1:n){
      a     = m
      R     = C + tau2
      ft    = x[t]*a
      Qt    = x[t]^2*R + sig2
      loglike = loglike + dnorm(y[t],ft,sqrt(Qt),log=TRUE)
      At    = R*x[t]/Qt
      m     = a + At*(y[t]-ft)
      C     = R - At^2*Qt
    }
    loglikelihood[i,j] = loglike
  }
}

sig2.hat = sig2s[apply(loglikelihood==max(loglikelihood),1,sum)==1]
tau2.hat = tau2s[apply(loglikelihood==max(loglikelihood),2,sum)==1]

par(mfrow=c(1,1))
contour(sig2s,tau2s,exp(loglikelihood),
        xlab=expression(sigma^2),ylab=expression(tau^2),drawlabels=FALSE)
title("Likelihood function")
points(sig2.hat,tau2.hat,col=2,pch=16)
```
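As an alternative to the grid search above, the same log-likelihood can be wrapped in a function and maximized numerically.  The sketch below is our own refactoring (the name `kf.loglik` and the use of `optim` are choices made here, not part of the original analysis); it assumes `y` and `x` from the simulation chunk.

```{r}
# Kalman-filter log-likelihood of (sig2,tau2), with the states integrated out
kf.loglik = function(theta,y,x,m0=0,C0=1){
  sig2 = theta[1]; tau2 = theta[2]
  m = m0; C = C0; ll = 0
  for (t in 1:length(y)){
    a  = m
    R  = C + tau2
    ft = x[t]*a
    Qt = x[t]^2*R + sig2
    ll = ll + dnorm(y[t],ft,sqrt(Qt),log=TRUE)
    At = R*x[t]/Qt
    m  = a + At*(y[t]-ft)
    C  = R - At^2*Qt
  }
  ll
}
# Minimize the negative log-likelihood subject to positivity constraints
fit = optim(c(1,0.1),function(th) -kf.loglik(th,y,x),
            method="L-BFGS-B",lower=c(1e-6,1e-6))
fit$par
```

Unlike the grid search, the optimizer is not restricted to the $50 \times 50$ grid values, so it can refine $({\hat \sigma}^2,{\hat \tau}^2)$.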

# Sequentially learning $\beta_1,\ldots,\beta_n$ given $({\hat \sigma}^2,{\hat \tau}^2)$


We now rerun the Kalman filter recursions conditionally on $({\hat \sigma}^2,{\hat \tau}^2)$.  More precisely, we can derive the sequential posteriors
$$
p(\beta_t|D_t,{\hat \sigma}^2,{\hat \tau}^2) \equiv N(m_t,C_t),
$$
where $m_t$ and $C_t$ are intricate functions of $({\hat \sigma}^2,{\hat \tau}^2)$.  However, it is worth remembering that we are actually using the data twice.  First, we run the Kalman recursions several times over the whole data, i.e. $\{y_1,\ldots,y_n\}$, in order to obtain $({\hat \sigma}^2,{\hat \tau}^2)$.  Second, we rerun the Kalman recursions "pretending" that $\sigma={\hat \sigma}$ and $\tau={\hat \tau}$.

```{r fig.width=12, fig.height=8}
m0    = 0
C0    = 1
sig2  = sig2.hat
tau2  = tau2.hat
mf    = rep(0,n)
Cf    = rep(0,n)
m     = m0
C     = C0
for (t in 1:n){
  a     = m
  R     = C + tau2
  ft    = x[t]*a
  Qt    = x[t]^2*R + sig2
  At    = R*x[t]/Qt
  m     = a + At*(y[t]-ft)
  C     = R - At^2*Qt
  mf[t] = m
  Cf[t] = C
}

qf = cbind(mf-2*sqrt(Cf),mf+2*sqrt(Cf))
ts.plot(mf,ylim=range(qf),ylab="States")
lines(qf[,1],col=2)
lines(qf[,2],col=2)
segments(1,gamma1,n/3,gamma1,col=4)
segments(n/3+1,gamma2,2*n/3,gamma2,col=4)
segments(2*n/3+1,gamma3,n,gamma3,col=4)
abline(v=n/3,lty=2,col=6)
abline(v=2*n/3,lty=2,col=6)
```

# Jointly learning $\beta_1,\ldots,\beta_n,\sigma^2,\tau^2$ 

At this point, we are still not able to fully describe $p(\beta_t|D_t)$, unless we repeat the previous two parts for each time $t$.  If the interest is in learning about $\beta_1,\ldots,\beta_n$ and $(\sigma^2,\tau^2)$ jointly, conditionally on $D_n$, i.e. the whole data, then we can easily implement an MCMC scheme that cycles through the full conditionals:
$p(\beta_t|\beta_{-t},\sigma^2,\tau^2,y_{1:n},x_{1:n})$, for $t=1,\ldots,n$, $p(\sigma^2|\beta_{1:n},\tau^2,y_{1:n},x_{1:n})$, and 
$p(\tau^2|\beta_{1:n},\sigma^2,y_{1:n},x_{1:n})$.


## Full conditional of $\sigma^2$

Let us assume that $\sigma^2 \sim IG(a_0,b_0)$, so
$$
\sigma^2|\beta_{1:n},y_{1:n},x_{1:n} \sim IG\left(a_0+\frac{n}{2},b_0+\frac{\sum_{t=1}^n (y_t-\beta_t x_t)^2}{2}\right).
$$

## Full conditional of $\tau^2$

Let us assume that $\tau^2 \sim IG(c_0,d_0)$, so
$$
\tau^2|\beta_{1:n} \sim IG\left(c_0+\frac{n-1}{2},d_0+\frac{\sum_{t=2}^n (\beta_t-\beta_{t-1})^2}{2}\right).
$$
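Both full conditionals are inverse-gamma, which the MCMC code below samples as the reciprocal of a `rgamma` draw.  A quick sanity check of that device, with arbitrary illustrative values of $a$ and $b$:

```{r}
# If X ~ Gamma(a, rate=b), then 1/X ~ IG(a,b), whose mean is b/(a-1) for a>1
set.seed(1)
a = 5; b = 8
z = 1/rgamma(100000,a,b)
c(mean(z), b/(a-1))
```

The Monte Carlo mean and the theoretical mean should agree closely.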

## Full conditional of $\beta_t$

For $t=1$
$$
p(\beta_1|\beta_{-1},y_{1:n},x_{1:n},\sigma^2,\tau^2) \propto 
p(\beta_1)p(\beta_2|\beta_1,\tau^2)
p(y_1|\beta_1,x_1,\sigma^2) \equiv N(b_1,V_1),
$$
where $V_1 = 1/(1/(C_0+\tau^2)+1/\tau^2+x_1^2/\sigma^2)$ and $b_1 = V_1 (m_0/(C_0+\tau^2)+\beta_2/\tau^2+x_1y_1/\sigma^2)$.  Recall that $\beta_0 \sim N(m_0,C_0)$, so the prior $\beta_1 \sim N(m_0,C_0+\tau^2)$.

It is straightforward to see that, for $t=2,\ldots,n-1$, 
$$
p(\beta_t|\beta_{-t},y_{1:n},x_{1:n},\sigma^2,\tau^2) \propto 
p(\beta_t|\beta_{t-1},\tau^2)
p(\beta_{t+1}|\beta_t,\tau^2)
p(y_t|\beta_t,x_t,\sigma^2) \equiv N(b_t,V_t),
$$
where $V_t = 1/(2/\tau^2+x_t^2/\sigma^2)$ and $b_t=V_t((\beta_{t-1}+\beta_{t+1})/\tau^2+x_ty_t/\sigma^2)$.
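For completeness, $V_t$ and $b_t$ come from collecting the $\beta_t$ terms in the three normal kernels and completing the square:
$$
\log p(\beta_t|\beta_{-t},y_{1:n},x_{1:n},\sigma^2,\tau^2) = 
-\frac{1}{2}\left(\frac{2}{\tau^2}+\frac{x_t^2}{\sigma^2}\right)\beta_t^2
+\left(\frac{\beta_{t-1}+\beta_{t+1}}{\tau^2}+\frac{x_t y_t}{\sigma^2}\right)\beta_t + \mbox{const},
$$
so the precision is $1/V_t = 2/\tau^2+x_t^2/\sigma^2$ and the mean is $b_t = V_t$ times the coefficient of $\beta_t$ above.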


For $t=n$
$$
p(\beta_n|\beta_{-n},y_{1:n},x_{1:n},\sigma^2,\tau^2) \propto 
p(\beta_n|\beta_{n-1},\tau^2)
p(y_n|\beta_n,x_n,\sigma^2) \equiv N(b_n,V_n),
$$
where $V_n = 1/(1/\tau^2+x_n^2/\sigma^2)$ and $b_n = V_n (\beta_{n-1}/\tau^2+x_ny_n/\sigma^2)$.

```{r}
# Vague inverse-gamma hyperparameters (an alternative, slightly more
# informative choice would be a0=2; b0=1; c0=2; d0=1)
a0=0.01;b0=0.01;c0=0.01;d0=0.01

# Initial values
beta  = mf
sig2  = sig2.hat
tau2  = tau2.hat
M0    = 20000
M     = 5000
niter = M0 + M
draws = matrix(0,niter,n+2)
for (iter in 1:niter){
  sig2 = 1/rgamma(1,a0+n/2,b0+sum((y-beta*x)^2)/2)
  tau2 = 1/rgamma(1,c0+(n-1)/2,d0+sum((beta[2:n]-beta[1:(n-1)])^2)/2)
  V1 = 1/(1/(C0+tau2)+1/tau2+x[1]^2/sig2)
  b1 = V1*(m0/(C0+tau2)+beta[2]/tau2+x[1]*y[1]/sig2)
  beta[1] = rnorm(1,b1,sqrt(V1))
  for (t in 2:(n-1)){
    Vt = 1/(2/tau2+x[t]^2/sig2)         
    bt = Vt*((beta[t-1]+beta[t+1])/tau2+x[t]*y[t]/sig2)
    beta[t] = rnorm(1,bt,sqrt(Vt))
  }  
  Vn = 1/(1/tau2+x[n]^2/sig2)
  bn = Vn*(beta[n-1]/tau2+x[n]*y[n]/sig2)
  beta[n] = rnorm(1,bn,sqrt(Vn))
  draws[iter,] = c(beta,sig2,tau2)
}
draws = draws[(M0+1):niter,]
```

## Trace plots

```{r fig.width=12, fig.height=8}
par(mfrow=c(2,3))
ts.plot(draws[,1],main=expression(beta[1]),xlab="iterations",ylab="")
ts.plot(draws[,100],main=expression(beta[100]),xlab="iterations",ylab="")
ts.plot(draws[,200],main=expression(beta[200]),xlab="iterations",ylab="")
ts.plot(draws[,n],main=expression(beta[n]),xlab="iterations",ylab="")
ts.plot(draws[,n+1],main=expression(sigma^2),xlab="iterations",ylab="")
ts.plot(draws[,n+2],main=expression(tau^2),xlab="iterations",ylab="")
```

## Markov chain autocorrelations

```{r fig.width=12, fig.height=8}
par(mfrow=c(1,3))
acf(draws[,n+1],main=expression(sigma^2),ylab="")
acf(draws[,n+2],main=expression(tau^2),ylab="")
acfs = matrix(0,n,50)
for (i in 1:n){
  acf = acf(draws[,i],lag.max=50,plot=FALSE)
  acfs[i,] = acf$acf[2:51]
}
plot(acfs[1,],ylim=range(acfs),ylab="ACF",type="l",xlab="Lag")
for (i in 2:n)
  lines(acfs[i,])
```

## Marginal posterior densities

Below we plot histogram approximations of the marginal posterior densities $p(\beta_1|D_n)$, $p(\beta_{100}|D_n)$, $p(\beta_{200}|D_n)$, $p(\beta_n|D_n)$, $p(\sigma^2|D_n)$ and $p(\tau^2|D_n)$.

```{r fig.width=12, fig.height=8}
par(mfrow=c(2,3))
hist(draws[,1],main=expression(beta[1]),prob=TRUE,xlab="")
hist(draws[,100],main=expression(beta[100]),prob=TRUE,xlab="")
hist(draws[,200],main=expression(beta[200]),prob=TRUE,xlab="")
hist(draws[,n],main=expression(beta[n]),prob=TRUE,xlab="")
hist(draws[,n+1],main=expression(sigma^2),prob=TRUE,xlab="")
hist(draws[,n+2],main=expression(tau^2),prob=TRUE,xlab="")
```


## Joint posterior $p(\sigma^2,\tau^2|D_n)$

The figure below overlays the contours of $p(\sigma^2,\tau^2|D_n)$, evaluated on a fine grid, on an image of the bivariate kernel density estimate of this joint distribution based on the posterior draws of $\sigma^2$ and $\tau^2$.  

```{r fig.width=10, fig.height=6}
ldigamma=function(x,a,b){dgamma(1/x,a,b,log=TRUE)-2*log(x)}
logprior = matrix(0,N,N)
for (i in 1:N)
  for (j in 1:N)
    logprior[i,j] = ldigamma(sig2s[i],a0,b0)+ldigamma(tau2s[j],c0,d0)
logposterior = logprior+loglikelihood

library(MASS)
library(RColorBrewer)

rf = colorRampPalette(rev(brewer.pal(11,'Spectral')))
r  = rf(32)
xx = draws[,n+1]
yy = draws[,n+2]
df = data.frame(xx,yy)
k = kde2d(df$xx, df$yy, n=200)

sig2.mode = k$x[apply(k$z==max(k$z),1,sum)==1]
tau2.mode = k$y[apply(k$z==max(k$z),2,sum)==1]
image(k, col=r,xlab=expression(sigma^2),ylab=expression(tau^2))
points(sig2.mode,tau2.mode,col=4,pch=16)
points(sig2.hat,tau2.hat,col=3,pch=16)
contour(sig2s,tau2s,exp(logposterior),add=TRUE,drawlabels=FALSE)
```

Below we show the values of $\sigma^2$ and $\tau^2$ that maximize the likelihood function and the joint posterior, respectively.  Because the prior is fairly uninformative, the two estimators are quite similar.


```{r}
c(sig2.hat,sig2.mode)
c(tau2.hat,tau2.mode)
```


## Smoothed states marginal posterior densities

Below we plot the 2.5th, 50th and 97.5th percentiles of
$$
p(\beta_t|D_n) \ \ \ \mbox{and} \ \ \ p(\beta_t|D_t,{\hat \sigma}^2,{\hat \tau}^2).
$$
We are still a bit short of producing $p(\beta_t|D_t)$, i.e. the online distribution of the states unconditional on $(\sigma^2,\tau^2)$.  We will be able to do that once we discuss sequential Monte Carlo tools, also known as *particle filters*.

```{r fig.width=10, fig.height=6}
qbeta = t(apply(draws[,1:n],2,quantile,c(0.025,0.5,0.975)))
range = range(qbeta,qf)
par(mfrow=c(1,1))
ts.plot(qbeta,lty=c(2,1,2),ylim=range,xlab="Time",ylab=expression(beta),main="")
lines(qf[,1],col=2,lty=2)
lines(mf,col=2,lty=1)
lines(qf[,2],col=2,lty=2)
segments(1,gamma1,n/3,gamma1,col=4)
segments(n/3+1,gamma2,2*n/3,gamma2,col=4)
segments(2*n/3+1,gamma3,n,gamma3,col=4)
abline(v=n/3,lty=2,col=6)
abline(v=2*n/3,lty=2,col=6)
legend("topright",legend=c("Smoothed densities (MCMC)","Filtered densities (given sig2,tau2)"),col=1:2,lty=1)
```


# Final improvement: sampling $\beta_1,\ldots,\beta_n$ jointly

Instead of sampling $\beta_t$ from its full conditional, the *forward filtering, backward sampling* scheme samples $\beta_1,\ldots,\beta_n$ jointly.  More specifically, FFBS uses the following results:
$$
p(\beta_1,\ldots,\beta_n|\sigma^2,\tau^2,D_n) = p(\beta_n|\sigma^2,\tau^2,D_n)
\prod_{t=1}^{n-1} p(\beta_t|\beta_{t+1},\sigma^2,\tau^2,D_n),
$$
where $(\beta_n|\sigma^2,\tau^2,D_n) \sim N({\tilde m}_n,{\tilde C}_n)$, with 
${\tilde m}_n=m_n$ and ${\tilde C}_n=C_n$.  Also,
$(\beta_t|\beta_{t+1},\sigma^2,\tau^2,D_n) \sim N({\tilde m}_t,{\tilde C}_t)$, where
$$
{\tilde C}_t = 1/(1/\tau^2+1/C_t) \ \ \mbox{and} \ \ 
{\tilde m}_t = {\tilde C}_t(\beta_{t+1}/\tau^2 + m_t/C_t),
$$
for $t=n-1,\ldots,1$.
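The backward-sampling moments come from combining the evolution equation with the filtered distribution; sketching the derivation,
$$
p(\beta_t|\beta_{t+1},\sigma^2,\tau^2,D_n) \propto
p_N(\beta_{t+1}|\beta_t,\tau^2)\, p_N(\beta_t|m_t,C_t),
$$
and completing the square in $\beta_t$ gives the precision $1/{\tilde C}_t = 1/\tau^2+1/C_t$ and the mean ${\tilde m}_t = {\tilde C}_t(\beta_{t+1}/\tau^2+m_t/C_t)$, exactly the quantities used in the backward loop of the code below.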

```{r}
# MCMC scheme
beta   = mf
sig2   = sig2.hat
tau2   = tau2.hat
M0     = 20000
M      = 5000
niter  = M0 + M
ms     = rep(0,n)
Cs     = rep(0,n)
m      = m0
C      = C0
draws1 = matrix(0,niter,n+2)
for (iter in 1:niter){
  for (t in 1:n){
    at    = m
    Rt    = C+tau2
    ft    = x[t]*at
    Qt    = x[t]^2*Rt+sig2
    At    = Rt*x[t]/Qt
    m     = at + At*(y[t]-ft)
    C     = Rt - At^2*Qt
    ms[t] = m
    Cs[t] = C
  }
  betas = rep(0,n)
  betas[n] = rnorm(1,ms[n],sqrt(Cs[n]))
  for (t in (n-1):1){
    var = 1/(1/tau2+1/Cs[t])
    mean = var*(betas[t+1]/tau2+ms[t]/Cs[t])
    betas[t] = rnorm(1,mean,sqrt(var))	
  }

  # sampling sig2
  sig2 = 1/rgamma(1,a0+n/2,b0+sum((y-x*betas)^2)/2)

  # sampling tau2
  tau2 = 1/rgamma(1,c0+(n-1)/2,d0+sum((betas[2:n]-betas[1:(n-1)])^2)/2)

  # storing draws
  draws1[iter,] = c(betas,sig2,tau2)
}
draws1 = draws1[(M0+1):niter,]
qbeta1 = t(apply(draws1[,1:n],2,quantile,c(0.025,0.5,0.975)))
```


```{r fig.width=12, fig.height=8}
par(mfrow=c(2,3))
plot(density(draws[,1]),main=expression(beta[1]),xlab="")
lines(density(draws1[,1]),col=2)
plot(density(draws[,100]),main=expression(beta[100]),xlab="")
lines(density(draws1[,100]),col=2)
plot(density(draws[,200]),main=expression(beta[200]),xlab="")
lines(density(draws1[,200]),col=2)
plot(density(draws[,n]),main=expression(beta[n]),xlab="")
lines(density(draws1[,n]),col=2)
plot(density(draws[,n+1]),main=expression(sigma^2),xlab="")
lines(density(draws1[,n+1]),col=2)
plot(density(draws[,n+2]),main=expression(tau^2),xlab="")
lines(density(draws1[,n+2]),col=2)
```


## $\beta_t$ autocorrelations

```{r fig.width=12, fig.height=8}
par(mfrow=c(1,2))
acfs1 = matrix(0,n,50)
for (i in 1:n){
  acf = acf(draws1[,i],lag.max=50,plot=FALSE)
  acfs1[i,] = acf$acf[2:51]
}
plot(acfs[1,],ylim=range(acfs,acfs1),ylab="ACF",type="l",xlab="Lag")
for (i in 2:n)
  lines(acfs[i,])
plot(acfs1[1,],ylim=range(acfs,acfs1),ylab="ACF",type="l",xlab="Lag")
for (i in 2:n)
  lines(acfs1[i,])
```



```{r fig.width=10, fig.height=6}
range = range(qbeta,qbeta1)
par(mfrow=c(1,1))
ts.plot(qbeta,lty=c(2,1,2),ylim=range,xlab="Time",ylab=expression(beta),main="")
lines(qbeta1[,1],col=2,lty=2)
lines(qbeta1[,2],col=2,lty=1)
lines(qbeta1[,3],col=2,lty=2)
segments(1,gamma1,n/3,gamma1,col=4)
segments(n/3+1,gamma2,2*n/3,gamma2,col=4)
segments(2*n/3+1,gamma3,n,gamma3,col=4)
abline(v=n/3,lty=2,col=6)
abline(v=2*n/3,lty=2,col=6)
legend("topright",legend=c("Block sampling","Individual sampling"),col=1:2,lty=1)
```
