---
title: "Class material"
author: "Hedibert Freitas Lopes"
date: "5/21/2024"
output:
  html_document:
    toc: true
    toc_depth: 3
    toc_collapsed: true
    code_download: yes
    number_sections: true
---

# A Markov chain example
Suppose that $P$ is the transition matrix of a discrete-time (homogeneous) Markov chain with 5 states:
$$
P_{ij} = Pr(X_t=j|X_{t-1}=i) \qquad \forall t \ \ \mbox{and} \ \forall i,j
$$
Let $P$ be the following matrix:
$$
P =
\begin{bmatrix}
0.6&0.1&0.1&0.1&0.1\\
0.0&0.4&0.2&0.1&0.3\\
0.1&0.0&0.5&0.3&0.1\\
0.0&0.0&0.1&0.8&0.1\\
0.1&0.1&0.1&0.0&0.7
\end{bmatrix}
$$
It is easy to show that the probability of starting at state $i$ (time $t=0$) and reaching state $j$ after $k$ iterations (time $t=k$) of the Markov chain is equal to
$$
Pr(X_k=j|X_0=i) = (P^k)_{ij}
$$

```{r}
P = matrix(c(
0.6,0.1,0.1,0.1,0.1,
0.0,0.4,0.2,0.1,0.3,
0.1,0.0,0.5,0.3,0.1,
0.0,0.0,0.1,0.8,0.1,
0.1,0.1,0.1,0.0,0.7),5,5,byrow=TRUE)
P
```

For example $P^5$ is the following matrix
```{r}
P5 = P%*%P%*%P%*%P%*%P
round(P5,4)
```
such that 
$$
Pr(X_5=4|X_0=2) = (P^5)_{24} = 0.2942 \ > \ 0.1 = P_{24} = Pr(X_1=4|X_0=2)
$$

## Limiting/equilibrium distribution

What does $P^{100}$ look like?
```{r}
P1 = P
for (k in 2:100)
  P1 = P1%*%P
round(P1,4)
```

In other words, after a large number of iterations ($k=100$ here) the rows of the transition matrix converge to the same probability distribution over the 5 states:
$$
Pr(X_k=j|X_0=i) = Pr(X_k=j) = \pi_j \qquad \forall i,j.
$$
The distribution $\pi=(\pi_1,\ldots,\pi_5)$ is known as the *limiting distribution* of the Markov chain, or its *equilibrium distribution*.
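
The limiting distribution can also be obtained without iterating $P$: $\pi$ is the normalized left eigenvector of $P$ associated with the eigenvalue 1, i.e. the solution of $\pi'P=\pi'$ with $\sum_j \pi_j=1$. A minimal sketch of this check ($P$ is redefined so the chunk is self-contained):

```{r}
# Stationary distribution as the normalized left eigenvector of P
# associated with eigenvalue 1 (the largest eigenvalue of a stochastic matrix).
P = matrix(c(
0.6,0.1,0.1,0.1,0.1,
0.0,0.4,0.2,0.1,0.3,
0.1,0.0,0.5,0.3,0.1,
0.0,0.0,0.1,0.8,0.1,
0.1,0.1,0.1,0.0,0.7),5,5,byrow=TRUE)
ev      = eigen(t(P))                                # left eigenvectors of P
v       = Re(ev$vectors[,which.max(Re(ev$values))])  # eigenvector for eigenvalue 1
pi.stat = v/sum(v)                                   # normalize to a probability vector
round(pi.stat,4)
```

The result agrees with the common row of $P^{100}$ computed above.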

## Markov chain and Bayesian computation

Markov chain Monte Carlo (MCMC) schemes are Monte Carlo schemes that build Markov chains whose equilibrium/limiting distribution is the posterior distribution of interest.  The Gibbs sampler, random-walk Metropolis-Hastings, independent Metropolis-Hastings, Hamiltonian Monte Carlo, delayed Metropolis and multiple-try Metropolis are all schemes based on well-defined Markov chain structures.

## Sampling the path of the Markov chain
Let us now compare the theoretical equilibrium distribution, 
$$
\pi=(0.1152,0.0664,0.1777,0.3574,0.2832),
$$
with its sampling approximation based on the path of the simulated Markov chain.

```{r fig.width=6, fig.height=10, fig.align='center'}
xs = rep(0,100000)
x  = 1
for (t in 1:100000){
  x = sample(1:5,size=1,prob=P[x,])
  xs[t] = x
}
xs = xs[95001:100000]

par(mfrow=c(2,1))
ts.plot(xs,ylab="State of the Markov chain",xlab="Iteration (after burn-in of size 95000)")
plot(table(xs)/5000,ylab="Relative frequency",lwd=6,xlab="States")
lines(P1[1,],type="h",col=2,lwd=2)
legend("topleft",legend=c("Theoretical limiting probability","Markov chain simulation"),col=2:1,lwd=3,bty="n")

tab = rbind(P1[1,],table(xs)/5000)
rownames(tab) = c("Theoretical","Simulated")
round(tab,4)
```

# Multiple linear regression via Gibbs sampler

Suppose the data is $\mbox{data}=\{(y_1,x_1),\ldots,(y_n,x_n)\}$, where $y_i$ is a real number and $x_i=(x_{i1},\dots,x_{ip})'$ is a $p$-dimensional vector of explanatory variables, also known as regressors or features or characteristics of unit $i$, for $i=1,\ldots,n$.  One of the most common models for this kind of arrangement is that of a Gaussian linear regression, where 
$$
y_i | x_i,\beta,\sigma^2 \sim N(x_i'\beta,\sigma^2),
$$
where $\beta=(\beta_1,\ldots,\beta_p)'$ and the $y_i$'s are conditionally independent given $x_i$, $\beta$ and $\sigma^2$.

## Likelihood function
This heavily structured model (linearity, Gaussianity, exogeneity, homoskedasticity) has likelihood 
$$
L(\beta_1,\ldots,\beta_p,\sigma^2) = \prod_{i=1}^n (2\pi\sigma^2)^{-1/2}\exp\left\{-\frac{1}{2\sigma^2}(y_i-x_i'\beta)^2\right\},
$$
for $\beta_i \in \Re$ and $\sigma^2 \in \Re^+$.  In matrix notation, 
$$
y|X,\beta,\sigma^2 \sim N(X\beta,\sigma^2I_n),
$$
where $y=(y_1,\ldots,y_n)'$, $X=(x_1,\ldots,x_n)'$ of dimension $(n \times p)$ and $I_n$ is the identity matrix of order $n$.  In this case the likelihood becomes
$$
L(\beta,\sigma^2) \propto (\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}(y-X\beta)'(y-X\beta)\right\}.
$$
A further simplification leads to 
$$
L(\beta,\sigma^2) \propto (\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}(\beta' X'X \beta - 2\beta'X'y+y'y)\right\}.
$$
In other words, the sufficient statistics for the Gaussian linear regression are the cross-products
$S_{xx}=X'X$, $S_{xy}=X'y$ and $S_{yy}=y'y$.
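
As a quick numerical sanity check of this last identity, a sketch with simulated data and an arbitrary coefficient vector (all values below are illustrative choices, not part of the example above):

```{r}
# Verify (y - X b)'(y - X b) = b' Sxx b - 2 b' Sxy + Syy numerically
set.seed(123)
n = 50; p = 3
X = matrix(rnorm(n*p),n,p)
y = X %*% c(1,-2,0.5) + rnorm(n)
b = rnorm(p)                       # any candidate coefficient vector
Sxx = t(X) %*% X
Sxy = t(X) %*% y
Syy = sum(y^2)
lhs = sum((y - X %*% b)^2)
rhs = as.numeric(t(b) %*% Sxx %*% b - 2*t(b) %*% Sxy + Syy)
c(lhs,rhs)                         # identical up to floating point
```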

## Prior distribution
Let us assume that, for $i=1,\ldots,p$, $\beta_i \sim N(0,\tau^2)$, say $\tau=1$ or $\tau=10$, and $p(\sigma^2) \propto 1/\sigma^2$.  This way the prior of $(\beta,\sigma^2)$ is 
$$
p(\beta,\sigma^2) \propto \frac{1}{\sigma^2}\exp\left\{-\frac{1}{2\tau^2}\sum_{j=1}^p \beta_j^2\right\}=
\frac{1}{\sigma^2}\exp\left\{-\frac{1}{2\tau^2} \beta'\beta\right\}.
$$

## Posterior distribution

Combining the likelihood function with the prior density, we arrive at the kernel of the posterior density
$$
p(\beta,\sigma^2|\mbox{data}) \propto 
\left[(\sigma^2)^{-1}\exp\left\{-\frac{1}{2\tau^2} \beta'\beta\right\}\right] \times 
\left[(\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma ^2}(\beta' X'X \beta - 2\beta'X'y+y'y)\right\}\right].
$$

## Full conditional distributions

Suppose we do not recognize this joint posterior, but are able to derive the full conditionals 
$$
p(\beta|\sigma^2,\mbox{data}) \ \ \ \mbox{and} \ \ \ p(\sigma^2|\beta,\mbox{data}).
$$

### Full conditional of $\sigma^2$ 

$$
p(\sigma^2|\beta,\mbox{data}) \propto (\sigma^2)^{-\left(\frac{n}{2}+1\right)}
\exp\left\{-\frac{1}{2\sigma ^2}\left(\beta' X'X \beta - 2\beta'X'y+y'y\right)\right\},
$$
which is the kernel of an inverse-gamma distribution
$$
(\sigma^2|\beta,\mbox{data}) \sim IG\left(\frac{n}{2},\frac{\beta' S_{xx} \beta - 2\beta'S_{xy}+S_{yy}}{2}\right) \equiv IG\left(\frac{n}{2},\frac{(y-X\beta)'(y-X\beta)}{2}\right).
$$

### Full conditional of $\beta$

We can see that
$$
p(\beta|\sigma^2,\mbox{data}) \propto \exp\left\{-\frac{1}{2\tau^2} \beta'\beta\right\}
\exp\left\{-\frac{1}{2\sigma ^2}(\beta' X'X \beta - 2\beta'X'y)\right\} =
\exp\left\{-\frac{1}{2}\left(\beta'(I_p/\tau^2+X'X/\sigma^2)\beta-2\beta'X'y/\sigma^2\right)\right\},
$$
which looks like the kernel of a multivariate normal distribution with mean and variance
$$
(I_p/\tau^2+X'X/\sigma^2)^{-1}X'y/\sigma^2  \ \ \ \mbox{and} \ \ \ (I_p/\tau^2+X'X/\sigma^2)^{-1},
$$
or
$$
(\beta|\sigma^2,\mbox{data}) \sim N\left[(I_p/\tau^2+S_{xx}/\sigma^2)^{-1}S_{xy}/\sigma^2, (I_p/\tau^2+S_{xx}/\sigma^2)^{-1}\right].
$$

## Gibbs sampler

Given an initial value for $\sigma^2$, iterate between the two steps:

1) Sample $\beta$ from $N\left[(I_p/\tau^2+S_{xx}/\sigma^2)^{-1}S_{xy}/\sigma^2, (I_p/\tau^2+S_{xx}/\sigma^2)^{-1}\right]$

2) Sample $\sigma^2$ from $IG\left[n/2,(\beta' S_{xx} \beta - 2\beta'S_{xy}+S_{yy})/2\right]$
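
Before turning to real data, the two steps above can be sketched on simulated data (true parameter values below are arbitrary choices for illustration); here $\beta$ is drawn via a Cholesky factor of its conditional covariance, avoiding the `mvtnorm` dependency used later:

```{r}
# Gibbs sampler sketch on simulated data: steps 1) and 2) above,
# with beta drawn as m + L z, where V = L L' (L a Cholesky factor).
set.seed(321)
n = 200; p = 2; tau2 = 10^2
beta.true = c(1.5,-0.8); sig2.true = 0.5^2
X = cbind(1,rnorm(n))
y = X %*% beta.true + rnorm(n,sd=sqrt(sig2.true))
Sxx = t(X) %*% X; Sxy = t(X) %*% y
sigma2 = 1
M = 2000
draws = matrix(0,M,p+1)
for (iter in 1:M){
  V      = solve(diag(1/tau2,p) + Sxx/sigma2)
  m      = V %*% Sxy/sigma2
  beta   = m + t(chol(V)) %*% rnorm(p)                   # step 1: beta | sigma2
  sigma2 = 1/rgamma(1,n/2,sum((y - X %*% beta)^2)/2)     # step 2: sigma2 | beta
  draws[iter,] = c(beta,sigma2)
}
round(colMeans(draws[1001:M,]),2)  # should be near the true (1.5, -0.8, 0.25)
```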



# Return on education

The data come from the 1976 Panel Study of Income Dynamics, based on data for the previous year, 1975.  Of the 753 observations, the first 428 are for women with positive hours worked in 1975, while the remaining 325 observations are for women who did not work for pay in 1975.  A more complete discussion of the data is found in Appendix 1 of Mroz (1987): Thomas A. Mroz (1987), "The Sensitivity of an Empirical Model of Married Women's Hours of Work to Economic and Statistical Assumptions," *Econometrica*, Vol. 55, No. 4, pp. 765-799.  Stable URL: http://www.jstor.org/stable/1911029.  The variables in the dataset are as follows:


* LFP  "A dummy variable = 1 if woman worked in 1975, else 0";
* WHRS "Wife's hours of work in 1975";
* KL6  "Number of children less than 6 years old in household";
* K618 "Number of children between ages 6 and 18 in household";
* WA   "Wife's age";
* WE   "Wife's educational attainment, in years";
* WW   "Wife's average hourly earnings, in 1975 dollars";
* RPWG "Wife's wage reported at the time of the 1976 interview (not the same as the 1975 estimated wage).  To use the subsample with this wage, one needs to select 1975 workers with LFP=1, then select only those women with non-zero RPWG.  Only 325 women work in 1975 and have a non-zero RPWG in 1976.";
* HHRS "Husband's hours worked in 1975";
* HA   "Husband's age";
* HE   "Husband's educational attainment, in years";
* HW   "Husband's wage, in 1975 dollars";
* FAMINC "Family income, in 1975 dollars.  This variable is used to construct the property income variable.";
* MTR  "This is the marginal tax rate facing the wife, and is taken from published federal tax tables (state and local income taxes are excluded). The taxable income on which this tax rate is calculated includes Social Security, if applicable to wife.";
* WMED "Wife's mother's educational attainment, in years";
* WFED "Wife's father's educational attainment, in years";
* UN   "Unemployment rate in county of residence, in percentage points.  This is taken from bracketed ranges.";
* CIT  "Dummy variable = 1 if live in large city (SMSA), else 0";
* AX   "Actual years of wife's previous labor market experience";

```{r fig.width=10, fig.height=6, fig.align='center'}
data = read.table("http://hedibert.org/wp-content/uploads/2020/01/mroz-data.txt",header=TRUE)

attach(data)

names = c("working (W)",
          "hours worked (W)",
          "children 6-18",
          "hourly earnings (W)",
          "hours worked (H)",
          "wage (H)",
          "marginal tax rate (W)",
          "live in large city")

y = scale(log(FAMINC))
X = cbind(LFP,WHRS,K618,WW,HHRS,HW,MTR,CIT)
n = nrow(X)
p = ncol(X)


par(mfrow=c(2,4))
boxplot(y~X[,1],xlab=names[1],main="",outline=FALSE,names=c("No","Yes"),ylab="Family income (log)")
for (i in 2:(p-1))
  plot(X[,i],y,xlab=names[i],ylab="Family income (log)")
boxplot(y~X[,p],xlab=names[p],main="",outline=FALSE,names=c("No","Yes"),ylab="Family income (log)")
```

## Ordinary least squares
```{r}
fit.ols = lm(y~X-1)
summary(fit.ols)
beta.ols = fit.ols$coef
sig.ols  = sqrt(sum((y-X%*%beta.ols)^2)/(n-p))
```

## Bayesian posterior inference via Gibbs sampler
```{r}
library("mvtnorm")
tau2   = 1.0
Sxx    = t(X)%*%X
Sxy    = t(X)%*%y
a      = n/2
Ip     = diag(1/tau2,p)   # prior precision matrix I_p/tau^2
sigma2 = 1
M0     = 10000
M      = 10000
niter  = M0+M
draws  = matrix(0,niter,p+1)
for (iter in 1:niter){
  V      = solve(Ip+Sxx/sigma2)
  m      = V%*%Sxy/sigma2
  beta   = matrix(rmvnorm(1,mean=m,sigma=V),p,1)
  sigma2 = 1/rgamma(1,a,sum((y-X%*%beta)^2)/2)
  draws[iter,] = c(beta,sqrt(sigma2))
}
draws = draws[(M0+1):niter,]
```

## Graphical summaries

```{r fig.width=10, fig.height=7, fig.align='center'}
par(mfrow=c(3,3))
for (i in 1:p){
  hist(draws[,i],prob=TRUE,main=names[i],xlab=paste("beta(",i,")",sep=""))
  abline(v=0,col=2,lwd=3)
  abline(v=beta.ols[i],col=3,lwd=3)
}
hist(draws[,p+1],prob=TRUE,main="sigma",xlab="")
abline(v=sig.ols,col=3,lwd=3)
```
