This document illustrates how a simple reparametrization of the simple linear regression model can dramatically improve the mixing of a Gibbs sampler. We compare:
Model (1): the original parametrization \[ y_i = \alpha + \beta x_i + \varepsilon_i, \qquad \varepsilon_i \stackrel{iid}{\sim} N(0,\sigma^2), \qquad i=1,\ldots,n. \]
Model (2): the centered parametrization \[ y_i = \alpha^{*} + \beta^{*}(x_i - \bar x) + \varepsilon_i, \] which is equivalent to (1) via \(\beta^{*}=\beta\) and \(\alpha = \alpha^{*}-\beta^{*}\bar x\).
In both cases we use the standard noninformative prior \(p(\alpha,\beta,\sigma^{2}) \propto 1/\sigma^{2}\).
Key point. To see the mixing difference one must update the regression coefficients one at a time (true single-site Gibbs). A block update of \((\alpha,\beta)\) jointly from \(N\!\left((X^\top X)^{-1}X^\top y,\sigma^2(X^\top X)^{-1}\right)\) is nearly i.i.d. in both parametrizations — it sidesteps the very problem we want to expose. Below we therefore implement the single-site Gibbs sampler.
With \(p(\alpha,\beta,\sigma^2)\propto 1/\sigma^2\),
\[ \alpha \mid \beta,\sigma^{2},\boldsymbol y,\boldsymbol x \;\sim\; N\!\left(\bar y - \beta\bar x,\; \frac{\sigma^{2}}{n}\right), \]
\[ \beta \mid \alpha,\sigma^{2},\boldsymbol y,\boldsymbol x \;\sim\; N\!\left(\frac{\sum_i x_i(y_i-\alpha)}{\sum_i x_i^{2}},\; \frac{\sigma^{2}}{\sum_i x_i^{2}}\right), \]
\[ \sigma^{2}\mid \alpha,\beta,\boldsymbol y,\boldsymbol x \;\sim\; \mathrm{IG}\!\left(\frac{n}{2},\; \frac{\sum_i(y_i-\alpha-\beta x_i)^{2}}{2}\right). \]
For Model (2) the same formulas hold with \(x_i\) replaced by \(x_i-\bar x\), \(\alpha\) by \(\alpha^{*}\) and \(\beta\) by \(\beta^{*}\).
The geometric source of the mixing problem is now visible. Conditional on \(\beta\), the mean of \(\alpha\) is \(\bar y - \beta\bar x\): when \(\bar x \neq 0\), any move in \(\beta\) forces a deterministic shift in the conditional mean of \(\alpha\) of size \(-\bar x\,\Delta\beta\). The chain therefore walks along the ridge \(\alpha + \beta\bar x = \bar y\) one tiny step at a time. Under (2), \(\bar x\) is replaced by \(\overline{x-\bar x}=0\), so the conditional mean of \(\alpha^{*}\) is just \(\bar y\), independent of \(\beta^{*}\). That is exactly why (2) mixes so well.
n <- 100
alpha <- 1
beta <- 3
sigma <- 1
x <- rnorm(n, mean = 5, sd = 1)
y <- alpha + beta * x + rnorm(n, sd = sigma)
xc <- x - mean(x) # centered covariate
c(xbar = mean(x), ybar = mean(y))
## xbar ybar
## 5.149703 16.491752
One function does the job; passing x runs (1), passing
x - mean(x) runs (2).
gibbs_lm_scalar <- function(y, x, M = 5000, burn = 1000,
init = c(0, 0, 1)) {
n <- length(y)
sx <- sum(x)
sx2 <- sum(x^2)
sxy <- sum(x * y)
sy <- sum(y)
a <- init[1]; b <- init[2]; s2 <- init[3]
keep <- matrix(NA_real_, M + burn, 3,
dimnames = list(NULL, c("alpha", "beta", "sigma2")))
for (i in seq_len(M + burn)) {
# alpha | beta, sigma^2 ~ N( (sum(y) - beta*sum(x))/n , sigma^2/n )
mu_a <- (sy - b * sx) / n
a <- rnorm(1, mean = mu_a, sd = sqrt(s2 / n))
# beta | alpha, sigma^2 ~ N( (sum(x*y) - alpha*sum(x))/sum(x^2), sigma^2/sum(x^2) )
mu_b <- (sxy - a * sx) / sx2
b <- rnorm(1, mean = mu_b, sd = sqrt(s2 / sx2))
# sigma^2 | alpha, beta ~ IG( n/2 , sum((y - alpha - beta*x)^2)/2 )
rss <- sum((y - a - b * x)^2)
s2 <- 1 / rgamma(1, shape = n / 2, rate = rss / 2)
keep[i, ] <- c(a, b, s2)
}
keep[(burn + 1):(burn + M), , drop = FALSE]
}
M <- 5000
burn <- 1000
draws1 <- gibbs_lm_scalar(y, x, M = M, burn = burn, init = c(0, 0, 1)) # Model (1)
draws2 <- gibbs_lm_scalar(y, xc, M = M, burn = burn, init = c(0, 0, 1)) # Model (2)
Posterior summaries:
summ <- function(z) c(mean = mean(z), sd = sd(z),
q025 = quantile(z, 0.025, names = FALSE),
q975 = quantile(z, 0.975, names = FALSE))
tab <- rbind(
alpha = summ(draws1[, "alpha"]),
beta = summ(draws1[, "beta"]),
sigma2_m1 = summ(draws1[, "sigma2"]),
alpha_star = summ(draws2[, "alpha"]),
beta_star = summ(draws2[, "beta"]),
sigma2_m2 = summ(draws2[, "sigma2"])
)
round(tab, 4)
## mean sd q025 q975
## alpha 0.1531 0.6046 -1.0402 1.4023
## beta 3.1726 0.1154 2.9342 3.4010
## sigma2_m1 1.0938 0.1585 0.8258 1.4474
## alpha_star 16.4919 0.1047 16.2870 16.6954
## beta_star 3.1607 0.1120 2.9351 3.3769
## sigma2_m2 1.0899 0.1608 0.8193 1.4530
As expected, \(\widehat\alpha \approx \widehat{\alpha^{*}} - \widehat{\beta^{*}}\bar x\) and \(\widehat\beta = \widehat{\beta^{*}}\).
cor1 <- cor(draws1[, "alpha"], draws1[, "beta"])
cor2 <- cor(draws2[, "alpha"], draws2[, "beta"])
c(`cor(alpha, beta)` = round(cor1, 4),
`cor(alpha*, beta*)` = round(cor2, 4))
## cor(alpha, beta) cor(alpha*, beta*)
## -0.9854 0.0038
With \(\bar x \approx 5 \gg 0\), \(\alpha\) and \(\beta\) are very strongly negatively correlated in the posterior of Model (1) (cor \(\approx -0.98\)), whereas \(\alpha^{*}\) and \(\beta^{*}\) are nearly uncorrelated.
par(mfrow = c(3, 2), mar = c(4, 4, 3, 1))
plot(draws1[, "alpha"], type = "l", col = "steelblue",
main = expression("Model (1): trace of " * alpha),
xlab = "iteration", ylab = expression(alpha))
plot(draws2[, "alpha"], type = "l", col = "darkorange",
main = expression("Model (2): trace of " * alpha^"*"),
xlab = "iteration", ylab = expression(alpha^"*"))
plot(draws1[, "beta"], type = "l", col = "steelblue",
main = expression("Model (1): trace of " * beta),
xlab = "iteration", ylab = expression(beta))
plot(draws2[, "beta"], type = "l", col = "darkorange",
main = expression("Model (2): trace of " * beta^"*"),
xlab = "iteration", ylab = expression(beta^"*"))
acf(draws1[, "beta"], main = expression("ACF of " * beta * " (Model 1)"),
col = "steelblue", lwd = 2, lag.max = 100)
acf(draws2[, "beta"], main = expression("ACF of " * beta^"*" * " (Model 2)"),
col = "darkorange", lwd = 2, lag.max = 100)
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))
plot(draws1[, "alpha"], draws1[, "beta"], pch = 20, cex = 0.4,
col = adjustcolor("steelblue", alpha.f = 0.4),
xlab = expression(alpha), ylab = expression(beta),
main = sprintf("Model (1): cor = %.3f", cor1))
plot(draws2[, "alpha"], draws2[, "beta"], pch = 20, cex = 0.4,
col = adjustcolor("darkorange", alpha.f = 0.4),
xlab = expression(alpha^"*"), ylab = expression(beta^"*"),
main = sprintf("Model (2): cor = %.3f", cor2))
In Model (1) the chain crawls along the narrow ridge — visible both in the trace plots (slow, drifting excursions) and in the ACF (decays very slowly). In Model (2) traces look like white noise and the ACF collapses at lag 1.
We use coda::effectiveSize
(spectral-density estimator) and, as a cross-check, Geyer’s (1992)
initial-positive-sequence estimator:
# Geyer (1992) initial positive sequence: pair lags 2m + (2m+1), keep summing
# while the paired sum is positive. Guaranteed conservative (does not
# underestimate the integrated autocorrelation time).
ess_geyer <- function(x) {
n <- length(x)
rho <- as.numeric(acf(x, plot = FALSE,
lag.max = min(n - 1, 1000))$acf)[-1]
K <- floor(length(rho) / 2)
Gamma <- rho[2 * (1:K) - 1] + rho[2 * (1:K)]
stop_at <- which(Gamma <= 0)[1]
if (is.na(stop_at)) stop_at <- K + 1
m_max <- stop_at - 1 # number of positive paired sums
tau <- if (m_max < 1) 1 else 1 + 2 * sum(rho[1:(2 * m_max)])
n / max(tau, 1)
}
if (!requireNamespace("coda", quietly = TRUE)) {
install.packages("coda", repos = "https://cloud.r-project.org")
}
ess_coda <- function(z) as.numeric(coda::effectiveSize(z))
ess_tab <- rbind(
`Model (1) (coda)` = c(alpha = ess_coda(draws1[, "alpha"]),
beta = ess_coda(draws1[, "beta"]),
sigma2 = ess_coda(draws1[, "sigma2"])),
`Model (2) (coda)` = c(alpha = ess_coda(draws2[, "alpha"]),
beta = ess_coda(draws2[, "beta"]),
sigma2 = ess_coda(draws2[, "sigma2"])),
`Model (1) (Geyer)` = c(alpha = ess_geyer(draws1[, "alpha"]),
beta = ess_geyer(draws1[, "beta"]),
sigma2 = ess_geyer(draws1[, "sigma2"])),
`Model (2) (Geyer)` = c(alpha = ess_geyer(draws2[, "alpha"]),
beta = ess_geyer(draws2[, "beta"]),
sigma2 = ess_geyer(draws2[, "sigma2"]))
)
round(ess_tab, 1)
## alpha beta sigma2
## Model (1) (coda) 74.4 73.7 4554.1
## Model (2) (coda) 5026.9 4662.3 4714.4
## Model (1) (Geyer) 75.8 75.7 3773.4
## Model (2) (Geyer) 5000.0 4579.1 4728.9
Out of \(M = 5000\) post-burn-in draws, Model (2) yields ESS close to \(M\) for \(\alpha^{*}\) and \(\beta^{*}\), while Model (1) yields only a small handful of effective draws for \(\alpha\) and \(\beta\). The ESS of \(\sigma^2\) is essentially \(M\) in both — it isn’t entangled with the ridge.
A simple estimator that stops at the first negative autocorrelation and sums only the positive \(\rho_k\)’s can severely overestimate ESS for a very sticky chain: many early \(\rho_k\) are noisy estimates that dip negative purely by sampling error, the sum is truncated too early, and the resulting \(\hat\tau\) comes out far below the true integrated autocorrelation time. Geyer’s initial-positive-sequence estimator (used above) avoids this by summing paired lags \(\rho_{2m}+\rho_{2m+1}\) — a function known to be positive and monotone in \(m\) for a reversible chain.
Centering the covariate makes the columns of the design matrix orthogonal, so the conditional mean of \(\alpha^{*}\) given \(\beta^{*}\) no longer depends on \(\beta^{*}\) (and vice versa). Single-site Gibbs under (1) crawls along a steep ridge whose slope is \(-\bar x\), while under (2) it samples essentially i.i.d. The reparametrization costs nothing — recover \((\alpha,\beta)\) via \(\alpha = \alpha^{*} - \beta^{*}\bar x\), \(\beta = \beta^{*}\) — but the gain in MCMC efficiency is dramatic.