1 Introduction

This document illustrates how a simple reparametrization of the simple linear regression model can dramatically improve the mixing of a Gibbs sampler. We compare:

In both cases we use the standard noninformative prior \(p(\alpha,\beta,\sigma^{2}) \propto 1/\sigma^{2}\).

Key point. To see the mixing difference one must update the regression coefficients one at a time (true single-site Gibbs). A block update of \((\alpha,\beta)\) jointly from \(N\!\left((X^\top X)^{-1}X^\top y,\sigma^2(X^\top X)^{-1}\right)\) is nearly i.i.d. in both parametrizations — it sidesteps the very problem we want to expose. Below we therefore implement the single-site Gibbs sampler.

2 Single-site full conditionals

With \(p(\alpha,\beta,\sigma^2)\propto 1/\sigma^2\),

\[ \alpha \mid \beta,\sigma^{2},\boldsymbol y,\boldsymbol x \;\sim\; N\!\left(\bar y - \beta\bar x,\; \frac{\sigma^{2}}{n}\right), \]

\[ \beta \mid \alpha,\sigma^{2},\boldsymbol y,\boldsymbol x \;\sim\; N\!\left(\frac{\sum_i x_i(y_i-\alpha)}{\sum_i x_i^{2}},\; \frac{\sigma^{2}}{\sum_i x_i^{2}}\right), \]

\[ \sigma^{2}\mid \alpha,\beta,\boldsymbol y,\boldsymbol x \;\sim\; \mathrm{IG}\!\left(\frac{n}{2},\; \frac{\sum_i(y_i-\alpha-\beta x_i)^{2}}{2}\right). \]

For Model (2) the same formulas hold with \(x_i\) replaced by \(x_i-\bar x\), \(\alpha\) by \(\alpha^{*}\) and \(\beta\) by \(\beta^{*}\).

The geometric source of the mixing problem is now visible. Conditional on \(\beta\), the mean of \(\alpha\) is \(\bar y - \beta\bar x\): when \(\bar x \neq 0\), any move in \(\beta\) forces a deterministic shift in the conditional mean of \(\alpha\) of size \(-\bar x\,\Delta\beta\). The chain therefore walks along the ridge \(\alpha + \beta\bar x = \bar y\) one tiny step at a time. Under (2), \(\bar x\) is replaced by \(\overline{x-\bar x}=0\), so the conditional mean of \(\alpha^{*}\) is just \(\bar y\), independent of \(\beta^{*}\). That is exactly why (2) mixes so well.

3 Simulating the data

n     <- 100
alpha <- 1
beta  <- 3
sigma <- 1

x  <- rnorm(n, mean = 5, sd = 1)
y  <- alpha + beta * x + rnorm(n, sd = sigma)
xc <- x - mean(x)            # centered covariate

c(xbar = mean(x), ybar = mean(y))
##      xbar      ybar 
##  5.149703 16.491752

4 A single-site Gibbs sampler

One function does the job; passing x runs (1), passing x - mean(x) runs (2).

gibbs_lm_scalar <- function(y, x, M = 5000, burn = 1000,
                            init = c(0, 0, 1)) {
  n   <- length(y)
  sx  <- sum(x)
  sx2 <- sum(x^2)
  sxy <- sum(x * y)
  sy  <- sum(y)

  a <- init[1]; b <- init[2]; s2 <- init[3]

  keep <- matrix(NA_real_, M + burn, 3,
                 dimnames = list(NULL, c("alpha", "beta", "sigma2")))

  for (i in seq_len(M + burn)) {
    # alpha | beta, sigma^2  ~ N( (sum(y) - beta*sum(x))/n , sigma^2/n )
    mu_a <- (sy - b * sx) / n
    a    <- rnorm(1, mean = mu_a, sd = sqrt(s2 / n))

    # beta  | alpha, sigma^2  ~ N( (sum(x*y) - alpha*sum(x))/sum(x^2), sigma^2/sum(x^2) )
    mu_b <- (sxy - a * sx) / sx2
    b    <- rnorm(1, mean = mu_b, sd = sqrt(s2 / sx2))

    # sigma^2 | alpha, beta   ~ IG( n/2 , sum((y - alpha - beta*x)^2)/2 )
    rss <- sum((y - a - b * x)^2)
    s2  <- 1 / rgamma(1, shape = n / 2, rate = rss / 2)

    keep[i, ] <- c(a, b, s2)
  }
  keep[(burn + 1):(burn + M), , drop = FALSE]
}

5 Running both samplers

M    <- 5000
burn <- 1000

draws1 <- gibbs_lm_scalar(y, x,  M = M, burn = burn, init = c(0, 0, 1))  # Model (1)
draws2 <- gibbs_lm_scalar(y, xc, M = M, burn = burn, init = c(0, 0, 1))  # Model (2)

Posterior summaries:

summ <- function(z) c(mean = mean(z), sd = sd(z),
                      q025 = quantile(z, 0.025, names = FALSE),
                      q975 = quantile(z, 0.975, names = FALSE))

tab <- rbind(
  alpha       = summ(draws1[, "alpha"]),
  beta        = summ(draws1[, "beta"]),
  sigma2_m1   = summ(draws1[, "sigma2"]),
  alpha_star  = summ(draws2[, "alpha"]),
  beta_star   = summ(draws2[, "beta"]),
  sigma2_m2   = summ(draws2[, "sigma2"])
)
round(tab, 4)
##               mean     sd    q025    q975
## alpha       0.1531 0.6046 -1.0402  1.4023
## beta        3.1726 0.1154  2.9342  3.4010
## sigma2_m1   1.0938 0.1585  0.8258  1.4474
## alpha_star 16.4919 0.1047 16.2870 16.6954
## beta_star   3.1607 0.1120  2.9351  3.3769
## sigma2_m2   1.0899 0.1608  0.8193  1.4530

As expected, \(\widehat\alpha \approx \widehat{\alpha^{*}} - \widehat{\beta^{*}}\bar x\) and \(\widehat\beta = \widehat{\beta^{*}}\).

6 Posterior correlation

cor1 <- cor(draws1[, "alpha"], draws1[, "beta"])
cor2 <- cor(draws2[, "alpha"], draws2[, "beta"])
c(`cor(alpha, beta)`   = round(cor1, 4),
  `cor(alpha*, beta*)` = round(cor2, 4))
##   cor(alpha, beta) cor(alpha*, beta*) 
##            -0.9854             0.0038

With \(\bar x \approx 5 \gg 0\), \(\alpha\) and \(\beta\) are very strongly negatively correlated in the posterior of Model (1) (cor \(\approx -0.98\)), whereas \(\alpha^{*}\) and \(\beta^{*}\) are nearly uncorrelated.

7 Diagnostics

par(mfrow = c(3, 2), mar = c(4, 4, 3, 1))

plot(draws1[, "alpha"], type = "l", col = "steelblue",
     main = expression("Model (1): trace of " * alpha),
     xlab = "iteration", ylab = expression(alpha))
plot(draws2[, "alpha"], type = "l", col = "darkorange",
     main = expression("Model (2): trace of " * alpha^"*"),
     xlab = "iteration", ylab = expression(alpha^"*"))

plot(draws1[, "beta"], type = "l", col = "steelblue",
     main = expression("Model (1): trace of " * beta),
     xlab = "iteration", ylab = expression(beta))
plot(draws2[, "beta"], type = "l", col = "darkorange",
     main = expression("Model (2): trace of " * beta^"*"),
     xlab = "iteration", ylab = expression(beta^"*"))

acf(draws1[, "beta"], main = expression("ACF of " * beta * " (Model 1)"),
    col = "steelblue", lwd = 2, lag.max = 100)
acf(draws2[, "beta"], main = expression("ACF of " * beta^"*" * " (Model 2)"),
    col = "darkorange", lwd = 2, lag.max = 100)

par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

plot(draws1[, "alpha"], draws1[, "beta"], pch = 20, cex = 0.4,
     col = adjustcolor("steelblue", alpha.f = 0.4),
     xlab = expression(alpha), ylab = expression(beta),
     main = sprintf("Model (1):  cor = %.3f", cor1))

plot(draws2[, "alpha"], draws2[, "beta"], pch = 20, cex = 0.4,
     col = adjustcolor("darkorange", alpha.f = 0.4),
     xlab = expression(alpha^"*"), ylab = expression(beta^"*"),
     main = sprintf("Model (2):  cor = %.3f", cor2))

In Model (1) the chain crawls along the narrow ridge — visible both in the trace plots (slow, drifting excursions) and in the ACF (decays very slowly). In Model (2) traces look like white noise and the ACF collapses at lag 1.

8 Effective sample size

We use coda::effectiveSize (spectral-density estimator) and, as a cross-check, Geyer’s (1992) initial-positive-sequence estimator:

# Geyer (1992) initial positive sequence: pair lags 2m + (2m+1), keep summing
# while the paired sum is positive. Guaranteed conservative (does not
# underestimate the integrated autocorrelation time).
ess_geyer <- function(x) {
  n   <- length(x)
  rho <- as.numeric(acf(x, plot = FALSE,
                        lag.max = min(n - 1, 1000))$acf)[-1]
  K   <- floor(length(rho) / 2)
  Gamma <- rho[2 * (1:K) - 1] + rho[2 * (1:K)]
  stop_at <- which(Gamma <= 0)[1]
  if (is.na(stop_at)) stop_at <- K + 1
  m_max <- stop_at - 1                       # number of positive paired sums
  tau   <- if (m_max < 1) 1 else 1 + 2 * sum(rho[1:(2 * m_max)])
  n / max(tau, 1)
}

if (!requireNamespace("coda", quietly = TRUE)) {
  install.packages("coda", repos = "https://cloud.r-project.org")
}
ess_coda <- function(z) as.numeric(coda::effectiveSize(z))

ess_tab <- rbind(
  `Model (1) (coda)`  = c(alpha  = ess_coda(draws1[, "alpha"]),
                          beta   = ess_coda(draws1[, "beta"]),
                          sigma2 = ess_coda(draws1[, "sigma2"])),
  `Model (2) (coda)`  = c(alpha  = ess_coda(draws2[, "alpha"]),
                          beta   = ess_coda(draws2[, "beta"]),
                          sigma2 = ess_coda(draws2[, "sigma2"])),
  `Model (1) (Geyer)` = c(alpha  = ess_geyer(draws1[, "alpha"]),
                          beta   = ess_geyer(draws1[, "beta"]),
                          sigma2 = ess_geyer(draws1[, "sigma2"])),
  `Model (2) (Geyer)` = c(alpha  = ess_geyer(draws2[, "alpha"]),
                          beta   = ess_geyer(draws2[, "beta"]),
                          sigma2 = ess_geyer(draws2[, "sigma2"]))
)
round(ess_tab, 1)
##                    alpha   beta sigma2
## Model (1) (coda)    74.4   73.7 4554.1
## Model (2) (coda)  5026.9 4662.3 4714.4
## Model (1) (Geyer)   75.8   75.7 3773.4
## Model (2) (Geyer) 5000.0 4579.1 4728.9

Out of \(M = 5000\) post-burn-in draws, Model (2) yields ESS close to \(M\) for \(\alpha^{*}\) and \(\beta^{*}\), while Model (1) yields only a small handful of effective draws for \(\alpha\) and \(\beta\). The ESS of \(\sigma^2\) is essentially \(M\) in both — it isn’t entangled with the ridge.

9 Why a naive ESS estimator can mislead

A simple estimator that stops at the first negative autocorrelation and sums only the positive \(\rho_k\)’s can severely overestimate ESS for a very sticky chain: many early \(\rho_k\) are noisy estimates that dip negative purely by sampling error, the sum is truncated too early, and the resulting \(\hat\tau\) comes out far below the true integrated autocorrelation time. Geyer’s initial-positive-sequence estimator (used above) avoids this by summing paired lags \(\rho_{2m}+\rho_{2m+1}\) — a function known to be positive and monotone in \(m\) for a reversible chain.

10 Conclusion

Centering the covariate makes the columns of the design matrix orthogonal, so the conditional mean of \(\alpha^{*}\) given \(\beta^{*}\) no longer depends on \(\beta^{*}\) (and vice versa). Single-site Gibbs under (1) crawls along a steep ridge whose slope is \(-\bar x\), while under (2) it samples essentially i.i.d. The reparametrization costs nothing — recover \((\alpha,\beta)\) via \(\alpha = \alpha^{*} - \beta^{*}\bar x\), \(\beta = \beta^{*}\) — but the gain in MCMC efficiency is dramatic.