A few functions

post = function(theta){
  exp(-0.5*(prod(theta)^2+sum(theta^2)-8*sum(theta)))
}
gradient = function(theta){
  c(-theta[1]*theta[2]^2+theta[1]-4,
    -theta[2]*theta[1]^2+theta[2]-4)
}

leapfrog = function(theta,p,eps,iM){
  p     = p     + (eps/2)*gradient(theta)
  theta = theta + eps*iM%*%p
  p     = p     +  (eps/2)*gradient(theta)
  return(list(theta=theta,p=p))
}

A posterior density \(\pi(\theta)\)

The objective is to sample from a posterior density \[ \pi(\theta) = \kappa \exp\left\{-0.5(\theta_1^2\theta_2^2+\theta_1^2+\theta_2^2-8\theta_1-8\theta_2\right\}, \] where \(\kappa\) is the proportionality constant.

ngrid = 200
th1 = seq(-3,7,length=ngrid)
th2 = seq(-3,7,length=ngrid)
f = matrix(0,ngrid,ngrid)
for (i in 1:ngrid)
 for (j in 1:ngrid)
   f[i,j] = post(c(th1[i],th2[j]))
   
contour(th1,th2,f,drawlabels=FALSE,xlab=expression(theta[1]),ylab=expression(theta[2]))
title("Target distribution")

Log posterior and gradient

It can be easily shown that \[ \log \pi(\theta) = \log\kappa - 0.5\theta_1^2\theta_2^2+0.5\theta_1^2+0.5\theta_2^2-4\theta_1-4\theta_2, \] and that the gradient of \(\log \pi(\theta)\) is \[ \nabla_\theta \log \pi(\theta) = \left( \begin{array}{c} \frac{\partial\log \pi(\theta)}{\partial \theta_1}\\ \frac{\partial\log \pi(\theta)}{\partial \theta_2}\\ \end{array} \right) = \left( \begin{array}{c} -\theta_1\theta_2^2+\theta_1-4\\ -\theta_2\theta_1^2+\theta_2-4 \end{array} \right). \]

Random walk Metropolis-Hastings algorithm

Here we sample \(\theta^*\) from the random walk proposal \(N(\theta^{(i)},\sigma_\theta^2)\), where \(\sigma_\theta\) is the (local) standard deviation of the proposal density. Then, \(\theta^{(i+1)}=\theta^*\) with probability \[ \alpha=\min\left\{1,\frac{\pi(\theta^*)}{\pi(\theta^{(i)})}\right\}, \] and \(\theta^{(i+1)}=\theta^{(i)}\) with probability \(1-\alpha\).

sd.th  = 0.1
burnin = 10000
N      = 10000
niter  = burnin + N
thetas.rwmh = matrix(0,niter,2)
theta = c(0,0)
set.seed(32425)
for (iter in 1:niter){
  theta.new = rnorm(2,theta,sd.th)
  accept = min(1,post(theta.new)/post(theta))
  if (runif(1)<accept){
    theta = theta.new
  }
  thetas.rwmh[iter,] = theta
}

Hamiltonian MC algorithm

Samuel Thomas and Wanzhu Tu (2020) start Section 3 of their Learning HMC in R paper (https://arxiv.org/abs/2006.16194) by saying that:

HMC improves the efficiency of MH by employing a guided proposal generation scheme. More specifically, HMC uses the gradient of the log posterior to direct the Markov chain towards regions of higher posterior density, where most samples are taken. As a result, a well-tuned HMC chain will accept proposals at a much higher rate than the traditional MH algorithm.

Then in Section 3.1 they say that:

To generate \(\theta\) in a region of high posterior density, one needs to sample \(\theta\) in the region corresponding to the lower values of \(-\log \pi(\theta)\); the region can be reached with the guidance of the gradient of \(-\log \pi(\theta)\). In a sense, the approach is analogous to the movement of a hypothetical object on a frictionless curve, where the object traverses and lingers at the bottom of the valley while occasionally visiting the higher grounds on both sides. In classical mechanics, such movements are described by the Hamiltonian equations, where the exchanges of kinetic and potential energy dictate the object’s location at any given moment. In a Hamiltonian system, the horizontal and vertical positions are given by \((\theta,p)\). In MCMC, we are interested in \(\theta\). The parameter \(p\), which is often referred to as the momentum, is an auxiliary quantity that we use to simulate \(\theta\) under the Hamiltonian equations.

They continue by saying that:

We write the Hamiltonian function as \(H(\theta,p)\), which consists of potential energy \(U(\theta)\) and kinetic energy \(K(p)\): \[ H(\theta,p) = U(\theta) + K(p), \ \ \ \mbox{where} \ p \ \mbox{and} \ \theta \in {\rm I\!R}^k, \] where, in MCMC problems, the \(U(\theta) = -\log \pi(\theta)\), while for momentum \(p \sim N_k(0,M)\). Therefore, \[ H(\theta,p) = -\log \pi(\theta) + 0.5p^TM^{-1}p, \] such that, over time, HMC travels on trajectories that are governed by the following first-order differential equations, known as the Hamiltonian equations: \[\begin{eqnarray*} \frac{dp}{dt} &=& - \frac{\partial H(\theta,p)}{\partial \theta} = -\frac{\partial U(\theta)}{\partial \theta} = \nabla_\theta \log \pi(\theta),\\ \frac{d\theta}{dt} &=& \frac{\partial H(\theta,p)}{\partial p} = \frac{\partial K(p)}{\partial p} = M^{-1}p. \end{eqnarray*}\] where \(\nabla_\theta \log \pi(\theta)\) is the gradient of the log posterior density. A solution to the Hamiltonian equations is a function that defines the path of \((\theta,p)\) from which specific values of \(\theta\) could be sampled. Within an MCMC iteration, we sample a value \(\theta\) from this path. The randomness of the MCMC samples comes from the momentum \(p \sim N(0,M)\) and the specific \(\theta\) value we choose. Solving the Hamiltonian equations becomes a central problem, which is usually tackled by Euler’s discretization method. In HMC, the leapfrog algorithm is the alternative to Euler’s method that uses a discrete step size \(\epsilon\) individually for \(p\) and \(\theta\), with a full step \(\epsilon\) in \(\theta\)between two half-steps \(\epsilon/2\) for \(p\), \[\begin{eqnarray*} p(t+\epsilon/2) &=& p(t) + (\epsilon/2)\nabla_\theta \log \pi(\theta(t)),\\ \theta(t+\epsilon) &=& \theta(t) + \epsilon M^{-1}p(t+\epsilon/2),\\ p(t+\epsilon) &=& p(t+\epsilon/2) + (\epsilon/2)\nabla_\theta \log \pi(\theta(t+\epsilon)). \end{eqnarray*}\] For HMC, multiple leapfrog steps are typically required to move a sufficient distance to the next proposal. As with other valid MCMC algorithms, HMC’s transition probability is designed to meet the theoretical requirements for detailed balance and reversibility. These conditions ensure that our HMC samples provide a valid representation of the posterior distribution.

They also discuss the choices of the step size \(\epsilon\), the number of leapfrog steps \(L\) and the covariance matrix \(M\):

The efficiency of an HMC algorithm can be improved through parameter tuning and reparameterization. HMC tuning involves selection and adjustment of the various HMC parameters. Two parameters that need to be specified are the step size \(\epsilon\) and the number of leapfrog steps \(L\). Elements in the covariance matrix \(M\) may also be adjusted from the default identity matrix for efficiency improvement. It is generally a good practice to set \(\epsilon\) to a smaller value relative to the magnitude of the parameter of interest. A smaller \(\epsilon\) results in closer approximations and thus higher acceptance rates. But a small \(\epsilon\) must be coupled with a large \(L\) to ensure the trajectory length \(\epsilon L\) is large enough to move the simulated parameter to a distant point in the distribution. On the other hand, if \(\epsilon L\) is too large the trajectory is likely to circle back, causing waste in simulation. To tune \(\epsilon\) and \(L\) is to find the right combinations of these values, which are usually chosen via monitoring the acceptance rate. Neal et al. (2011) suggested an optimal acceptance rate is approximately 65%. At the same time, it is often helpful to examine the trace plots of the MCMC samples for signs of autocorrelation. Slow-moving chains with stronger autocorrelation often indicate insufficient \(\epsilon L\). While \(\epsilon\) and \(L\) can be tuned jointly, most analysts choose to select the step size first, then under a given step size, they fine-tune the number of steps per leapfrog \(L\).

There are lots of books and papers about Hamiltonian MC methods. I recommend a few here:

For more depth and breath on HMC schemes, I recommend:

A few more references appear in my STAN/RSTAN illustrative note at http://hedibert.org/wp-content/uploads/2021/02/stan-rstan-examples.html.

The generic HMC algorithm

  • Start at \(\theta^{(0)}\)

  • For \(i=1,\ldots,N\)

    • Draw \(p \sim N(0,M)\)

    • \(\theta^* \leftarrow \theta^{(i-1)}\)

    • \(p^* \leftarrow p\)

    • For \(j=1,\ldots,L\)

      • \((p^*,\theta^*) \leftarrow\) leapfrog(\(\theta^*\), \(p^*\), \(\epsilon\), \(M\))
    • \(\theta^{(i)} = \theta^*\) with probability \[ \alpha = \min\left\{\frac{\pi(\theta^*)}{\pi(\theta^{(i-1)})} \frac{\exp\{-0.5p^{*T}M^{-1}p^*\}}{\exp\{-0.5p^TM^{-1}p\}} \right\} \] where leapfrog\((\theta,p,\epsilon,M)\) is a three-step routine that takes \((\theta,p,\epsilon,M)\) as input, runs the following three steps, \[\begin{eqnarray*} p &\leftarrow& p + \frac{\epsilon}{2}\nabla_\theta \log \pi(\theta) \\ \theta &\leftarrow& \theta + \epsilon M^{-1}p\\ p &\leftarrow& p + \frac{\epsilon}{2}\nabla_\theta \log \pi(\theta) \end{eqnarray*}\] and produces \((p,\theta)\) as output.

set.seed(123456)
M      = diag(1,2)
iM     = solve(M)
tcM    = t(chol(M))
L      = 100
eps    = 0.001
theta  = c(0,0)
niter  = burnin + N
thetas.hmc = matrix(0,niter,2)
for (iter in 1:niter){
  p      = tcM%*%rnorm(2)
  theta1 = theta
  p1     = p
  for (i in 1:L){
    run = leapfrog(theta1,p1,eps,iM)
    p1  = run$p
    theta1 = run$theta
  }
  term1 = post(theta1)/post(theta)
  term2 = exp(-0.5*t(p1)%*%iM%*%p1)/exp(-0.5*t(p)%*%iM%*%p)
  accept = min(1,term1*term2)
  if (runif(1)<accept){
    theta = theta1
    p     = - p1
  }
  thetas.hmc[iter,]  = theta
}

Comparison

par(mfrow=c(2,6))
for (i in 1:2){
  ts.plot(thetas.rwmh[,i],xlab="Iterations",ylab=paste("theta",i,sep=""),main="RW-MH")
  acf(thetas.rwmh[,i],main="")
  hist(thetas.rwmh[,i],prob=TRUE,main="")
}
for (i in 1:2){
  ts.plot(thetas.hmc[,i],xlab="Iterations",ylab=paste("theta",i,sep=""),main="HMC")
  acf(thetas.hmc[,i],main="")
  hist(thetas.hmc[,i],prob=TRUE,main="")
}

More comparisons

par(mfrow=c(1,2))

plot(thetas.rwmh,xlim=range(th1),ylim=range(th2),pch=16,col=grey(0.8),
     xlab=expression(theta[1]),ylab=expression(theta[2]))
contour(th1,th2,f,drawlabels=FALSE,add=TRUE)
title("Random walk Metropolis-Hastings")

plot(thetas.hmc,xlim=range(th1),ylim=range(th2),pch=16,col=grey(0.8),
     xlab=expression(theta[1]),ylab=expression(theta[2]))
contour(th1,th2,f,drawlabels=FALSE,add=TRUE)
title("Hamiltonian Monte Carlo")