Yet another example of how to perform Bayesian learning in a traditionally difficult problem: learning about the correlation in a standard bivariate normal distribution. More specifically, let \(x_1,\ldots,x_n\), where \(x_i=(x_{i1},x_{i2})'\), be a random sample from the bivariate normal distribution with zero means, unit variances and correlation \(\rho\). The joint probability density function of \(x_1,\ldots,x_n\) is \[ p(x_1,\ldots,x_n|\rho) = \left(\frac{1}{2\pi\sqrt{1-\rho^2}}\right)^n\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\sum_{i=1}^n (x_{i1}^2+x_{i2}^2)-2\rho\sum_{i=1}^n x_{i1}x_{i2}\right]\right\}. \] Notice that \(p(x_1,\ldots,x_n|\rho)\), viewed as a function of \(\rho\), i.e. the likelihood function, depends on the data only through \(n\), \(\sum_{i=1}^n x_{i1}^2\), \(\sum_{i=1}^n x_{i2}^2\) and \(\sum_{i=1}^n x_{i1}x_{i2}\). These four quantities are the sufficient statistics; the three sums translate into the \(2 \times 2\) matrix \(X'X\), where \(X=(x_1,\ldots,x_n)'\). This is a property of the bivariate Gaussian distribution: should you select another bivariate distribution to model the pairs \((x_{i1},x_{i2})\), you might have other sufficient statistics or none at all. The R package mvtnorm can be used to draw from, and to evaluate the density of, any multivariate normal distribution:
library("mvtnorm")
# Evaluate the bivariate normal density at the point (1,0)
x = c(1,0)
m = c(0,0)
rho = 0.8
S = matrix(c(1,rho,rho,1),2,2)   # covariance (here, correlation) matrix
dmvnorm(x,mean=m,sigma=S)
## [1] 0.06614273
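As a quick sanity check (a sketch of my own, not part of the original analysis), we can verify numerically that the likelihood depends on the data only through \(n\) and \(X'X\): the log-likelihood computed from the sufficient statistics should match the sum of log-densities returned by dmvnorm.
# Sketch: log-likelihood built from the sufficient statistics n and X'X,
# following the joint density formula above
loglik.suff = function(rho,XtX,n){
  -n*log(2*pi)-(n/2)*log(1-rho^2)-
    (XtX[1,1]+XtX[2,2]-2*rho*XtX[1,2])/(2*(1-rho^2))
}
X = rmvnorm(10,mean=c(0,0),sigma=matrix(c(1,0.5,0.5,1),2,2))
loglik.suff(0.3,t(X)%*%X,nrow(X))                                      # from X'X only
sum(dmvnorm(X,mean=c(0,0),sigma=matrix(c(1,0.3,0.3,1),2,2),log=TRUE))  # from the full data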
Here we simulate \(n=14\) observations from the standard bivariate normal with correlation \(\rho=-0.7\).
# Simulated data
set.seed(12345)
r = -0.7
m = c(0,0)
S = matrix(c(1,r,r,1),2,2)
y = rmvnorm(14,mean=m,sigma=S)
par(mfrow=c(1,1))
plot(y,xlab="X",ylab="Y",main="")
title(paste("Simulated data\n Sample correlation=",round(cor(y)[1,2],3),sep=""))
The second dataset contains the heights (in cm) and weights (in kg) of 14 students from the Bayesian Learning course. I have excluded the 15th observation (myself) because we concluded it was an outlier.
x = matrix(c(178,70,175,80,184,82,200,89,183,80,173,65,185,70,162,70,190,85,
174,80,180,87,165,74,176,72,174,63,197,120),15,2,byrow=TRUE)
x = x[1:14,]   # drop the 15th observation (the outlier)
x[,1]=(x[,1]-mean(x[,1]))/sd(x[,1])   # standardize heights
x[,2]=(x[,2]-mean(x[,2]))/sd(x[,2])   # standardize weights
par(mfrow=c(1,1))
plot(x,xlab="Height (standardized)",ylab="Weight (standardized)",main="")
title(paste("Real data\n Sample correlation=",round(cor(x)[1,2],3),sep=""))
Since \(\rho \in (-1,1)\), a natural noninformative prior is the uniform \(U(-1,1)\), with density \(p(\rho)=1/2\) for \(\rho \in (-1,1)\). An alternative noninformative prior is the arc-sine prior, with density \[ p(\rho) = \frac{1}{\pi\sqrt{1-\rho^2}}. \]
rhos = seq(-0.999,0.999,length=100)   # avoid the endpoints, where the arc-sine density diverges
plot(rhos,1/(pi*sqrt(1-rhos^2)),xlab=expression(rho),ylab="Prior density",lwd=2,type="l")
lines(rhos,rep(1/2,100),col=2,lwd=2)
legend("top",legend=c("arc-sine prior","uniform prior"),col=1:2,lwd=2,bty="n")
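The plot makes clear that the arc-sine prior piles up mass near the endpoints \(\pm 1\). A quick calculation (my own addition, using the arcsine CDF \(F(\rho)=\arcsin(\rho)/\pi+1/2\) implied by the density above) makes the difference concrete: the arc-sine prior puts about 29% of its mass on \(|\rho|>0.9\), versus 10% for the uniform.
arcsine.cdf = function(r){asin(r)/pi+1/2}       # CDF of the arc-sine prior
c(uniform=0.1,arcsine=2*(1-arcsine.cdf(0.9)))   # P(|rho|>0.9) under each prior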
The posterior for \(\rho\) is \[ p(\rho|x_1,\ldots,x_n) \propto p(\rho)\prod_{i=1}^n \frac{1}{2\pi\sqrt{1-\rho^2}}\exp\left\{-\frac{x_{i1}^2+x_{i2}^2-2\rho x_{i1}x_{i2}}{2(1-\rho^2)}\right\}, \] which has no known closed form under either prior. We will use the uniform prior, \(p_u(\rho)\), as the proposal in the SIR (sampling importance resampling) algorithm below, for both data sets and under both prior specifications. Since the proposal is constant on \((-1,1)\), the importance weights \(w(\rho) \propto p(\rho)p(x_1,\ldots,x_n|\rho)/p_u(\rho)\) reduce to the likelihood under the uniform prior, and to the likelihood times the arc-sine density under the arc-sine prior.
M = 10000   # number of draws from the proposal
N = M       # size of the resample
rho = runif(M,-1,1)   # draws from the uniform proposal
w.y = matrix(0,M,2)   # weights for the simulated data (uniform, arc-sine)
w.x = matrix(0,M,2)   # weights for the real data (uniform, arc-sine)
for (i in 1:M){
  Sigma = matrix(c(1,rho[i],rho[i],1),2,2)
  prior = 1/sqrt(1-rho[i]^2)   # arc-sine prior density (the constant 1/pi cancels)
  w.y[i,1] = prod(dmvnorm(y,mean=m,sigma=Sigma))
  w.y[i,2] = prod(dmvnorm(y,mean=m,sigma=Sigma))*prior
  w.x[i,1] = prod(dmvnorm(x,mean=m,sigma=Sigma))
  w.x[i,2] = prod(dmvnorm(x,mean=m,sigma=Sigma))*prior
}
rho.y = matrix(0,N,2)
rho.x = matrix(0,N,2)
# Resample the proposal draws with probabilities proportional to the weights
for (i in 1:2){
  rho.y[,i] = sample(rho,size=N,replace=TRUE,prob=w.y[,i])
  rho.x[,i] = sample(rho,size=N,replace=TRUE,prob=w.x[,i])
}
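Before trusting the resampled draws it is worth checking the effective sample size of the importance weights, a standard SIR diagnostic (my own addition, not part of the original analysis); values far below \(M\) would signal a poor proposal.
ess = function(w){sum(w)^2/sum(w^2)}   # effective sample size of a weight vector
round(c(ess(w.y[,1]),ess(w.y[,2]),ess(w.x[,1]),ess(w.x[,2])))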
par(mfrow=c(2,3),mai=c(0.55,0.55,0.2,0.2))
plot(y,xlab="",ylab="",main="Simulated data")
mtext(side=1,line=2,"y1")
mtext(side=2,line=2,"y2")
hist(rho.y[,1],breaks=30,prob=TRUE,xlab="",ylab="",main="",xlim=c(-1,1),ylim=c(0,4))
box()
title("From uniform prior")
mtext(side=1,line=2,expression(rho))
mtext(side=2,line=2,"Posterior density")
hist(rho.y[,2],breaks=30,prob=TRUE,xlab="",ylab="",main="",xlim=c(-1,1),ylim=c(0,4))
box()
title("From arc-sin prior")
mtext(side=1,line=2,expression(rho))
mtext(side=2,line=2,"Posterior density")
plot(x,xlab="",ylab="",main="Real data")
mtext(side=1,line=2,"x1=Height (standardized)")
mtext(side=2,line=2,"x2=Weight (Standardized)")
hist(rho.x[,1],breaks=30,prob=TRUE,xlab="",ylab="",main="",xlim=c(-1,1),ylim=c(0,4))
box()
mtext(side=1,line=2,expression(rho))
mtext(side=2,line=2,"Posterior density")
hist(rho.x[,2],breaks=30,prob=TRUE,xlab="",ylab="",main="",xlim=c(-1,1),ylim=c(0,4))
box()
mtext(side=1,line=2,expression(rho))
mtext(side=2,line=2,"Posterior density")
We can then compute posterior summaries for \(\rho\) based on both data sets (first column: uniform prior; second column: arc-sine prior):
# Simulated data
round(apply(rho.y,2,quantile,c(0.025,0.5,0.975)),3)
## [,1] [,2]
## 2.5% -0.856 -0.855
## 50% -0.712 -0.710
## 97.5% -0.347 -0.331
# Real data
round(apply(rho.x,2,quantile,c(0.025,0.5,0.975)),3)
## [,1] [,2]
## 2.5% 0.102 0.081
## 50% 0.569 0.558
## 97.5% 0.784 0.783
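Since \(\rho\) is a scalar, the posterior can also be evaluated deterministically on a fine grid, which gives an independent check on the SIR output. The sketch below (my own addition; the uniform prior and the 2000-point grid are assumptions) recovers the posterior median for the real data, which should be close to the SIR-based value of 0.569 reported above.
# Sketch: grid evaluation of p(rho|x) under the uniform prior
grid = seq(-0.999,0.999,length=2000)
logpost = sapply(grid,function(r){
  sum(dmvnorm(x,mean=m,sigma=matrix(c(1,r,r,1),2,2),log=TRUE))})
post = exp(logpost-max(logpost))   # exponentiate on a stable scale
post = post/sum(post)              # normalize over the grid
grid[which.min(abs(cumsum(post)-0.5))]   # grid-based posterior median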
The more curious or interested student can check out the paper by Fosdick and Raftery (2012), "Estimating the Correlation in Bivariate Normal Data With Known Variances and Small Sample Sizes", The American Statistician, 66(1), 34-41. https://sites.stat.washington.edu/raftery/Research/PDF/Fosdick2012.pdf
For a non-Bayesian approach, see Azen and Reed (1973), "Maximum Likelihood Estimation of Correlation between Variates Having Equal Coefficients of Variation", Technometrics, 15(3), 457-462. https://doi.org/10.2307/1266851