Generating Data with a Specified Correlation

David C. Howell


It is quite easy to generate a set of data that represents a sample from a population with a specified correlation coefficient r. I don't have the time right now to write out a specific program. However, the basic steps are very simple. The procedure will not generate a data set with exactly the correlation you specify. Instead it will draw data from a population whose correlation parameter (ρ) is that correlation.

  • Use the normal random number function available in almost all software to generate two random variables (X and Y).
  • Standardize these variables to mean = 0, sd = 1.
  • Calculate a = r/sqrt(1 - r^2).
  • Calculate Z = a*X + Y.
  • Adjust the means and variances of X and Z to what you want them to be by simple linear transformations--(e.g., Xnew = Xold*NewSD + NewMean).
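  • Now the correlation between X and Z will be r in the population; the correlation in your particular sample will vary around r.
  • Before any such adjustment, the mean of Z will be 0.00, and its standard deviation will be approximately sqrt(a^2 + 1).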
  • If you don't standardize the variables, I would assume that the resulting r will still come from a population where ρ = r, provided X and Y were generated with equal variances, but I haven't worked this out formally. If anyone knows for sure, I'd appreciate hearing.
  • I got this idea from an electronic message from Marco Welton, at University College Cork, Ireland, but I'm sure that it is not original with him. If you want a program in SPSS or R that will generate a data set with an exact correlation matrix, go to CorrGen2.html. That program will handle a matrix with many variables, not just two.
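
The steps above can be sketched in Python roughly as follows. This is only a minimal illustration, not a program from the original page. The function names (pearson_r, standardize, correlated_sample), their parameters, and the choice to divide Z by sqrt(a^2 + 1) before rescaling are all assumptions made for the example. The reason the method works is that, with X and Y uncorrelated and each having variance 1, the covariance of X and Z is a and the variance of Z is a^2 + 1, so their correlation is a/sqrt(a^2 + 1), which simplifies to r.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def standardize(values):
    """Rescale a sample to mean 0 and (n - 1) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def correlated_sample(r, n, x_mean=0.0, x_sd=1.0, z_mean=0.0, z_sd=1.0, seed=None):
    """Draw n (X, Z) pairs from a population with correlation r (hypothetical helper)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    rng = random.Random(seed)

    # Two independent normal variables, standardized to mean 0, sd 1
    x = standardize([rng.gauss(0.0, 1.0) for _ in range(n)])
    y = standardize([rng.gauss(0.0, 1.0) for _ in range(n)])

    # a = r / sqrt(1 - r^2), then Z = a*X + Y
    a = r / math.sqrt(1.0 - r ** 2)
    z = [a * xi + yi for xi, yi in zip(x, y)]

    # Z has sd close to sqrt(a^2 + 1); divide that out (an assumption of this
    # sketch) so that z_sd is approximately the final sd of Z
    scale = math.sqrt(a ** 2 + 1.0)

    # Linear transformations to the requested means and sds
    x_new = [xi * x_sd + x_mean for xi in x]
    z_new = [zi / scale * z_sd + z_mean for zi in z]
    return x_new, z_new


x, z = correlated_sample(r=0.6, n=20000, x_mean=50, x_sd=10, z_mean=100, z_sd=15, seed=1)
print(round(pearson_r(x, z), 3))  # should land near 0.6, not exactly on it
```

With a large n the printed correlation should fall close to the requested value; with small samples it can wander quite a bit, which is exactly the sampling variability described at the top of this page.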




    Send mail to: David.Howell@uvm.edu