Generating Data with a Specified Correlation

David C. Howell

It is quite easy to generate a set of data that represents a sample from a population with a specified correlation coefficient r. I don't have the time right now to write out a specific program. However, the basic steps are very simple. Note that this procedure will not generate a data set with exactly the correlation you specify. Instead, it will draw data from a population whose correlation parameter (ρ) equals the value you specify.

• Use the normal random number function available in almost all software to generate two random variables (X and Y).
• Standardize these variables to mean = 0, sd = 1.
• Calculate a = r/sqrt(1 - r^2).
• Calculate Z = a*X + Y.
• Adjust the means and variances of X and Z to what you want them to be by simple linear transformations (e.g., Xnew = Xold*NewSD + NewMean).
• X and Z now represent a sample from a population in which the correlation is r.
• Before any adjustment of means and variances, the mean of Z will be 0.00, and its standard deviation will be about sqrt(a^2 + 1).
• If you don't standardize the variables I would assume that the resulting r will come from a population where rho = 0, but I haven't worked this out. If anyone knows for sure, I'd appreciate hearing.
• I got this idea from an electronic message from Marco Welton, at University College Cork, Ireland, but I'm sure that it is not original with him. If you want a program in SPSS or R that will generate a data set with an exact correlation matrix, go to CorrGen2.html. That program will handle a matrix with many variables, not just two.
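
The steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SPSS or R program mentioned above: the function names (standardize, pearson_r, correlated_sample) and the parameters (x_mean, x_sd, z_mean, z_sd, seed) are invented for this example, and it uses only the Python standard library. One further assumption: Z is divided by sqrt(a^2 + 1) before it is rescaled, so that the requested standard deviation is approximately what you get.

```python
import math
import random


def standardize(values):
    """Rescale values to sample mean 0 and sample sd 1 (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlated_sample(r, n, x_mean=0.0, x_sd=1.0, z_mean=0.0, z_sd=1.0, seed=None):
    """Return lists (X, Z) drawn so that the population correlation is r."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    rng = random.Random(seed)

    # Two independent normal variables, standardized in the sample.
    x = standardize([rng.gauss(0.0, 1.0) for _ in range(n)])
    y = standardize([rng.gauss(0.0, 1.0) for _ in range(n)])

    # a = r / sqrt(1 - r^2), then Z = a*X + Y.
    a = r / math.sqrt(1.0 - r * r)
    z = [a * xi + yi for xi, yi in zip(x, y)]

    # Z has sd of about sqrt(a^2 + 1); divide that out (an assumption of this
    # sketch) before applying New = Old*NewSD + NewMean to both variables.
    z_scale = math.sqrt(a * a + 1.0)
    x_new = [xi * x_sd + x_mean for xi in x]
    z_new = [zi / z_scale * z_sd + z_mean for zi in z]
    return x_new, z_new


if __name__ == "__main__":
    x, z = correlated_sample(0.5, 1000, x_mean=100.0, x_sd=15.0, seed=42)
    # The sample r should be near 0.5, but will not equal it exactly.
    print(round(pearson_r(x, z), 3))
```

The choice of a can be checked by hand. If X and Y are uncorrelated with variance 1, then cov(X, Z) = a and var(Z) = a^2 + 1 = 1/(1 - r^2), so the correlation between X and Z is a/sqrt(a^2 + 1) = r. In any one sample the observed correlation will differ somewhat from r, because the sample correlation between X and Y is rarely exactly zero.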

Send mail to: David.Howell@uvm.edu