The following is a reply to a query on how to generate correlated data. It was sent by David Nichols, at SPSS, and represents the clearest statement I have seen in answer to this frequently asked question. I have copied his reply and am making it available here simply to save you having to search the archives of the edstat-l list. Don't give up when you come across "upper triangular Cholesky decomposition"; read on.

I have included a specific implementation of this idea (written for SPSS), which you can look at to see what's going on. It generates data from a population where all the pairwise correlations are .50.

Date: Fri, 20 Sep 1996 20:44:10 -0400

Reply-To: nichols@spss.com

Originator: edstat-l@jse.stat.ncsu.edu

Sender: edstat-l@jse.stat.ncsu.edu

Precedence: bulk

From: nichols@spss.com (David Nichols)

To: Multiple recipients of list <edstat-l@jse.stat.ncsu.edu>

Subject: Re: generating correlated numbers

X-Comment: Statistics Education Discussion

In article <Santosh_Kumar-2009961451350001@medusa.cog.brown.edu>, Santosh Kumar <Santosh_Kumar@brown.edu> wrote:

hi: i need to generate numbers for two variables that have a particular
correlation coefficient r. Is there an easy way to do this? (preferably
using matlab, datadesk or spss).

To which David Nichols responded:

Do you mean that you want two variables created as if they were
sampled from a population with a given correlation, or such that
they have that precise value in the sample? Either case can be
handled. A general way to do this is to begin with (pseudo) random
numbers and use the following property: for a set of variables that
are uncorrelated in the sample, or uncorrelated in the population (as
independent random numbers would be), a given correlation matrix can
be imposed by postmultiplying the data matrix X by the upper triangular
Cholesky decomposition of the correlation matrix R. For the case of two
variables, this has a simple scalar solution that can easily be
done in SPSS without having to deal with the MATRIX procedure.
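
For readers who want to see where the scalar solution comes from, here is a short worked version for two variables. The symbol U for the upper triangular factor is introduced here for illustration and does not appear in the original post.

```latex
R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \qquad
U = \begin{pmatrix} 1 & r \\ 0 & \sqrt{1-r^{2}} \end{pmatrix}, \qquad
U^{\top} U = \begin{pmatrix} 1 & r \\ r & r^{2} + (1-r^{2}) \end{pmatrix} = R

\begin{pmatrix} X & Y \end{pmatrix} U
  = \begin{pmatrix} X & \; rX + \sqrt{1-r^{2}}\, Y \end{pmatrix}
```

The first column is left unchanged and the second column becomes a weighted sum of X and Y, so only the second variable has to be recomputed.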

Start with two variables created using the (pseudo) random normal
option. If you want the "drawn from a population with correlation
of r" version, skip the next step.

For a sample correlation of exactly r, take the two variables
and run them through the FACTOR procedure, using a PC (principal
components) extraction method, extracting both components, and
saving the scores to the data file. These saved scores would be
uncorrelated in the sample.
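
As a rough illustration of this step outside SPSS, the sketch below produces two columns that are exactly uncorrelated in the sample. It is not the FACTOR procedure: it assumes that regression residuals (a Gram-Schmidt step) are an acceptable stand-in for principal component scores, and it standardizes both columns because the formula that follows relies on equal variances. The function names and the seed are hypothetical.

```python
import math
import random


def standardize(values):
    """Center to mean 0 and scale to sample variance 1 (n - 1 denominator)."""
    m = sum(values) / len(values)
    centered = [v - m for v in values]
    sd = math.sqrt(sum(c * c for c in centered) / (len(values) - 1))
    return [c / sd for c in centered]


def sample_corr(a, b):
    """Pearson correlation of two equal-length lists."""
    za, zb = standardize(a), standardize(b)
    return sum(p * q for p, q in zip(za, zb)) / (len(a) - 1)


def decorrelate(x, y):
    """Return (x*, y*) with unit variance and sample correlation 0.

    Assumption: regression residuals (Gram-Schmidt) stand in for the
    principal component scores that FACTOR would save.
    """
    zx, zy = standardize(x), standardize(y)
    slope = sum(p * q for p, q in zip(zx, zy)) / sum(p * p for p in zx)
    residual = [q - slope * p for p, q in zip(zx, zy)]
    return zx, standardize(residual)


rng = random.Random(1996)  # arbitrary seed, for repeatability only
x = [rng.gauss(0.0, 1.0) for _ in range(50)]
y = [rng.gauss(0.0, 1.0) for _ in range(50)]
x0, y0 = decorrelate(x, y)
print(round(sample_corr(x0, y0), 10))  # zero up to rounding error
```

Unlike principal components, this variant leaves the first variable untouched apart from standardizing it, which may or may not matter for a given application.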

Let's say the two variables have been named (in either case above)
X and Y. To create the desired correlation, create a new Y as:

COMPUTE Y=X*r+Y*SQRT(1-r**2).

where r is the desired correlation value. X and Y will now have
either exactly the desired correlation or, if you skipped the
FACTOR step, a sample correlation whose distribution over many
repetitions will be centered on r.
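
The two cases can be checked with a small sketch in plain Python. This is only an illustration of the formula above, not SPSS code; the target value 0.5, the seed, the sample sizes and the helper names are all assumptions made for the example.

```python
import math
import random


def impose_correlation(x, y, r):
    """Return the new Y from the post's formula: X*r + Y*sqrt(1 - r**2)."""
    weight = math.sqrt(1.0 - r * r)
    return [r * xi + weight * yi for xi, yi in zip(x, y)]


def sample_corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)


r = 0.5  # target correlation, chosen for illustration

# Case 1: inputs exactly uncorrelated in the sample, with equal variances.
x_exact = [1.0, -1.0, 1.0, -1.0]
y_exact = [1.0, 1.0, -1.0, -1.0]
exact = sample_corr(x_exact, impose_correlation(x_exact, y_exact, r))

# Case 2: independent pseudo-random normals; each sample r differs,
# and the values scatter around the target.
rng = random.Random(20)  # arbitrary seed
results = []
for _ in range(200):
    x = [rng.gauss(0.0, 1.0) for _ in range(100)]
    y = [rng.gauss(0.0, 1.0) for _ in range(100)]
    results.append(sample_corr(x, impose_correlation(x, y, r)))
average = sum(results) / len(results)

print(exact, average)
```

In the first case the sample correlation equals r up to rounding; in the second, individual samples miss r but their average lands close to it.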

The more general version of this simply requires a matrix of variables
X to be postmultiplied by the Cholesky decomposition of R, the
desired correlation matrix. Assuming variables A to Z in an SPSS
data file, use

MATRIX.
GET X /VAR=A TO Z.
COMPUTE R={ }.
COMPUTE NEWX=X*CHOL(R).
END MATRIX.

where inside the curly brackets you define the structure of R.
NEWX can then be saved to a file if desired.
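
For readers without SPSS, the same idea can be sketched in plain Python. The helper names `chol_upper` and `matmul` are hypothetical, the three-variable R with all correlations equal to .50 is borrowed from the example mentioned at the top of this page, and the input is assumed to be independent standard normal columns.

```python
import math
import random


def chol_upper(r_matrix):
    """Upper triangular U with U'U = r_matrix (symmetric positive definite)."""
    n = len(r_matrix)
    u = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = r_matrix[i][j] - sum(u[k][i] * u[k][j] for k in range(i))
            if i == j:
                u[i][j] = math.sqrt(s)  # fails if the matrix is not positive definite
            else:
                u[i][j] = s / u[i][i]
    return u


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Desired correlation matrix: all pairwise correlations .50.
R = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
U = chol_upper(R)

rng = random.Random(3)  # arbitrary seed
X = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(1000)]
NEWX = matmul(X, U)  # postmultiply the data matrix by the Cholesky factor
```

Because X here is only uncorrelated in the population, the sample correlations of NEWX will be near .50 rather than exactly .50.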

-----------------------------------------------------------------------------

David Nichols Senior Support Statistician SPSS, Inc.

Phone: (312) 329-3684 Internet: nichols@spss.com Fax: (312) 329-3668

-----------------------------------------------------------------------------