Lab 8
Cluster Analysis of Geographic Data
NR245
- Open up S-Plus. Open the object browser with the object browser button.
Load the BG_GF_CENSUS2 table you last used in Lab 6.
- Now we’ll try a simple cluster analysis. Click Statistics>>Cluster
Analysis>>K means. You should now get an interface where you can choose
the parameters for the model. We’ll start with a cluster analysis based on
four land cover variables: percent tree (P.coarse.veg), percent grass
(p.fine.veg), percent pavement (p.pavement) and stream density (H2Odens).
Select them by control-clicking the variables you want. Set the number of
clusters to 3 and keep Max iterations at 10. The iterations refer to the
way k-means classifies observations into groups: it calculates the
centroid of each group, assigns each observation to the group with the
nearest centroid (measured by n-dimensional distance), then recalculates
the centroids. It must do this iteratively because it can’t know where the
centroid of a group is until it knows who’s in the group, which is a bit
of a paradox. So it calculates the centroids from the current group
membership (which might start off being quite wrong), reassigns
observations to groups based on the new centroids, and so on. Later on you
can play with changing this parameter. Until group membership stops
changing, more iterations give more refined membership estimates, which
can become important when you have many dimensions. Now, from the model
output (which should be in a report), say how many observations there are
in each cluster. Also paste the small table showing mean value by cluster
of percent forest and med density residential into your results. Does it
look like each cluster really has a very different combination of values?
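The assign-and-recalculate loop described above can be sketched in a few lines of Python. This is only an illustration of the general k-means idea, not the S-Plus implementation: the function name `kmeans`, the random choice of starting centroids and the made-up `toy_rows` values are all assumptions for the example.

```python
import math
import random


def kmeans(points, k, max_iter=10, seed=0):
    """Illustrative k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumption: start from k observations picked at random as centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each observation joins the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # membership stopped changing, so centroids will not move
        labels = new_labels
        # Update step: each centroid becomes the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Made-up rows: (pct_tree, pct_grass, pct_pavement, stream_density).
toy_rows = [
    (60.0, 20.0, 5.0, 1.2),
    (55.0, 25.0, 8.0, 1.0),
    (10.0, 15.0, 60.0, 0.1),
    (12.0, 10.0, 65.0, 0.2),
    (30.0, 50.0, 10.0, 0.5),
    (28.0, 55.0, 12.0, 0.4),
]
toy_labels, toy_centroids = kmeans(toy_rows, k=3)
print(toy_labels)
```

Raising `max_iter` only matters when the loop has not already stopped on its own, which is why a small value such as 10 is often enough for a handful of variables.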
Steps 3 and 4 are optional (they are for those who can’t do the SPSS steps
below) and count for extra credit. Step 5 just points you to information on
Partitioning Around Medoids and is for your information only.
- Now try some different combinations of k-means and store the results.
This time, each time you run one, check the “cluster membership” option
under the Results tab and choose to “save in” your BG_GF_CENSUS2 table.
First try the same variables as in the last step but with 6 clusters. Then
try using a mix of land cover and socio-economic variables instead of just
land cover: choose p.coarse.veg, p.pavement, MED.HH.INC, P.SFDH (percent
single family detached homes) and MED.YR.ALL (median year of housing
construction). Try this with 3 and 6 clusters. As you do these, keep notes
on what you did for each clustering and the order you did them in; you’ll
need that for the next step.
- Now look at the table for your data in S-Plus. You should see a bunch of
new columns at the end for cluster membership. Right click on each column
heading and rename it based on the order in which you ran the analyses
(the first ones you did are to the left, the last ones to the right). Give
the field headings names that let you recognize them, for instance
K3SocEco for a k-means 3-cluster analysis with both socio-economic and
ecological variables. Then go to export data>>to file. Under the filter
tab choose “preview column names” and select only your cluster membership
columns and the BKG_KEY field. Then output it to a new dbf table. Close
S-Plus and open ArcMap. Load up your BGGFCensus layer and do a tabular
join with the new dbf table. Then plot out three of the different
clusterings by block group using categorical mapping and screencapture
each, describing in the caption what variables were used to create each
clustering.
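Conceptually, the tabular join attaches the exported cluster columns to each block group by matching on the key field. The sketch below shows that idea with plain Python dictionaries; it is not how ArcMap implements joins, and the function name `join_by_key` and all key and attribute values are made up for illustration.

```python
def join_by_key(base_rows, extra_rows, key):
    """Left join: keep every base row and attach fields from the matching extra row."""
    lookup = {row[key]: row for row in extra_rows}
    joined = []
    for row in base_rows:
        merged = dict(row)
        # Rows with no match in the extra table simply gain no new fields.
        for field, value in lookup.get(row[key], {}).items():
            if field != key:
                merged[field] = value
        joined.append(merged)
    return joined


# Hypothetical block groups and exported cluster memberships.
block_groups = [
    {"BKG_KEY": "A1", "pop": 100},
    {"BKG_KEY": "B2", "pop": 250},
]
clusters = [
    {"BKG_KEY": "B2", "K3SocEco": 3},
    {"BKG_KEY": "A1", "K3SocEco": 1},
]
print(join_by_key(block_groups, clusters, "BKG_KEY"))
```

The row order of the two tables does not matter, which is why only the key field and the cluster columns need to be exported.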
- In the past we used the Partitioning Around Medoids (PAM) function in
S-Plus for this exercise. However, as of release 8.0, the Silhouette plot
in PAM (the most important output) has an error and doesn’t work. To learn
how PAM worked, see the old part of the lab. Instead, we’ll do clustering
in SPSS. First, you’ll export your BG_GF_Census2 file to SPSS format.
Click file>>export data>>to file and save that file as file type “SPSS
data file - Windows OS”.
- Now we’ll do clustering in SPSS using their snazzy “two step cluster
analysis.” This method has several advantages over k-means: 1) it can
create clusters based on categorical and numeric data; 2) it can
automatically select the number of clusters based on optimization if you
want; 3) it is efficient with large data sets; 4) it uses multi-model
inferential statistics (AIC or BIC) to select the best number of clusters.
To use it, open SPSS. Then go to file>>open>>data and browse to the file
you just saved. On the resulting screen go to
Analyze>>Classify>>TwoStep Cluster. Note that very good support on using
clustering in SPSS can be found at http://www2.chass.ncsu.edu/garson/PA765/cluster.htm.
- Let’s start with a simple clustering: cluster based on percent coarse
veg, percent fine veg, and income by block group. Shift click those
variables in the left window and click the right arrow to bring them into
the “continuous variables” box. Check Log-likelihood as the distance
measure, choose “determine automatically” as the number of clusters (15
max) and choose Akaike’s Information Criterion (AIC) under the clustering
criterion (AIC is less conservative than BIC and will generally result in
more classes). It should look something like:

Then click “output” and check the first three boxes under “statistics”. Hit
continue and then click “plots”. Check “within cluster percentage chart”,
“cluster pie chart” and “rank variable importance”. Under Rank variables check
the “by cluster” radio button and under the importance measure check the
“chi-square or t-test” radio button. Then check the confidence level box (it
should say 95 under percentage). Click continue and then click OK at the
original interface.
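To see why AIC tends to allow more clusters than BIC, it helps to compare the textbook definitions of the two criteria. The sketch below uses the standard formulas; the exact quantities SPSS plugs in for the two step procedure may differ, and the log-likelihood, parameter count and sample size in the example are made up.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: each parameter costs a flat 2."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: each parameter costs ln(n_obs)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# Made-up fit: same log-likelihood and parameter count under both criteria.
example_aic = aic(log_likelihood=-100.0, n_params=5)
example_bic = bic(log_likelihood=-100.0, n_params=5, n_obs=100)
print(example_aic, example_bic)
```

Because ln(n) is larger than 2 once there are 8 or more observations, BIC charges more for each extra parameter, so adding clusters has to improve the fit by more before BIC prefers the larger model.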
- Now let’s look at the output. First, look at the auto-clustering table.
This is how it chooses the optimum number of clusters. Basically it’s
looking for the number that results in the best combination of low (but
not necessarily the lowest) AIC, a high ratio of distance measures and a
high ratio of AIC changes. If you look at the next table (cluster
distribution) it will show you that four clusters were chosen. Going back
to the first table, report what the AIC, ratio of AIC changes and ratio of
distance measures were for the four-cluster model. Next look at the
“cluster distribution” table and report how many observations there were
per cluster. Then check out the cluster profiles table. The first thing
you might notice is that the standard deviation numbers appear “starred
out.” This is because there are too many decimal places. So, we’ll need to
edit the table a little. Double click on the box surrounding the table.
Then control-click all the standard deviation cells so they are
highlighted like this.

Then click format>>cell properties and under decimals choose 3. Now you should
have a table with all the numbers showing. Take a screencapture (note that in
SPSS you don’t need to use Snagit to screencapture; just right click on an
object, click “copy” and then paste into Word) and interpret what the means and
standard deviations are telling you in general terms for each cluster. Based on
this, report and discuss the difference between cluster 1 and cluster 4. In
what way are they distinctly different? Next, screencapture the simultaneous
95% confidence interval for means plot for P.fine.veg. Interpret what this plot
is saying about what is the same and what is different among clusters, and how
you know it from the plot. Finally, look at the clusterwise importance graphs.
Screencapture the one for P.fine.veg and interpret it. What do the dotted lines
signify and which values cross them? What is the significance of a negative vs.
positive significant t statistic? Note also that you should now have the
cluster membership for each observation stored in the table as the last column,
called something like “TSC_XXXX.” You can change that name by clicking on the
“variable view” tab at the lower left and scrolling down to the last variable.
Call it cluster1. Then go back to the “data view” tab.
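One way to think about the sign of the t statistic in the importance graphs is as a comparison of a cluster’s mean with the overall mean, scaled by the cluster’s standard error. The sketch below assumes that form, which is a simplification and may not match SPSS’s exact calculation; the function name and the data values are made up.

```python
import math
import statistics


def cluster_t_statistic(cluster_values, all_values):
    """t = (cluster mean - overall mean) / (cluster sd / sqrt(cluster size))."""
    cluster_mean = statistics.mean(cluster_values)
    overall_mean = statistics.mean(all_values)
    std_err = statistics.stdev(cluster_values) / math.sqrt(len(cluster_values))
    return (cluster_mean - overall_mean) / std_err


# Made-up percent fine vegetation values for two clusters.
low_cluster = [1.0, 2.0, 3.0]
high_cluster = [7.0, 8.0, 9.0]
everything = low_cluster + high_cluster
print(cluster_t_statistic(low_cluster, everything))
print(cluster_t_statistic(high_cluster, everything))
```

Under this reading, a negative value means the cluster sits below the overall average for that variable and a positive value means it sits above, while the distance from zero reflects how clearly it differs.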
- Now we’ll add some more variables and see what happens. Do another two
step cluster analysis. Do everything the same except change the variables.
Get rid of p.fine.veg, but keep MED.HH.INC and P.coarse.veg. Also add
P.SFDH, d2down, and Robb05. Then add “DESC.15” (that’s PRIZM 15 with
names) to the “categorical variable” list, so that we’ll be clustering
based on both continuous and categorical variables. Under “plots” keep
everything the same, except change “rank variables” from “by cluster” to
“by variable.” Now look at the output. As you’ll see, the number of
clusters is five. Report what the ratio of distance measures and ratio of
AIC changes are for that number of clusters (see the auto-clustering
table). Next report what percentage of observations is in each class
(cluster distribution table). Next look at the cluster profiles table.
Again change the number of decimal places for the standard deviation cells
so you can see them instead of stars. However, you’ll note that you can’t
really see all the variables there because the table is so wide, so you’ll
need to pivot it. Double click on the frame around the table. That should
bring up a new window that says SPSS Pivot Table. Then click
Pivot>>transpose rows and columns, then File>>close. (If that window
doesn’t open when you double click, then double click on the table frame
and from the menu above it click Pivot>>transpose rows and columns.) Now
screencapture that cluster profile table. Use this table to interpret the
difference between clusters 2 and 5. You’ll note that they have very
similar incomes but are different in many other ways. Explain those
differences. Next, you’ll see a Frequencies table that is specific to the
PRIZM 15 classes. Pivot it, screencapture it and explain what it’s saying.
Next scroll down to the “within cluster percentage” plot. Screencapture it
and explain what each bar means. Based on this graph alone (using the
names in the legend for guidance), what kind of areas would you say are
characterized by cluster 1? Scroll down now to the “Within cluster
variation” graph for P.coarse.veg and screencapture it. Describe which
cluster has the highest coarse veg levels and which has the lowest. You
can basically ignore the “categorical variablewise importance” plots and
scroll down to the “Continuous Variablewise Importance” plots. Look at the
plot for Cluster 1 (note that the last time these were organized by
variable and not cluster). Screencapture it and describe what this plot
seems to say about the characteristics of cluster 1. Make sure to discuss
the relevance of negative versus positive t statistics. Does your
assessment of cluster 1 appear consistent with what the PRIZM results
above suggested about cluster 1? Then go down next to cluster 2. Report
which variables are statistically significant and which are not, and how
you know that. Based on what you’ve seen in this output, come up with a
name (kind of like the PRIZM names) for each of your clusters.
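The numbers in the cluster distribution and cluster profiles tables are simple per-cluster summaries: a count, a percentage of all observations, and a mean and standard deviation for each continuous variable. The sketch below computes them for one variable; the function name `cluster_profiles` and the income values are made up, and it uses the sample standard deviation, which is an assumption about what the SPSS table reports.

```python
import statistics
from collections import defaultdict


def cluster_profiles(values, labels):
    """Summarize one continuous variable by cluster membership."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    total = len(values)
    return {
        label: {
            "n": len(members),
            "percent": 100.0 * len(members) / total,
            "mean": statistics.mean(members),
            # A standard deviation needs at least two observations.
            "sd": statistics.stdev(members) if len(members) > 1 else float("nan"),
        }
        for label, members in sorted(groups.items())
    }


# Made-up median household incomes (thousands) and cluster memberships.
incomes = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
membership = [1, 1, 1, 1, 2, 2]
for label, profile in cluster_profiles(incomes, membership).items():
    print(label, profile)
```

Two clusters can share nearly the same mean on one variable, as clusters 2 and 5 do for income, and still differ clearly on the others, which is why the profile table is read one variable at a time.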
- Now export the table to your geodatabase. From the table interface (not
the results interface) click File>>save as. Choose the output type as
dBase IV (dbf). Then click on the variables button. Click drop all, then
check BKG.KEY, cluster1 and cluster2. Click continue and then save. Open
ArcMap and load up your latest BG_GF_Census layer. Then load the dbf you
just saved and do a tabular join using the BKG_KEY field. First plot out
and screencapture (using categorical mapping) the cluster 1 variable. Then
do the same for the cluster 2 variable, but this time, in the symbology
window, type the name you gave each class under the “label” field. Then
screencapture the map with the legend.
- Assemble your text and screencaptures and make a PDF to upload.