/* There is often a need to turn values of continuous data into categories.
   SAS has many ways to do this, and the examples below show a few of them.
   Our data consist of a person ID, sex, age, and income, where age and
   income are continuous variables that we want to categorize. */

options ls=90;

filename foo url "http://www.uvm.edu/~abh/stat295/datasets/ages-incomes.dat";

data ageincome;
   infile foo;
   input id sex $ age income comma7. ;
run;

proc print;
   title "Ages and incomes";
   format income dollar9. ;
run;

/* Categorizing data usually involves testing for a specific value or range
   of values, then assigning a value to a new variable based on the result
   of the test. Another method is to use user-defined formats to create new
   variables from the formatted values of an existing variable. A third is
   a clever use of the results of logical expressions. */

data categories;
   set ageincome;

   /* Here we test for specific ranges of age and income values and make
      assignments to new variables. */
   length agegroup $ 18 incgroup $ 6 ;

   if age lt 40 then agegroup = " Less than 40";
   else if (40 le age lt 65) then agegroup = "40 to less than 65";
   else if age ge 65 then agegroup = "65 and over";

   if income <= 30000 then incgroup = "Lower";
   else if (30000 < income <= 80000) then incgroup = "Middle";
   else if income > 80000 then incgroup = "Upper";
run;

/* Perhaps you noticed the space before the value " Less than 40" and are
   wondering why it is there. SAS orders the values of character variables
   by their sort sequence, and the space character comes before any digits
   in that sequence. Without the leading space in the value, SAS would
   order the "Less than 40" group after the "65 and over" group. */

proc print;
   title "Categorizing ages and incomes";
   format income dollar9. ;
run;

/* We can instead create formats for age and income that contain the same
   text as in the assignment statements above, then associate these formats
   with the age and income variables.
   At the same time, let's create formats for numeric categories that will
   be used in other examples below. */

proc format;
   value agefmt  low -< 40     = " Less than 40"
                 40  -< 65     = "40 to less than 65"
                 65  -  high   = "65 and over";

   value incfmt  low   -  30000 = "Lower"
                 30000 <- 80000 = "Middle"
                 80000 <- high  = "Upper";

   value age2fmt 1 = " Less than 40"
                 2 = "40 to less than 65"
                 3 = "65 and over";

   value inc2fmt 1 = "Lower"
                 2 = "Middle"
                 3 = "Upper";
run;

proc print data=ageincome;
   format age agefmt. income incfmt. ;
   title "Categorizing using a pre-defined format";
run;

/* While it appears that we have categorized age and income by using a
   format, the stored values of AGE and INCOME have not changed. If we were
   to use AGE and INCOME in most procedures, their stored values would be
   used in any calculations. Procedures that group observations, such as
   FREQ, form their groups from the formatted values whenever a format is
   associated with the variable; the order= option controls only the order
   in which those groups are listed. Compare the results of the following
   two FREQ procedures. The first has no category format on age or income,
   so the table has a row for each distinct value. The second associates
   agefmt. and incfmt. with the variables, so the counts are by category,
   and order=formatted lists the categories in sorted order of their
   formatted values. */

proc freq data=ageincome;
   table age income;
   format income dollar9. ;
   title "Frequency counts for age and income";
run;

proc freq data=ageincome order=formatted;
   table age income;
   format age agefmt. income incfmt. ;
run;

/* Although associating a format with a variable is a handy way to have a
   procedure work with formatted values, we can also create a new character
   variable by applying the previously created format with the PUT
   function. */

data ageincome2;
   set ageincome;
   agegroup = put(age,agefmt.);
   incgroup = put(income,incfmt.);
run;

proc print data=ageincome2;
   format income dollar9. ;
   title "Categorizing age and income by creating new variables from pre-defined formats";
run;

/* It is also possible to categorize variables in a single assignment
   statement by using logical expressions.
   Logical expressions evaluate to either 0 or 1, so multiplying each
   expression by a category number and summing the products yields a
   numeric category that can then be formatted, if desired. The following
   code creates numeric category variables for both age and income, ranging
   from 1 to 3, which can be associated with the formats created in the
   FORMAT procedure above. */

data ageincome3;
   set ageincome;
   agegroup = 1*(age lt 40) + 2*(40 le age lt 65) + 3*(age ge 65);
   incgroup = 1*(income <= 30000) + 2*(30000 < income <= 80000)
            + 3*(income > 80000);
run;

/* Print the numeric category codes as stored. */
proc print;
   format income dollar9. ;
   title "Categories created by logical expressions";
run;

/* Print again with the category formats applied to the codes. */
proc print;
   title "Categories created by logical expressions";
   format agegroup age2fmt. incgroup inc2fmt. income dollar9. ;
run;
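
/* A possible refinement, offered as a sketch and not as part of the
   examples above. SAS treats a missing numeric value as smaller than any
   number, so (age lt 40) is true when age is missing, and the logical
   expressions would place such a person in category 1. Whether this data
   set contains missing values is not known here. The data set name
   AGEINCOME4 and the MISSING() guards are assumptions added for
   illustration. */

data ageincome4;
   set ageincome;

   /* Leave the category missing when the source value is missing. */
   if missing(age) then agegroup = .;
   else agegroup = 1*(age lt 40) + 2*(40 le age lt 65) + 3*(age ge 65);

   if missing(income) then incgroup = .;
   else incgroup = 1*(income <= 30000) + 2*(30000 < income <= 80000)
                 + 3*(income > 80000);
run;

/* One way to check the cut points: the minimum and maximum age within
   each category should fall inside the intended range. */
proc means data=ageincome4 n min max;
   class agegroup;
   var age;
   format agegroup age2fmt. ;
   title "Checking age categories against the original ages";
run;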