/* Reading from external data files using INFILE and INPUT statements While most people are introduced to inputting data into SAS by including the data in the SAS program, it is more common to have data in a pre-existing file external to SAS and to tell SAS where and how to read this data file. The "where" is done via the INFILE statement. The "how" is done via the INPUT statement which you are already familiar with. The INPUT statement is potentially the most complicated statement in the entire SAS language. It is the heart of SAS's ability to read just about any data file of any format in combination with the INFILE statement, and as we will see below, the INPUT statement can be remarkably simple or it can extremely complex. SAS reads data by using a "pointer" and the INPUT statement is the pointer control command. In the example below, the INPUT statement reads data via "list input" which is the simplest form of input. This assumes that the data file contains unaligned data, with adjacent values separated by at least one delimiter, usually a single space, but the delimiter can also be a tab character or comma. This form of the INPUT statement often works with data that is aligned in columns. */ data sept17_1; infile "Z:\public_html\stat295\datasets\sept17_1.dat"; input id age height weight; run; proc print data=sept17_1; title "Simple use of INFILE and INPUT statements"; run; /* When a data file is accessible over the Internet, we can use a FILENAME statement in addition to an INFILE statement to read data from that file over the net. The example below reads the same data file as before but uses the URL access method to read it. The FILENAME statement is to files as the LIBNAME statement is to directories. It defines a "nickname" for an existing file. Instead of specifying the actual file name in the INFILE statement, you specify the "nickname" for the file given by the FILENAME statement. If you are using SAS version 8, you must to use this FILENAME statement instead of the one in the example below. filename foo1 URL "http://www.uvm.edu:80/~abh/stat295/datasets/sept17_1.dat"; */ filename foo1 URL "http://www.uvm.edu/~abh/stat295/datasets/sept17_1.dat"; data sept17_1; infile foo1; input id age height weight; run; proc print data=sept17_1; title "Simple use of FILENAME URL access method, INFILE and INPUT statements"; run; /* In the examples above, the INPUT statement uses list input, which just means that all we specify are the names of the variables to be read. After the first variable's value is read, the pointer is positioned in the column following this value. Assuming there is at least one delimiter between values, the pointer keeps moving until it finds the next variable's value. List input works well for unaligned numeric data that have a delimiter between adjacent values. If the data file contains character data as well as numeric data, list input may not work as we wish. The example below reads a data file that contains character data so the INPUT statement uses the $ format after the variable's name to indicate that it is contains character data. However, when using list input with delimited character data, we can get unexpected results. */ filename foo2 URL "http://www.uvm.edu/~abh/stat295/datasets/sept17_2.dat"; data sept17_2; infile foo2; input id name $ age height weight; run; proc print data=sept17_2; title "Reading unaligned character data with list input"; run; /* Everything looks pretty good, except for one thing. The name Elizabeth is more than 8 characters, and using the $ format tells SAS to read up to a maximum of 8 characters. The rest of the value gets truncated. So now it looks like we will need a combination of list and formatted input in order to tell SAS that the NAME variable can be a maximum of 9 characters instead of 8. */ data sept17_2; infile foo2; input id name $9. age height weight; run; proc print data=sept17_2; title "Reading unaligned character data with list and formatted input"; run; /* Well, we can see from the log that there is a problem since we got the "SAS went to a new line when INPUT statement reached past the end of a line" note and this should not have happened. And we can see from the PROC PRINT output that things are pretty messed up. The problem is that when we specified the $9. format for the name variable, we told SAS to read 9 columns of data no matter what, delimiters be damned. So we need to be extra careful when combining list and formatted input. It turns out that we can fix this problem by adding a single character; the : format modifier. The colon in front of the format tells SAS to read up to a maximum of 9 characters, but stop reading when you come to a delimiter. */ data sept17_2; infile foo2; input id name :$9. age height weight; run; proc print data=sept17_2; title "Reading unaligned character data with list and formatted input"; title2 "Adding the : format modifier"; run; /* The examples above all read unaligned data; that is, the values of the variables are not organized into the same columns on each line. We have more options available to us when the data file is better organized. Consider the following example, which now includes a last name as well as a first name and the values are all column- oriented. We can use column input or formatted input or a combination of list, column, and formatted input to read in the data. */ filename foo3 URL "http://www.uvm.edu/~abh/stat295/datasets/sept17_3.dat"; data sept17_3; infile foo3; input id 1-2 name $ 4-19 age 21-22 height 24-25 weight 27-29; run; proc print data=sept17_3; title "Column input"; run; /* What if a last name was added to the unaligned data? How difficult would it be to read the name variable? It turns out that this is pretty hard to do unless you either read two variable (first name and last name), or if the data file uses different delimiters, or if there are two spaces separating adjacent values. The next example reads unaligned data with a first name and last name variable. */ filename foo4 URL "http://www.uvm.edu/~abh/stat295/datasets/sept17_4.dat"; data sept17_4; infile foo4; input id first_name:$9. last_name:$7. age height weight; run; proc print data=sept17_4; title "Reading unaligned character data"; run; /* If the unaligned data used a comma instead of a space as a delimiter, we could specify that in the INFILE statement a couple of ways. One way is to explicitly define what the delimiter is by using the DELIMITER= option of the INFILE statement. The other way is to use the DSD (delimiter sensitive data) of the INFILE statement. What's the difference between these two methods? Using the DELIMITER option to specify a comma as a delimiter means that two consecutive commas is treated just like two consecutive spaces would be treated in our other examples. Using the DSD option, SAS treats two consecutive commas as having a missing value between them. It also removes any quotation marks from around data values and reads them as character data. */ filename foo5 URL "http://www.uvm.edu/~abh/stat295/datasets/sept17_5.dat"; data sept17_5; infile foo5 delimiter=','; input id name:$16. age height weight; run; proc print data=sept17_5; title "Reading comma delimited data with the delimiter= option"; run; data sept17_5; infile foo5 dsd; input id name:$16. age height weight; run; proc print data=sept17_5; title "Reading comma delimited data with the dsd option"; run; /* In order to read embedded blanks within values (such as the space between first and last names), two or more consecutive spaces must follow the variable with the embedded blank. We also must add the & format modifier to tell SAS that the name variable contains embedded blanks. */ filename foo6 URL "http://www.uvm.edu/~abh/stat295/datasets/sept17_6.dat"; data sept17_6; infile foo6; input id name:&$16. age height weight; run; proc print data=sept17_6; title "Reading unaligned data with the & format modifier"; run; /* One of the previous examples used explicit column numbers to read data aligned by column. There are other ways to do the same thing, one of which is to use pointer controls. Recall that the INPUT statement reads data from a file via a pointer and you have complete control of that pointer. We can tell SAS which column to start reading in or even read the variables in a different order. */ data sept17_3; infile foo3; input @4 name $16. @27 weight @21 age @1 id @24 height; run; proc print data=sept17_3; title "Using absolute pointer controls to read aligned data"; run; /* Using the @n style of pointer control, gives the explicit column number to move the pointer to. The +n style of pointer control moves the pointer n columns relative to its current position. With this style of pointer control, we can completely skip over a variable and not read it. */ data sept17_3; infile foo3; input @1 id 2. +1 name $16. +4 height 2. +1 weight; run; proc print data=sept17_3; title "Using relative pointer controls to read aligned data"; run; /* There are even pointer control options that allow you to read multiple observations from a single line in the data file. The @@ (double-trailing at-sign) tells SAS to hold the pointer tightly and never move it to the next line of data unless there is no more data on the current line. */ filename foo7 URL "http://www.uvm.edu/~abh/stat295/datasets/sept17_7.dat"; data sept17_7; infile foo7; input id first_name:$9. last_name:$7. age height weight @@; run; proc print data=sept17_7; title "Using pointer controls to read multiple observations per line"; run; /* The log tells us that "SAS went to a new line when INPUT statement reached past the end of a line" but this note is expected this time, because of the use of the @@ on the INPUT statement. */ /* Recall that in several examples, SAS went to the next line of data when it could not find any more on the current one. This behavior is a result of a default option of the INFILE statement called FLOWOVER. The pointer "flows over" to the next line of data. There are several alternate options that change this default behavior, including STOPOVER, MISSOVER, and TRUNCOVER. When the INPUT statement cannot find any more data on the current line (even though there are more variables to read), the STOPOVER option of the INFILE statement will treat this as an error and stop building the data set. The MISSOVER option will set all remaining variables to missing for which the INPUT statement cannot find data on the current line. The TRUNCOVER option will write out whatever data it may have been able to read instead of setting the variables to missing. The only option I have ever used instead of the default FLOWOVER is the MISSOVER option. */ filename foo8 URL "http://www.uvm.edu/~abh/stat295/datasets/sept17_8.dat"; data sept17_8; infile foo8 STOPOVER; input id first_name:$9. last_name:$7. age height weight; run; proc print data=sept17_8; title "Using the STOPOVER option"; run; data sept17_8; infile foo8 MISSOVER; input id first_name:$9. last_name:$7. age height weight; run; proc print data=sept17_8; title "Using the MISSOVER option"; run; /* While the INFILE and INPUT statements are used to read data from an external file, the FILE and PUT statements are used to write data to an external file. The PUT statement can use the same types of pointer controls as the INPUT statement. The use of FILE and PUT statements gives you basic "report writing" capabilities. You can read in your data, use SAS to manipulate it, then write it back out to another file. The following example expands (just a little) on one of the data sets used above. We read in the data, create a new variable, then write it back out. Before we create the "name" variable by concatenating the first and last names, we must define what the maximum length of the variable will be. */ data sept17_4; infile foo4; input id first_name:$9. last_name:$7. age height weight; length name $ 17 ; name = trim(first_name) || " " ||trim(last_name); run; data _null_; set sept17_4; file "Z:\public_html\stat295\datasets\put_example.dat"; put @1 id @5 name @25 age @30 height @35 weight; run;