/*	Reading from external data files using INFILE and INPUT statements

	While most people are introduced to inputting data into SAS by 
	including the data in the SAS program, it is more common to 
	have data in a pre-existing file external to SAS and to tell SAS 
	where and how to read this data file.  The "where" is done via 
	the INFILE statement.  The "how" is done via the INPUT statement 
	which you are already familiar with.  The INPUT statement is 
	potentially the most complicated statement in the entire SAS 
	language.  It is the heart of SAS's ability to read just about 
	any data file of any format in combination with the INFILE 
	statement, and as we will see below, the INPUT statement can be 
	remarkably simple or it can extremely complex.
	
	SAS reads data by using a "pointer" and the INPUT statement is 
	the pointer control command.  In the example below, the INPUT 
	statement reads data via "list input" which is the simplest 
	form of input.  This assumes that the data file contains unaligned 
	data, with adjacent values separated by at least one delimiter, 
	usually a single space, but the delimiter can also be a tab 
	character or comma.  This form of the INPUT statement often works 
	with data that is aligned in columns.
*/

data sept17_1;
     infile "Z:\public_html\stat295\datasets\sept17_1.dat";
	 input id age height weight;
run;

proc print data=sept17_1;
title "Simple use of INFILE and INPUT statements";
run;


/*
	When a data file is accessible over the Internet, we can use a 
	FILENAME statement in addition to an INFILE statement to read 
	data from that file over the net.  The example below reads the same 
	data file as before but uses the URL access method to read it.

	The FILENAME statement is to files as the LIBNAME statement is 
	to directories.  It defines a "nickname" for an existing file.  
	Instead of specifying the actual file name in the INFILE 
	statement, you specify the "nickname" for the file given by 
	the FILENAME statement.

	If you are using SAS version 8, you must to use this FILENAME 
	statement instead of the one in the example below.
	filename foo1 URL "http://www.uvm.edu:80/~abh/stat295/datasets/sept17_1.dat";
*/

filename foo1 URL "http://www.uvm.edu/~abh/stat295/datasets/sept17_1.dat";

data sept17_1;
     infile foo1;
	 input id age height weight;
run;

proc print data=sept17_1;
title "Simple use of FILENAME URL access method, INFILE and INPUT statements";
run;


/*
	In the examples above, the INPUT statement uses list input, which just 
	means that all we specify are the names of the variables to be read.  
	After the first variable's value is read, the pointer is positioned in 
	the column following this value.  Assuming there is at least one 
	delimiter between values, the pointer keeps moving until it finds the 
	next variable's value.  List input works well for unaligned numeric data 
	that have a delimiter between adjacent values.  If the data file contains 
	character data as well as numeric data, list input may not work as we wish.

	The example below reads a data file that contains character data so the 
	INPUT statement uses the $ format after the variable's name to indicate 
	that it is contains character data.  However, when using list input with 
	delimited character data, we can get unexpected results.
*/


filename foo2 URL "http://www.uvm.edu/~abh/stat295/datasets/sept17_2.dat";
data sept17_2;
     infile foo2;
	 input id name $ age height weight;
run;

proc print data=sept17_2;
title "Reading unaligned character data with list input";
run;

/*
	Everything looks pretty good, except for one thing.  The name Elizabeth 
	is more than 8 characters, and using the $ format tells SAS to read up 
	to a maximum of 8 characters.  The rest of the value gets truncated.  So 
	now it looks like we will need a combination of list and formatted input 
	in order to tell SAS that the NAME variable can be a maximum of 9 
	characters instead of 8.
*/

data sept17_2;
     infile foo2;
	 input id name $9. age height weight;
run;

proc print data=sept17_2;
title "Reading unaligned character data with list and formatted input";
run;

/*
	Well, we can see from the log that there is a problem since we got the 
	"SAS went to a new line when INPUT statement reached past the end of a 
	line" note and this should not have happened.  And we can see from the 
	PROC PRINT output that things are pretty messed up.  The problem is that 
	when we specified the $9. format for the name variable, we told SAS to 
	read 9 columns of data no matter what, delimiters be damned.  So we need 
	to be extra careful when combining list and formatted input.  It turns out 
	that we can fix this problem by adding a single character; the : format 
	modifier.  The colon in front of the format tells SAS to read up to a 
	maximum of 9 characters, but stop reading when you come to a delimiter.
*/


data sept17_2;
     infile foo2;
	 input id name :$9. age height weight;
run;

proc print data=sept17_2;
title "Reading unaligned character data with list and formatted input";
title2 "Adding the : format modifier";
run;


/*
	The examples above all read unaligned data; that is, the values of 
	the variables are not organized into the same columns on each line.  
	We have more options available to us when the data file is better 
	organized.  Consider the following example, which now includes a 
	last name as well as a first name and the values are all column-
	oriented.  We can use column input or formatted input or a combination 
	of list, column, and formatted input to read in the data.
*/

filename foo3 URL "http://www.uvm.edu/~abh/stat295/datasets/sept17_3.dat";
data sept17_3;
     infile foo3;
	 input id 1-2 name $ 4-19 age 21-22 height 24-25 weight 27-29;
run;

proc print data=sept17_3;
title "Column input";
run;

/*
	What if a last name was added to the unaligned data?  How difficult would 
	it be to read the name variable?  It turns out that this is pretty hard to 
	do unless you either read two variable (first name and last name), or if the 
	data file uses different delimiters, or if there are two spaces separating 
	adjacent values.

	The next example reads unaligned data with a first name and last name variable.
*/

filename foo4 URL "http://www.uvm.edu/~abh/stat295/datasets/sept17_4.dat";
data sept17_4;
     infile foo4;
	 input id first_name:$9. last_name:$7. age height weight;
run;

proc print data=sept17_4;
title "Reading unaligned character data";
run;

/*
	If the unaligned data used a comma instead of a space as a delimiter,
	we could specify that in the INFILE statement a couple of ways.

	One way is to explicitly define what the delimiter is by using the 
	DELIMITER= option of the INFILE statement.  The other way is to use 
	the DSD (delimiter sensitive data) of the INFILE statement.  What's the 
	difference between these two methods?  Using the DELIMITER option to 
	specify a comma as a delimiter means that two consecutive commas is 
	treated just like two consecutive spaces would be treated in our other 
	examples.  Using the DSD option, SAS treats two consecutive commas as 
	having a missing value between them.  It also removes any quotation marks 
	from around data values and reads them as character data.
*/

filename foo5 URL "http://www.uvm.edu/~abh/stat295/datasets/sept17_5.dat";
data sept17_5;
     infile foo5 delimiter=',';
	 input id name:$16. age height weight;
run;

proc print data=sept17_5;
title "Reading comma delimited data with the delimiter= option";
run;

data sept17_5;
     infile foo5 dsd;
	 input id name:$16. age height weight;
run;

proc print data=sept17_5;
title "Reading comma delimited data with the dsd option";
run;

/*
	In order to read embedded blanks within values (such as the 
	space between first and last names), two or more consecutive 
	spaces must follow the variable with the embedded blank.  We 
	also must add the & format modifier to tell SAS that the name 
	variable contains embedded blanks.
*/

filename foo6 URL "http://www.uvm.edu/~abh/stat295/datasets/sept17_6.dat";
data sept17_6;
     infile foo6;
	 input id name:&$16. age height weight;
run;

proc print data=sept17_6;
title "Reading unaligned data with the & format modifier";
run;


/*
	One of the previous examples used explicit column numbers to 
	read data aligned by column.  There are other ways to do the 
	same thing, one of which is to use pointer controls.  Recall 
	that the INPUT statement reads data from a file via a pointer 
	and you have complete control of that pointer.  We can tell 
	SAS which column to start reading in or even read the variables 
	in a different order.
*/

data sept17_3;
     infile foo3;
	 input @4 name $16.  @27 weight  @21 age  @1 id  @24 height;
run;

proc print data=sept17_3;
title "Using absolute pointer controls to read aligned data";
run;


/*
	Using the @n style of pointer control, gives the explicit 
	column number to move the pointer to.  The +n style of 
	pointer control moves the pointer n columns relative to 
	its current position.  With this style of pointer control, 
	we can completely skip over a variable and not read it.
*/

data sept17_3;
     infile foo3;
	 input @1 id 2. +1 name $16. +4 height 2.  +1 weight;
run;

proc print data=sept17_3;
title "Using relative pointer controls to read aligned data";
run;

/*
	There are even pointer control options that allow you to 
	read multiple observations from a single line in the data 
	file.  The @@ (double-trailing at-sign) tells SAS to hold 
	the pointer tightly and never move it to the next line of 
	data unless there is no more data on the current line.
*/

filename foo7 URL "http://www.uvm.edu/~abh/stat295/datasets/sept17_7.dat";
data sept17_7;
     infile foo7;
	 input id first_name:$9. last_name:$7. age height weight @@;
run;

proc print data=sept17_7;
title "Using pointer controls to read multiple observations per line";
run;

/*
	The log tells us that "SAS went to a new line when INPUT 
	statement reached past the end of a line" but this note 
	is expected this time, because of the use of the @@ on 
	the INPUT statement.
*/

/*
	Recall that in several examples, SAS went to the next line 
	of data when it could not find any more on the current one.  
	This behavior is a result of a default option of the INFILE 
	statement called FLOWOVER.  The pointer "flows over" to the 
	next line of data.  There are several alternate options that 
	change this default behavior, including STOPOVER, MISSOVER, 
	and TRUNCOVER.

	When the INPUT statement cannot find any more data on the 
	current line (even though there are more variables to read),
	the STOPOVER option of the INFILE statement will treat this 
	as an error and stop building the data set.

	The MISSOVER option will set all remaining variables to 
	missing for which the INPUT statement cannot find data on 
	the current line.

	The TRUNCOVER option will write out whatever data it may have 
	been able to read instead of setting the variables to missing.

	The only option I have ever used instead of the default FLOWOVER 
	is the MISSOVER option.
*/


filename foo8 URL "http://www.uvm.edu/~abh/stat295/datasets/sept17_8.dat";
data sept17_8;
     infile foo8 STOPOVER;
	 input id first_name:$9. last_name:$7. age height weight;
run;

proc print data=sept17_8;
title "Using the STOPOVER option";
run;

data sept17_8;
     infile foo8 MISSOVER;
	 input id first_name:$9. last_name:$7. age height weight;
run;

proc print data=sept17_8;
title "Using the MISSOVER option";
run;

/*
	While the INFILE and INPUT statements are used to read data from 
	an external file, the FILE and PUT statements are used to write 
	data to an external file.  The PUT statement can use the same 
	types of pointer controls as the INPUT statement.  The use of 
	FILE and PUT statements gives you basic "report writing" 
	capabilities.  You can read in your data, use SAS to manipulate 
	it, then write it back out to another file.

	The following example expands (just a little) on one of the data 
	sets used above.  We read in the data, create a new variable, 
	then write it back out.  Before we create the "name" variable by 
	concatenating the first and last names, we must define what the 
	maximum length of the variable will be.
*/

data sept17_4;
     infile foo4;
     input id first_name:$9. last_name:$7. age height weight;
     length name $ 17 ;
     name = trim(first_name) || " " ||trim(last_name);
run;

data _null_;
     set sept17_4;
     file "Z:\public_html\stat295\datasets\put_example.dat";
     put @1 id  @5 name  @25 age  @30 height  @35 weight;
run;