Subsetting and merging datasets
A subset can be a selection of variables (e.g. if we have many and want to remove the one we do not use in the analysis) or a selection of observations (e.g. if there is a special interest in the observations with a specific feature).
If we want to merge two dataset there are also two possibilities: (i) we have two data sets with the same variables and we want to combine these or (ii) we have two data sets with different variables and we want to combine them using one common variable
Make a subset of observations
We use logical operators to define which observations to include:
data indata4;
set indata3;
if concent >= 13;
run;
or
data indata4;
set indata3;
if concent < 13 then delete;
run;
Make a subset of variables
For keeping or dropping variables we use KEEP and DROP. If we have 5 variables in the data set indata3 (date, type, concent, logconcent, month) and we want to make a data set that only contains type, logconcent and month we use:
data indata4;
set indata3;
drop concent date;
run;
or,
data indata4;
set indata3
keep type logconcent month;
run;
Merge two data sets that have the same variables
Here are two data sets. Read them from the SAS editor with the programs below. In the third data step we combine these datasets to one by setting the second below the first. The format line is not necessary but gives us dates instead of a numeric value in the proc print.
data monday;
input date:yymmdd10. sales;
format date:yymmdd10.;
cards;
2006/10/02 3410
2006/10/09 2300
2006/10/16 1390
2006/10/23 900
2006/10/30 2100
; run;/p>
data tuesday;
input date:yymmdd10. sales;
format date:yymmdd10.;
cards;
2006/10/03 1500
2006/10/10 1900
2006/10/17 2810
2006/10/24 1700
; run;
data mondaytuesday;
set monday tuesday;
run;
Look at the combined data set:
proc print data=mondaytuesday;
run;
Merge two data sets using a common variable:
The merge statement is used when we have a variable that has the same values in both data sets. Create the data set mondaytuesday above and combine it with the data set weather below:
data weather;
input date:yymmdd10. weather $10.;
format date:yymmdd10.;
cards;
2006/10/02 very sunny
2006/10/03 cloudy
2006/10/09 sunny
2006/10/10 sunny
2006/10/16 cloudy
2006/10/17 sunny
2006/10/23 rainy
2006/10/24 cloudy
2006/10/30 cloudy
; run;
The common variable is date, we indicate this using by. Both data sets need to be sorted according to this variable. Use proc sort for this.
data datasetcombined;
merge mondaytuesday weather;
by date;
run;
Use proc print to see the combined dataset.