A common interview question for SAS jobs is "What is the difference
between proc sort nodup and proc sort nodupkey?". The answer
the interviewer is expecting is usually "proc sort nodup gets rid
of duplicate records with the same sort key but proc sort nodupkey
gets rid of other records with the same sort key". However, this is
not correct.
"nodup" is an alias for "noduprecs" which appears to mean "no duplicate
records" but there is no way sas can know about these duplicate records
unless they, by chance, land next to each other in sequence. That is a
matter of chance. Take a look at "nodup" at work. Note the record with
the "extra" value of 3. It's still there after the "nodup" sort.
| data test1;
input id1 $ id2 $ extra ; cards; aa ab 3 aa ab 1 aa ab 2 aa ab 3 ; proc sort nodup data=test1; by id1 id2; run; options nocenter; proc print data=test1; run; Obs id1 id2 extra 1 aa ab
3
|
Now look again where the two records with an "extra" value of 3 are
next to each other in the input dataset. This time it has been removed
by "nodup".
| data test2;
input id1 $ id2 $ extra ; cards; aa ab 3 aa ab 3 aa ab 2 aa ab 1 ; proc sort nodup data=test2; by id1 id2; run; options nocenter; proc print data=test2; run; Obs id1 id2 extra 1 aa ab
3
|
If you sort "nodupkey" then you will only be left with one record
with that key combination in the above case as you can see below.
| data test3;
input id1 $ id2 $ extra ; cards; aa ab 3 aa ab 3 aa ab 2 aa ab 1 ; proc sort nodupkey data=test3; by id1 id2; run; options nocenter; proc print data=test3; run; Obs id1 id2 extra 1 aa ab
3
|
It is a big mistake to think sorting "nodup" will remove duplicate records.
Sometime it will, sometime it won't. The only way you can be sure of removing
duplicate records is to "proc sort nodupkey" and include enough
key variables to be sure you will lose the duplicates you want to lose.
In the case shown above, then if we knew of the same "extra" values being
duplicates we wanted to remove then this variable should be included in
the list of sort variables and then "nodupkey" will remove the duplicates
as shown below.
| data test4;
input id1 $ id2 $ extra ; cards; aa ab 3 aa ab 1 aa ab 2 aa ab 3 ; proc sort nodupkey data=test4; by id1 id2 extra; run; options nocenter; proc print data=test4; run; Obs id1 id2 extra 1 aa ab
1
|
Go back to the home page.
E-mail the macro and web site author.