Deadly errors you can make

[This site is not connected with the SAS Institute]

[last updated - 12 January 2003]

Introduction

If you search the web and past papers about SAS you will find a list of common errors that you can make using SAS. But many of these will cause syntax errors in any case and a lot of them seem to be to do with commenting out sections of code. Let me assure you that this page will be different. The errors I will list here will be very few in number and they are errors that you will make in your career. And errors so serious that it might shorten or end your career because of the embarrassment or financial damage to your employer. No matter how expert you are with the language, you are liable to make the following mistakes, if you are not forewarned. The list that follows are mistakes that have come from my own experience in that I have nearly made them myself or I have come across them in other people's code. Most of them have to do with a simple merge of two datasets.

Missing "by" statement in merge

There is an option you can set to detect missing by statements in merges. You can specify mergenoby=nowarn|warn|error. It might be a good idea to always set this to mergenoby=error at the head of your code but you have got to keep in mind that you might be calling standard macros that do unusual merges and so you might get an error message coming out of one of them. If you don't, then all well and good. Maybe you have an autoexec that you could specify this option in or a piece of code you always "%inc"lude at the top of every program and so you could put it in there. If you can do this so you are sure it is there, then you will never hit the problem I am describing next.

This is one of these errors you make when working under pressure. You are merging two datasets together. You are being very careful to have only the variables you want in the keep lists. You make sure the "where" clause is correct. You carefully code the rest of the data step. You run it. It works. There are no errors or warning. You move on to the next task in the certain knowledge that your code is correct. BUT - you have forgotten the "by" statement.

Take a look at this log and print. It is the simplest illustration of this type of error. I am merging a treatment dataset with an AE dataset but have missed out the "by" statement. Do you see any errors in the log? Any warnings? No, there are none. The observations will be merged one on one in this case since no "by" statement was specified. You end up with subj=1 having two different treatment groups and subj=2 having none.

Make an error like that for a pharmaceutical company and you are dead meat. Miss it on a QC and you are dead meat.

47 data tmt;
48 subj=1;tmt=0;output;
49 subj=2;tmt=1;output;
50 run;

NOTE: The data set WORK.TMT has 2 observations and 2 variables.
NOTE: DATA statement used: real time 0.04 seconds

51
52 data aes;
53 length ae $ 10;
54 subj=1;ae='cough';output;
55 subj=1;ae='cold';output;
56 subj=2;ae='flu';output;
57 run;

NOTE: The data set WORK.AES has 3 observations and 2 variables.
NOTE: DATA statement used: real time 0.00 seconds

58
59 data bad;
60 merge tmt aes;
61 run;

NOTE: There were 2 observations read from the data set WORK.TMT.
NOTE: There were 3 observations read from the data set WORK.AES.
NOTE: The data set WORK.BAD has 3 observations and 3 variables.
NOTE: DATA statement used: real time 0.05 seconds

62
63 options nocenter;
64 proc print data=bad;
65 run;

NOTE: There were 3 observations read from the data set WORK.BAD.
NOTE: PROCEDURE PRINT used: real time 0.05 seconds

Obs subj tmt ae

 1    1   0 cough
 2    1   1 cold
 3    2   . flu

Existing "in" variable in merge

Let us suppose you would never, ever miss off a "by" statement in a merge as in the previous example, even if you were working under extreme pressure and doing compulsory overtime. You are too good for that. And because you realize the importance of using meaningful variable names you assign meaningful names to the "in" statement when you are merging your two datasets. So let us look at a corrected version of the code above and its log and print.

77
78 data bad;
79 merge tmt(in=tmt) aes(in=aes);
80 by subj;
81 if tmt and aes;
82 run;

NOTE: The variable tmt exists on an input data set, but was also specified in an I/O statement option. The variable will not be included on any output data set.
NOTE: There were 2 observations read from the data set WORK.TMT.
NOTE: There were 3 observations read from the data set WORK.AES.
NOTE: The data set WORK.BAD has 1 observations and 2 variables.
NOTE: DATA statement used: real time 0.05 seconds

83
84 options nocenter;
85 proc print data=bad;
86 run;

NOTE: There were 1 observations read from the data set WORK.BAD.
NOTE: PROCEDURE PRINT used: real time 0.04 seconds

Obs subj ae

 1    2  flu
 

Do you see that the treatment variable is missing from the output dataset and that there is only one observation? Look at that first NOTE in the log after the merge. The situation now is not as bad as it used to be. In the good old days the treatment variable will still have been there on the output dataset and you would have not got a NOTE to that effect. So if you make this error these days it will be spotted. If you look at the code written by old-timer SAS programmers you will see some of them put an underscore as the first character of the variable defined to the "in" statement, the assumption being that nobody would have put a variable in a dataset that starts with an underscore so no clash could occur. It is to get round this problem that used to be so damaging in the past. But this problem of declaring a flag variable that already exists occurs again and again in another form as in the next example.

"Flag" variables in data steps

I have seen examples of code in data steps where the logic has become quite convoluted and a "flag" variable is set and changed. It is often called "flag" as well so that it is clear that this is a flag variable. But what appears in that output dataset is scientifically impossible in that it defies all logic. You look at the data step code again and again. You maybe write out examples using paper and pencil. Still the output defies all logic. You go for a cup of coffee to clear your head. You go to lunch, maybe, to give yourself the energy to tackle it anew. You come back to it refreshed and replenished and it STILL doesn't make any sense.

Well what has happened is that good ideas tend to be reused. Either the same programmer or a different programmer also thought it would be a good idea to use a "flag" variable in a data step. But they didn't think to drop it from the output dataset. So the flag variable is already in the input dataset. With each observation you are reading you could be resetting your flag variable. And THIS is why the output dataset is not as you expected it to be.

If you are lucky then this problem will drive you mad and stop you making further progress until you have spotted the problem. If you are unlucky (which is far more likely) then errors will be introduced that you have not spotted. This sort of convoluted logic is more often found in efficacy analysis where accuracy is of the utmost importance. So it will be the efficacy programmer or statistician who make the above mistake. Their turn to be dead meat.

"Fixing" data during a merge

Very rarely something so unusual happens with the data that the only way you can handle it is to "fix" the data in the code. It vary rarely happens that a subject can end up in the wrong treatment arm and experience a recorded event while under treatment in that wrong treatment arm. And when you come to do the safety tables you need to assign that event to the correct treatment arm. So to do this you reset the treatment arm for just that one subject and their one event. And because you are mindful of efficiency and because that dataset is huge you don't waste processing time by doing this in a following data step. You do it in the merge itself.

I've kept this example as simple as possible. Take a look at the following log and the print.

87 data tmt;
88 subj=1;tmtarm='On drug';output;
89 subj=2;tmtarm='Placebo';output;
90 run;

NOTE: The data set WORK.TMT has 2 observations and 2 variables.
NOTE: DATA statement used: real time 0.05 seconds

91
92 data aes;
93 length ae $ 10;
94 subj=1;ae='cough';output;
95 subj=1;ae='cold';output;
96 subj=1;ae='vomiting';output;
97 subj=1;ae='death';output;
98 subj=2;ae='flu';output;
99 run;

NOTE: The data set WORK.AES has 5 observations and 2 variables.
NOTE: DATA statement used: real time 0.00 seconds

100
101 data bad;
102 merge tmt(in=_tmt) aes(in=_aes);
103 by subj;
104 if _tmt and _aes;
105 if subj EQ 1 and ae='cold' then tmtarm='Placebo'; /* data fix */
106 run;

NOTE: There were 2 observations read from the data set WORK.TMT.
NOTE: There were 5 observations read from the data set WORK.AES.
NOTE: The data set WORK.BAD has 5 observations and 3 variables.
NOTE: DATA statement used: real time 0.10 seconds

107
108 options nocenter;
109 proc print data=bad;
110 run;

NOTE: There were 5 observations read from the data set WORK.BAD.
NOTE: PROCEDURE PRINT used: real time 0.00 seconds

Obs subj tmtarm  ae

1     1  On drug cough
2     1  Placebo cold
3     1  Placebo vomitting
4     1  Placebo death
5     2  Placebo flu

If you look at the above you will see that although it was intended to move subj=1 ae of "cold" to the placebo group only, this setting has remained for the following aes. "Vomiting" and "death" are now in the Placebo arm when they should have been in the "On Drug" arm.

Maybe you make a few errors like that and some serious AEs that happened on drug are assigned to the Placebo arm. Your submission goes to the FDA with a table that shows that just as many serious AEs happened on placebo as on drug. Your drug gets approved. It goes to market. Then somebody later discovers your little "faux pas" and are obliged to report it to the FDA. You get a multi-million dollar fine and your drug gets pulled. They send the auditors in to audit your entire Biometrics department at your company's expense. Oops!

Make this mistake and report a serious AE in the wrong treatment arm and you are dead meat. Fail to spot it during a QC and you are dead meat.

Averaging averages

You probably won't be fired for making this error but it is still a very bad one to make and can result in tables being reissued and resent. Your manager might get fired if it happens too often but not you. The error is in averaging averages. Even statos trip up on this one and you would have thought they would be the very last people to make this mistake. Let us take daily dose as an example. You have to produce a table showing the average daily dose. You have a week at the start of the trial at half dose and then two two-week periods on full dose. Let us say that in the first week the subjects took one tablet a day and in the following two two-week periods they took two tablets a day until they stopped. So is the average dose (1+2+2)/3 ? No, it is not, because the period on half dose was half of that on full dose. Also, suppose some subjects dropped out of the final two week period. Their time on half dose would play a bigger part. The proper way to calculate it is to divide the total number of tablets taken by the total number of days. So suppose you have correctly calculated this average for each subject by dividing the number of tablets by the number of days on trial. So then you average all these values for each subject to give you your final result. Is that right? No, it is not. It is averaging averages again. To get the correct result you would have to divide the number of tablets taken by all subjects and divide by the number of days on trial added up for all subjects.

Make that mistake as a programmer and you can plead ignorance. You can say "Oh, it is? I didn't realize that". Make that mistake as a stato and you get a red face. Keep making that mistake as programmers and statos and keep reissuing tables and you manager is dead meat.