[last updated - 12 January 2003]
This is one of these errors you make when working under pressure. You are merging two datasets together. You are being very careful to have only the variables you want in the keep lists. You make sure the "where" clause is correct. You carefully code the rest of the data step. You run it. It works. There are no errors or warning. You move on to the next task in the certain knowledge that your code is correct. BUT - you have forgotten the "by" statement.
Take a look at this log and print. It is the simplest illustration of this type of error. I am merging a treatment dataset with an AE dataset but have missed out the "by" statement. Do you see any errors in the log? Any warnings? No, there are none. The observations will be merged one on one in this case since no "by" statement was specified. You end up with subj=1 having two different treatment groups and subj=2 having none.
Make an error like that for a pharmaceutical company and you are dead meat. Miss it on a QC and you are dead meat.
47 data tmt;
48 subj=1;tmt=0;output;
49 subj=2;tmt=1;output;
50 run;
NOTE: The data set WORK.TMT has 2 observations and 2 variables.
NOTE: DATA statement used: real time 0.04 seconds
51
52 data aes;
53 length ae $ 10;
54 subj=1;ae='cough';output;
55 subj=1;ae='cold';output;
56 subj=2;ae='flu';output;
57 run;
NOTE: The data set WORK.AES has 3 observations and 2 variables.
NOTE: DATA statement used: real time 0.00 seconds
58
59 data bad;
60 merge tmt aes;
61 run;
NOTE: There were 2 observations read from the data set WORK.TMT.
NOTE: There were 3 observations read from the data set WORK.AES.
NOTE: The data set WORK.BAD has 3 observations and 3 variables.
NOTE: DATA statement used: real time 0.05 seconds
62
63 options nocenter;
64 proc print data=bad;
65 run;
NOTE: There were 3 observations read from the data set WORK.BAD.
NOTE: PROCEDURE PRINT used: real time 0.05 seconds
Obs subj tmt ae
1 1 0 cough
2 1 1 cold
3 2 . flu
77
78 data bad;
79 merge tmt(in=tmt) aes(in=aes);
80 by subj;
81 if tmt and aes;
82 run;
NOTE: The variable tmt exists on an input data set, but was also
specified in an I/O statement option. The variable will not be included
on any output data set.
NOTE: There were 2 observations read from the data set WORK.TMT.
NOTE: There were 3 observations read from the data set WORK.AES.
NOTE: The data set WORK.BAD has 1 observations and 2 variables.
NOTE: DATA statement used: real time 0.05 seconds
83
84 options nocenter;
85 proc print data=bad;
86 run;
NOTE: There were 1 observations read from the data set WORK.BAD.
NOTE: PROCEDURE PRINT used: real time 0.04 seconds
Obs subj ae
1 2 flu
Do you see that the treatment variable is missing from the output dataset and that there is only one observation? Look at that first NOTE in the log after the merge. The situation now is not as bad as it used to be. In the good old days the treatment variable will still have been there on the output dataset and you would have not got a NOTE to that effect. So if you make this error these days it will be spotted. If you look at the code written by old-timer SAS programmers you will see some of them put an underscore as the first character of the variable defined to the "in" statement, the assumption being that nobody would have put a variable in a dataset that starts with an underscore so no clash could occur. It is to get round this problem that used to be so damaging in the past. But this problem of declaring a flag variable that already exists occurs again and again in another form as in the next example.
Well what has happened is that good ideas tend to be reused. Either the same programmer or a different programmer also thought it would be a good idea to use a "flag" variable in a data step. But they didn't think to drop it from the output dataset. So the flag variable is already in the input dataset. With each observation you are reading you could be resetting your flag variable. And THIS is why the output dataset is not as you expected it to be.
If you are lucky then this problem will drive you mad and stop you making further progress until you have spotted the problem. If you are unlucky (which is far more likely) then errors will be introduced that you have not spotted. This sort of convoluted logic is more often found in efficacy analysis where accuracy is of the utmost importance. So it will be the efficacy programmer or statistician who make the above mistake. Their turn to be dead meat.
I've kept this example as simple as possible. Take a look at the following log and the print.
87 data tmt;
88 subj=1;tmtarm='On drug';output;
89 subj=2;tmtarm='Placebo';output;
90 run;
NOTE: The data set WORK.TMT has 2 observations and 2 variables.
NOTE: DATA statement used: real time 0.05 seconds
91
92 data aes;
93 length ae $ 10;
94 subj=1;ae='cough';output;
95 subj=1;ae='cold';output;
96 subj=1;ae='vomiting';output;
97 subj=1;ae='death';output;
98 subj=2;ae='flu';output;
99 run;
NOTE: The data set WORK.AES has 5 observations and 2 variables.
NOTE: DATA statement used: real time 0.00 seconds
100
101 data bad;
102 merge tmt(in=_tmt) aes(in=_aes);
103 by subj;
104 if _tmt and _aes;
105 if subj EQ 1 and ae='cold' then tmtarm='Placebo'; /* data fix
*/
106 run;
NOTE: There were 2 observations read from the data set WORK.TMT.
NOTE: There were 5 observations read from the data set WORK.AES.
NOTE: The data set WORK.BAD has 5 observations and 3 variables.
NOTE: DATA statement used: real time 0.10 seconds
107
108 options nocenter;
109 proc print data=bad;
110 run;
NOTE: There were 5 observations read from the data set WORK.BAD.
NOTE: PROCEDURE PRINT used: real time 0.00 seconds
Obs subj tmtarm ae
1 1 On drug cough
2 1 Placebo cold
3 1 Placebo vomitting
4 1 Placebo death
5 2 Placebo flu
If you look at the above you will see that although it was intended to move subj=1 ae of "cold" to the placebo group only, this setting has remained for the following aes. "Vomiting" and "death" are now in the Placebo arm when they should have been in the "On Drug" arm.
Maybe you make a few errors like that and some serious AEs that happened on drug are assigned to the Placebo arm. Your submission goes to the FDA with a table that shows that just as many serious AEs happened on placebo as on drug. Your drug gets approved. It goes to market. Then somebody later discovers your little "faux pas" and are obliged to report it to the FDA. You get a multi-million dollar fine and your drug gets pulled. They send the auditors in to audit your entire Biometrics department at your company's expense. Oops!
Make this mistake and report a serious AE in the wrong treatment arm and you are dead meat. Fail to spot it during a QC and you are dead meat.
Make that mistake as a programmer and you can plead ignorance. You can say "Oh, it is? I didn't realize that". Make that mistake as a stato and you get a red face. Keep making that mistake as programmers and statos and keep reissuing tables and you manager is dead meat.