Using unistats with Lab data (4)

Last updated: 23 Jan 2012

Introduction

This page is not about using %unistats with lab data. Instead, it is a page about how to efficiently extract your lab data and how to efficiently handle it before making your call to %unistats. Once you call %unistats then your performance problems are over because %unistats is an extremely efficient macro and can summarize huge lab datasets, usually in a matter of seconds. If you are using %unistats with lab data and your program runs slow then it will not be the fault of %unistats. It will be because you are not extracting and handling the lab data in the optimum way.

The problem with lab data

The problem with lab data, simply put, is that it is too large for computers to handle easily. There are too many observations and too many variables so it takes up too much space, making it slow to read in and too large to optimise computer speed by using the computer's working memory. Computers are limited devices. If a data source stored on a computer disk is very large then it has to be read directly off the disk magnetic media which is slow (about 40 Mbytes per second). Computers have communication channels which might be slow and shared among several users (maybe only 125 Mbytes per second shared between all users). They have limited working memory for holding data. Because of the limited working memory, the data has to be shuffled back and forth to work disks which has the effect of delaying itself and in addition this could be happening in competition with other usersl. In this case, you could end up with the computer "thrashing".

Computer are optimized to work with relatively small chunks of data and their performance is impressive in these circumstances. But once you go outside this comfortable limit, things slow down a lot ! The worst thing you can do with lab data is to make it any larger than it already is.

People can come up with many good reasons why lab data should have more variables added to it, for QC purposes, reference purposes or whatever, but the amount of data is already too large and adding more variables makes it more difficult for a computer to process the data.

The computer can cope with pure lab data for an average study but processing times will increase in cubic proportions once you get past a certain limit (approximately half a gigabyte). Double it and your processing can take eight times longer. Increase it five fold and your processing could take 125 times longer. If you have pooled study data and you add many variables to the lab data then processing it could take days! If you are processing pooled study lab data then you already have a problem. It is best not to make that problem worse.

"Derived" or "value added" datasets are popular and it is easier to produce tables from them but for lab data you have got to be very careful in adding extra variables because of the increase in processing time that will result.

Another problem comes from text variables being much larger than they need to be because somebody thought it would be a good idea to "standardise" them and may have increased the length of some character fields to 200 bytes. If this is the case then you should explain the problem this causes and ask for them to be made smaller. Even if you are not keeping these variables you will still be inconvenienced because it will take longer to read the lab data.

What is "slow" with lab data?

If your lab dataset contains one million observations then to process it and summarize it should take less than ten minutes even in the most complex cases. If it takes longer then somebody is doing something wrong.

What is "fast" with lab data?

You are starting to get fast with lab data when you are processing 10,000 lab records per second. If you get to a speed of processing 15,000 lab records per second then this is extremely good and you probably have the computer to thank for some of this speed.

What is the Big Secret of efficient lab data handling?

The Big Secret about handling lab data is mostly to do with limiting the number of retained variables , keeping the SQL simple and sorting the data as infrequently as possible

Save only those variables you need. You can access more if you need to derive some variables and you can do that in SQL but only keep the ones you need for future processing
Create a useful lab datetime variable from the date and time variable (if not present) using "dhms(labdt,0,0,labtm) as labdttm" in your SQL and drop the data and time variables if possible. Even if you need to use labdt for a calculation of study day then you can use datepart(labdttm) instead.
Before quitting SQL, sort the lab data into study / patient / lab code / visit number / repeat visit number order using an "order by" statement
Merge with your patient information (which should already be in study / patient order) in a data step using "merge" AFTER you leave SQL so you know it will be done efficiently
NEVER sort your lab data after merging with patient information unless you feel it is already nearly in the order you want
FLAG your lab observations as much as you need to, such as those that are pre-treatment, on-treatment, post-treatment, repeat values, the baseline value and last-value-on-treatment

Extracting Patient and Lab data using SQL

Keep patient data and lab data separate in your SQL. Do a final sort in study / patient / lab code / visit number / repeat visit number order for lab data. Do a final sort for your patient data in study / patient order. Use "left join" a lot because it allows you to reference variables to derive other values but you don't have to keep them. When you do your final sort of lab data you should only keep those variables you need to reference in the future.

Some efficient example SQL

If you use SQL as shown below then your problems should be avoided or at least minimised. It will give you speed as well as flexibilty. It will work well for pooled study data. You will see that the SQL is very simple and I will be doing a lot of other work outside of SQL. It is not that I do not like using SQL. It is because it is hard to predict how SQL will behave when it is handling huge volumes of data. In those circumstances you at least know what is happening if you use data steps and the results will be predictable.

Look how the patient information is kept separate and is built up using "left join"s using "on" statements to link the study and patient information to the population information for each table
Look how I am keeping two variables that are resolved by macro parameters &trtvar and &trtdc. These are the coded and decoded forms. The decoded form might be a long variables so we should not keep more of these variables than we need.
Look how the very first table used in building the patient data is the population table so that with the following "left join" processing then only information for the desired patient population is kept
Look how patinfo datetime variables are being created using date and time variables but only the datetime variables are kept
After the patient table has been created then some useful information is extracted from this relatively small _patinfo table and stored in macro variables
Look how the lab grouping data has been kept separate. It makes no sense to merge it with the lab data at this stage. It only makes sense to merge it after the lab data has been summarized and note that I have the labnmx (the label of the lab parameter) in here and not in the lab data to save space as I know the merge with this data will work
Look how the lab data has been "left join"ed with some reference range data to calculate normalized values and that both these sets of reference range variables are not kept
Look how the lab data is sorted into study / patient order as the first two sort keys at the end for a later data step merge with the patient data
Look how I am selecting ALL the lab data and not applying any patient population subsetting. This is because it is more efficient to do this at a later stage

proc sql noprint;

    create table _patinfo as (
      select pop.study, pop.ptno, pop.popu, pop.popudc,
      trt.analno, trt.anallbl,
      trt.&trtvar, trt.&trtdc,
      dhms(pre.atrstdt,0,0,pre.atrsttm) format=datetime19. label="pre-treatment start" as predtst,
      dhms(trt.atrstdt,0,0,trt.atrsttm) format=datetime19. label="on-treatment start"   as ontdtst,
      dhms(trt.atrspdt,0,0,trt.atrsptm) format=datetime19. label="on-treatment stop"    as ontdtsp,
      dhms(pos.atrstdt,0,0,pos.atrsttm) format=datetime19. label="post-treatment start" as posdtst,
      dhms(pos.atrspdt,0,0,pos.atrsptm) format=datetime19. label="post-treatment stop" as posdtsp,
      dhms(pst.atrstdt,0,0,pst.atrsttm) format=datetime19. label="post-study start"     as pstdtst,
      dhms(pst.atrspdt,0,0,pst.atrsptm) format=datetime19. label="post-study stop"      as pstdtsp

/*--- population ---*/
from &inpopu(where=(popu="&popu" and popuny=1)) as pop

      /*--- on treatment ---*/
      left join &intrt(where=(analno=&analno)) as trt
      on pop.study=trt.study and pop.ptno=trt.ptno

      /*--- pre treatment ---*/
      left join &intrt(where=(analno=6)) as pre
      on pop.study=pre.study and pop.ptno=pre.ptno

      /*--- post treatment ---*/
      left join &intrt(where=(analno=1 and atrcd="90000003")) as pos
      on pop.study=pos.study and pop.ptno=pos.ptno

      /*--- post study ---*/
      left join &intrt(where=(analno=1 and atrcd="90000011")) as pst
      on pop.study=pst.study and pop.ptno=pst.ptno

) order by study, ptno;

    select distinct(study)   into: studies separated by " " from _patinfo;
    select distinct(popudc) into: popudc separated by " " from _patinfo;
    select distinct(anallbl) into: anallbl separated by " " from _patinfo;

    *- lab grouping variables -;
    create table _labgrp as (
      select labnm, labnmx, labgrp, labgrpx, labnmor from &instds(where=(type='LG'))
    );

    *- lab values -;
    create table _lab as (
      select a.study, a.ptno, a.labnm, a.lab, a.labun, a.labstd, a.labstdu,
      a.visno, a.cpevent, a.subevno, b.rounding,
      dhms(labdt,0,0,labtm) format=datetime19. label="lab datetime" as labdttm,
      b.lln+(a.labstd-a.llc)*(b.uln-b.lln)/(a.ulc-a.llc)
        label="Lab value norm" format=15.5 as labn
      from &inlab as a
      left join &instds(where=(type='RR' and version=5)) as b
      on a.labnm=b.labnm
     ) order by study, ptno, labnm, visno, subevno, labdttm;

quit;

Post processing of patient data

As soon as we leave SQL you can do some Postprocessing of the patient data. It is often clearer and easier to do it this way. And it can be a useful source for creating extra needed formats. Remember that this dataset is small so to process it is fast. See how I have added some variables and then used the original patient table as a source for creating more formats.

I don't need the two variables anallbl and popudc anymore because I wrote them to macro variables in the SQL step so I drop them here to save space. It's not very important because once we have the lab data in study / patient order then things run fast after that but every small thing helps.

                /************************
                      Extend patinfo
                 ************************/
data _patinfo2;
    set _patinfo;
    *- combine post-treatment and post-study start and stop -;
    *- and make sure that the start is not before the       -;
    *- on-treatment stop datetime.                          -;
    cmbdtst=max(ontdtsp,min(posdtst,pstdtst));
    cmbdtsp=max(posdtsp,pstdtsp);
    format cmbdtst cmbdtsp datetime19.;
    *- these two no longer needed -;
    drop anallbl popudc;
    label cmbdtst="post-treatment/study start"
          cmbdtsp="post-treatment/study stop"
          ;
run;
                /************************
                      Create formats
                 ************************/
*- create a format to map treatment code to the decode label -;
%mkformat(_patinfo,&trtvar,&trtdc,$atrfmt,indent=0);
*- create a format for treatment arm totals -;
%popfmt(dsin=_patinfo,trtvar=&trtvar,trtfmt=$atrfmt.,uniqueid=study ptno,
          split=%str( ),msgs=no);

Merging patient data with lab data

Remember that we already have the dataset in the correct order to merge by study patient because we "order"ed them in this way in the SQL.

See how I am adding flag variables and also only keeping lab values for that population by using "if _a and _b". This is the best time for dropping lab observations from patient populations we are not interested in. We know it will be done efficiently this way.

See how I sort it at the end with the variable "_period" added. This will not be a problem because the lab data is already nearly in this order so the sort step will not take long. The sort algorithms will detect this and use an efficient method.

You can see that I am creating three flag variables for the different time periods. This is inefficient but it was written as QC code and I need to do a "proc compare" against the original dataset. But what I have done is declare them a length of 3 to hopefully save space. If some of your text variables are too large then you should reduce the length of them at this stage so the following sort is faster.

                /************************
                      Add lab flags
                 ************************/
*- _period is used as a work variable which will be dropped -;
*- from _linder and added again at a later stage. Only lab -;
*- values for the specified treatment population are kept. -;
data _lab2;
    length _fgprev _fgontv _fgpostv 3;
    merge _patinfo2(in=_a) _lab(in=_b);
    by study ptno;
    if _a and _b;
    if labdttm<=predtst then _period=0;
    else if predtst<labdttm<=ontdtst then do;
      _period=1;
      _fgprev=1;
    end;
    else if ontdtst<labdttm<=ontdtsp then do;
      _period=2;
      _fgontv=1;
    end;
    else if cmbdtst<labdttm<=cmbdtsp then do;
      _period=9;
      _fgpostv=1;
    end;
    else _period=99;
    drop predtst ontdtst ontdtsp posdtst posdtsp
         pstdtst pstdtsp cmbdtst cmbdtsp;
    label _fgprev="Pre-treatment value flag"
          _fgontv="On-treatment value flag"
          _fgpostv="Post-treatment value flag"
          ;
    format _fgprev _fgontv _fgpostv 1.;
run;
proc sort data=_lab2;
    by study ptno labnm _period visno subevno;
run;

Flagging REPEAT values

If "repeat" values are to do with "visno" "FIRST." or "LAST." values then flagging these repeats is easy as the data is already in the order we need. These flag variables make it easy to visually QC the data to make sure the repeat values are being correctly selected.

                /************************
                    Flag REPEAT values
                 ************************/
data _lab3;
    length _fgrept 3;
    set _lab2;
    by study ptno labnm _period visno subevno;

if _fgprev and not last.visno then _fgrept=&reptval;

if _period=99 and not last.visno then _fgrept=&reptval;
else if _period=2 and not &repeat..visno then _fgrept=&reptval;

label _fgrept="Repeated Values";
format _fgrept 4.;
run;

Flagging Baseline and Last-value-on-treatment observations

What you need to do to flag the Baseline and Last-value-on-treatment values is to extract the minimum number of variables to determine which observations these are and then to merge back in, keeping only the variables you are going to merge on and flag them. Note that I am only keeping in these datasets the variables that I will merge back in with the lab data (study ptno labnm _period). Because I am only keeping one value per study/ptno/labnm/_period then I have no need to do a sort before merging. It will already be in that order. The only sorting I do is on these small datasets with minimum variables in them and this will not take long.

The last data step is the merging back in and flagging using variables _fgbslv (baseline value) and _fglastv (last value). Again, these flag variables make it easy to visually QC the data to make sure the correct values are being selected. It is building a dataset with all the flags in it that can be optionally saved for QC purposes and later will be the source dataset for creating the final report.

                /**********************************
                    Baseline and Last Value Flag
                 **********************************/
*- Sometimes this value is not in the correct visno order so it -;
*- needs to be sorted in labdttm order. All values are used.    -;
proc sort data=_lab3(keep=study ptno labnm _period visno subevno labdttm
                      where=(_period=1)) out=_bslval(drop=labdttm);
    by study ptno labnm _period labdttm;
run;
*- keep only the last -;
data _bslval;
    set _bslval;
    by study ptno labnm _period;
    if last.labnm;
run;

*- Sometimes this value is not in the correct visno order so it -;
*- needs to be sorted in labdttm order. Only non-repeats used. -;
proc sort data=_lab3(keep=study ptno labnm _period visno subevno labdttm _fgrept
                      where=(_period=2 and not _fgrept))
                      out=_lastval(drop=labdttm _fgrept);
    by study ptno labnm _period labdttm;
run;
*- keep only the last -;
data _lastval;
    set _lastval;
    by study ptno labnm _period;
    if last.labnm;
run;

*- merge and set flags -;
data _lab4;
    length _fglastv _fgbslv 3;
    merge _bslval(in=_a) _lastval(in=_b) _lab3;
    by study ptno labnm _period visno subevno;
    if _a then _fgbslv=1;
    if _b then _fglastv=1;
    label _fglastv="Last Value Flag"
          _fgbslv="Baseline Flag"
          ;
    format _fgbslv _fglastv 4.;
run;

Do not sort before calling %unistats

You should not sort your lab data before calling %unistats even if you are telling it the "byvars". It will do any needed sorts internally. Here is a call to %unistats after I have added some variables for help in getting the correct sort order plus a variable _phase for "skipping" lines in "proc report".

                /**************************
                         Summarize
                 **************************/

%let wherecls=and _hasbase=1 and _haslast=1;
%if &showmiss EQ Y %then %let wherecls=;

*- Summarize and transpose by statistics name/_tp value using the -;
*- dpvar=rounding variable to control the number of decimal points -;
*- shown in the output character variables. padmiss=yes is used to -;
*- place non-breaking spaces after a missing value in the character -;
*- variables to preserve decimal point alignment when proc report -;
*- right-aligns these character variables. -;

%unistats(dsin=_laball3(where=((_fgprev or _fgontv or _fgpostv) &wherecls and missing(_fgrept))),
print=no,varlist=_val,msglevel=X,padmiss=yes,msgs=no,
trtvar=_tp,trtfmt=tpfmt.,dpvar=rounding,trtvallist=1 2,
byvars=&trtvar labnm &labunit _phase _period visno cpevent,
nfmt=5.,stdfmt=5.,minfmt=5.,maxfmt=5.,meanfmt=5.,
descstats=&descstat,dstranstattrt=_transtat,leftstr=yes);

Finally, I add the lab ordering values

If you remember, in the initial SQL I created a table named _labgrp that I did not "left join" with the lab data because I said it is better done after it had been summarized. Well now I have summarized it so it is time to merge it with the output dataset _transtat. Note that I am setting up a visit text variable here because if I did it earlier then it would make the lab dataset too large and would slow down the summarizing process. You will see that I kept cpevent by declaring it to be a "byvar" in the call to %unistats and I will use to to form the visit text. Now that we are handling summarized data we don't have to worry about the length of text variables so I set _vistext to 60 characters which is more than enough.

                /**************************
                     Add lab group order
                 **************************/

*- add in the lab group info for proc report ordering -;
proc sql noprint;
    create table _transtat2 as (
    select a.*, b.labnmx, b.labgrp, b.labgrpx, b.labnmor,
    case when _period=3 then " Min value on treatment"
         when _period=4 then " Max value on treatment"
         when _period=5 then " &lastlbl"
         else " "||cpevent
    end length=60 as _vistext
    from _transtat as a
    left join _labgrp as b
    on a.labnm=b.labnm
    ) order by labgrp, labgrpx, labnmor, labnm, labnmx, &labunit;
quit;

It's "proc report" time !

I already set up some titles and footnotes and now it is time to call "proc report" to produce our lovely report !! Note the use of %sysfunc(byte(131)) for the top line of the report. Byte(131) is a mid line height long hyphen so repeat values will form a solid line that is mid line height. The solid line is better placed than it would be if repeating underscores were used.

Note that I have used "order=internal" for the order variables. This is always a good idea.

                /**************************
                       Produce report
                 **************************/
*- adjust column width according to how many statistics we have in each group -;
%let ncolw=8;
%if %words(&_statkeys_) GT 6 %then %let ncolw=6;
*- for normalised values only show if we have a non-missing rounding variable value -;
%let wherecls=;
%if &value EQ N %then %let wherecls=(where=(not missing(rounding)));

*- The _statkeys_ content is dynamically handled in the proc report call. -;
*- If it flows onto the following page then _vistext will be shown at the -;
*- start of the following page because it is defined as an "id" variable. -;
proc report missing headline headskip nowd split="@" spacing=2 data=_transtat2&wherecls;
    by labgrp labgrpx labnmor labnm labnmx &labunit;
    columns ("%sysfunc(repeat(%sysfunc(byte(131)),%eval(&repwidth-1)))"
              &trtvar _phase _period visno _vistext
            /* NOTE: _statkeys_ contains a list of the transposed numeric */
            /* variables but the character variables end in STR. Also the */
            /* numbers 1 and 2 will precede STR since we transposed by    */
            /* the _tp variable as well (see 1STR and 2STR below).        */
            ("_Value at visit_" %suffix(1STR,&_statkeys_))
            ("_Difference from %lowcase(&bslnlbl)_" %suffix(2STR,&_statkeys_))
            );
    define &trtvar / order order=internal noprint;
    define _phase / order order=internal noprint;
    define _period / order order=internal noprint;
    define visno   / order order=internal noprint;
    define _vistext / id order width=26 "Treatment/" " Visit" spacing=0;
    %do j=1 %to 2;
      %do i=1 %to %words(&_statkeys_);
        %let key=%scan(&_statkeys_,&i,%str( ))&j.STR;
        %let spac=;
        %if &i EQ 1 %then %let spac=spacing=&gap;
        define &key / display width=&ncolw &spac right;
      %end;
    %end;
    compute before &trtvar;
      *- _popfmt_ contains the identity of the -;
      *- format created by the %popfmt call    -;
      line @1 &trtvar &_popfmt_;
    endcompute;
    break after _phase / skip;
run;

Conclusion

You have seen how best to extract and handle lab data to keep run times at a minimum when preparing your data for a call to %unistats.

Use the "Back" button of your browser to return to the previous page.

contact the author

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.