Improving Bootstrap Performance of SAS Code

Author: Roland Rashleigh-Berry
Updated: 20 Nov 2011

Introduction

Bootstrapping is an increasingly popular technique, and a Google search will turn up many good articles on it. This article is not about how to use the technique but about how to get the best performance from it so that run times are kept to a minimum. The difference in CPU and elapsed time between the best and the worst approach can be a factor of 4. In extreme cases, bootstrap processing can take more than a day to run, so cutting that to a quarter of the time is well worth striving for.

What slows it down the most?

What slows bootstrapping down the most is calling SAS procedures multiple times instead of calling them once using BY processing. Programmers often call these procedures in a macro loop, perhaps to avoid creating a dataset with millions of observations in it, but creating that one large dataset is actually the best approach. Datasets with a million or more observations are not a problem. Even ten million observations is okay. Invoking a procedure such as "proc logistic" is expensive in terms of time and resources: it is a lot of work to load the procedure and to clean up after it ends. It is better to call it once using BY processing, with one BY group per generated sample, than to call it maybe ten thousand times, once for each sample. In the case of "proc logistic" you will reduce the run time to one quarter if you call the procedure only once using BY processing rather than calling it multiple times.
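
As a rough sketch of the difference, the two approaches might look like the code below. The dataset name allsamples, the BY variable replicate, the response variable resp and the macro name slowboot are all made up for illustration, and the model is only a placeholder. The dataset is assumed to hold every generated sample, already in replicate order.

```sas
*- slow: the procedure is loaded and cleaned up once per sample -;
*- (allsamples, replicate, resp and slowboot are hypothetical names) -;
%macro slowboot(reps=);
  %local r;
  %do r=1 %to &reps;
    proc logistic data=allsamples(where=(replicate=&r));
      model resp(event='1') = trt;
    run;
  %end;
%mend slowboot;

*- faster: the procedure is called once, with one BY group per sample -;
*- (allsamples must already be in replicate order) -;
proc logistic data=allsamples;
  by replicate;
  model resp(event='1') = trt;
run;
```

Both forms fit the same model to each sample. Only the number of procedure invocations changes.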

SASFILE Statement

Your main performance gain comes from calling SAS procedures only once using BY processing. But there is another lesser-known and rarely used facility in SAS for when you are reading a SAS dataset multiple times or doing random access using the point= option. When you are bootstrapping you are reading a dataset with maybe about 300 observations in each treatment arm and creating maybe a million observations by sampling. You have probably split the data into two datasets by treatment arm, so that one dataset contains the drug results and the other contains the placebo patients (or the other treatment arm), and you generate samples from each dataset. In these circumstances you can get a huge performance boost using the SASFILE statement. This loads the dataset into the computer's memory. It reads the dataset into memory at normal speed, but once it is in memory it is six times faster to access the data. This technique cannot be used for all your data. Your datasets must not be too large, and it is only worth using this technique if you are reading the dataset multiple times (or making a very large number of random accesses that equate to reading it multiple times). It is done simply like this:
 
SASFILE mysas.dataset LOAD;

*- read the dataset multiple times or access it randomly many times here -;

SASFILE mysas.dataset CLOSE;

Do not forget to issue the SASFILE CLOSE when you have finished reading the data, so that the memory is freed.

Avoid macro looping

Looping in macro code is very much slower and more resource-hungry than looping in the data step, so avoid macro looping whenever possible and give the looping work to the data step.
 
* Instead of doing it this way, looping with a macro variable (the %do must be inside a macro)... ;

data sample;
  %do i=1 %to &loop;
    loop=&i;
    *- generate data -;
  %end;
  stop;
run;

* Do it this way by looping in the data step ;

data sample;
  do loop=1 to &loop;
    *- generate data -;
  end;
  stop;
run;
 

Avoid sorting the generated data

You will be generating maybe a few million records from each treatment arm and then bringing these two sample datasets together. This is a large amount of data to sort. But if you generate the data already in the order you want, the sort can be avoided and the datasets interleaved instead using a SET statement followed by a BY statement.
 
data treatall;
  set treat0 treat1;
  BY loop sampsize;
run;

proc logistic data=treatall /* etc. */;
  BY loop sampsize;
run;
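
To see the interleave in isolation, here is a small self-contained sketch. The dataset names s0, s1 and sboth are invented for the example, and the loop limits are tiny so that the result can be checked by eye. Each dataset is generated in loop/sampsize order, so no sort is needed before the SET with BY.

```sas
*- two small sample datasets, each generated in loop/sampsize order -;
*- (s0, s1 and sboth are hypothetical names for this illustration) -;
data s0;
  do loop=1 to 2;
    do sampsize=2 to 4 by 2;
      do i=1 to sampsize;
        trt=0;
        output;
      end;
    end;
  end;
  drop i;
run;

data s1;
  do loop=1 to 2;
    do sampsize=2 to 4 by 2;
      do i=1 to sampsize;
        trt=1;
        output;
      end;
    end;
  end;
  drop i;
run;

*- interleave: SET with BY combines the two sorted datasets, no sort needed -;
data sboth;
  set s0 s1;
  by loop sampsize;
run;

*- the combined data is in loop/sampsize order, so BY processing works -;
proc freq data=sboth;
  by loop sampsize;
  tables trt;
run;
```

If either input were not in BY order, the SET with BY would stop with an error rather than silently producing unsorted output, which is a useful safety check.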

What your code might look like

It is difficult to guess what your code will look like, but I will give a rough outline of what I am expecting and how these techniques are applied. I will just deal with running the analysis on the generated data.
 
*- split the data into the different treatment arms -;

data treat0 treat1;
  set results;
  if trt=0 then output treat0;
  else output treat1;
run;

*- define looping macro -;
%macro loop(loops=,samplow=,samphigh=,incr=,seed=);

  *- load trt=0 data into memory -;
  sasfile treat0 load;

  *- generate samples -;
  data samp0;
    do loop=1 to &loops;
      do sampsize=&samplow to &samphigh by &incr;
        do i=1 to sampsize;
          *- point= must name a variable so work out the pointer first -;
          ptr=ceil(ranuni(&seed)*totobs);
          set treat0 point=ptr nobs=totobs;
          output;
        end;
      end;
    end;
    stop;
    drop i;
  run;

  *- free data from memory -;
  sasfile treat0 close;
 

  *- load trt=1 data into memory -;
  sasfile treat1 load;

  *- generate samples -;
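  data samp1;
    do loop=1 to &loops;
      do sampsize=&samplow to &samphigh by &incr;
        do i=1 to sampsize;
          *- point= must name a variable so work out the pointer first -;
          *- note: the same positive seed repeats the random stream used for treat0 -;
          ptr=ceil(ranuni(&seed)*totobs);
          set treat1 point=ptr nobs=totobs;
          output;
        end;
      end;
    end;
    stop;
    drop i;
  run;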

  *- free data from memory -;
  sasfile treat1 close;
 

  *- add sample data together -;
  data sampboth;
    set samp0 samp1;
    by loop sampsize;
  run;

  *- run whatever procedure is needed -;
  proc logistic /* or some other procedure */ data=sampboth;
    by loop sampsize;
    /* model statement or whatever goes here */
  run;

%mend loop;

*- call the macro -;
%loop(loops=1000,samplow=50,samphigh=300,incr=50,seed=99);

 

Conclusion

On this page I have shown how to get optimum speed when running bootstrap processing. There is a speed factor of 4 between the worst case and the best case. The major saving comes from using BY processing and calling SAS procedures only once.
 


 
