LOCF Processing

(Author: Roland Rashleigh-Berry                                                                            Date: 10 Jun 2006)

Introduction

"LOCF" stands for "Last Observation Carried Forward", right? Well, "yes" and "no". It doesn't mean last SAS® software dataset observation carried forward. It means last non-missing value carried forward. And if you have multiple variables containing these values then they will be carried forward independently. If you are a SAS programmer, but not a statistician, it is very easy to fall into the trap of thinking the "observation" being carried forward is an observation in a SAS software dataset. It isn't. And if you code on that basis then you will be making a mistake. It is the non-missing values that are being carried forward and they are being carried forward independently at that. This makes things an awful lot more complicated so I wrote a macro to do this which you can link to below.
locf

I have seen code to do this processing using arrays. In one sense this is a good idea since it is the most efficient way of handling it. However, the solution using this method never struck me as being elegant. I feel the array method springs from an incorrect assumption of what can be achieved using merges. I'll explain in the next section.

LOCF Merging

For some reason, some programmers think a merge, as shown below, can not be done using SAS software.
 
Values to carry forward (values dataset) 
 date     value 
01jan02     1 
01mar02     2 
01may02     3 

Dates to merge with (dates dataset) 
 date 
01jan02 
01feb02 
01mar02 
01apr02 
01may02 
01jun02 

Resulting dataset (final dataset) 
 date     value 
01jan02     1 
01feb02     1 
01mar02     2 
01apr02     2 
01may02     3 
01jun02     3 

It is true that such a merge can not be done using only the variables shown above but what we can do is add a sequence number to the "dates" dataset and increment it when we get a match on date with the "values" dataset like this:
 
data dates; 
  retain _seq 0; 
  merge dates values(in=_val); 
  by date; 
  if _val then _seq=_seq+1; 
run; 

And now we have a sequence number that retains its value corresponding to each observation in the values dataset. So if we simply add the same sequence number to the "values" dataset like this:
 
data values; 
  set values; 
  _seq=_n_; 
run; 

.... then we can merge like this:
 
data final; 
  merge dates values; 
  by _seq; 
  drop _seq; 
run; 

... and this gives us the type of merge we need for LOCF merging. If you need further convincing, then here is some code you can copy and paste and run.
 
data values; 
  date='01jan2002'd;value=1;output; 
  date='01mar2002'd;value=2;output; 
  date='01may2002'd;value=3;output; 
  format date date7.; 
run; 

data dates; 
  date='01jan2002'd;output; 
  date='01feb2002'd;output; 
  date='01mar2002'd;output; 
  date='01apr2002'd;output; 
  date='01may2002'd;output; 
  date='01jun2002'd;output; 
  format date date7.; 
run; 

data dates; 
  retain _seq 0; 
  merge dates values(in=_val keep=date); 
  by date; 
  if _val then _seq=_seq+1; 
run; 

data values; 
  set values; 
  _seq=_n_; 
run; 

data final; 
  merge dates values; 
  by _seq; 
  drop _seq; 
run; 

options nocenter; 
proc print data=final; 
run; 

And if you look at the proc print output then this is what you will see. And I think you will agree that this is correct.
 
Obs       date    value 

 1     01JAN02      1 
 2     01FEB02      1 
 3     01MAR02      2 
 4     01APR02      2 
 5     01MAY02      3 
 6     01JUN02      3 

So at least, in principle, it is possible to merge values in with a time sequence using a sequence.variable we have created. The above is very simple test data, of course, but we can use the same principle in more complex cases. In reality we will have an owner variable such as a subject-id that changes. There may be subcategories like a lab parameter. We would probably use visit numbers instead of dates. But the same principle of adding a sequence number to both datasets and merging using that sequence number could still apply. And it is this principle I have used in my macro solution for this sort of processing.

More Complex LOCF Merging

Let's drop the use of dates and assume we have visit numbers that ascend with time. Also that we have multiple subjects and that we are dealing with lab data with multiple measures. Suppose we have datasets like the following:
 
(values dataset) 
subj  param  visit  value 
 AA    RBC     2      22 
 AA    RBC     4      44 
 AA    WBC     3      33 
 BB    ALB     3      55 

(visits dataset) 
subj  param  visit 
 AA    RBC     2 
 AA    RBC     3 
 AA    RBC     4 
 AA    RBC     5 
 AA    WBC     2 
 AA    WBC     3 
 AA    WBC     4 
 AA    WBC     5 
 BB    ALB     2 
 BB    ALB     3 
 BB    ALB     4 
 BB    ALB     5 

Now before we decide on how to merge these datasets, take a look at the "visits" dataset. Visits 2-5 are the "on-treatment" visits right up to the planned end point which is visit 5. It is for these visits we want to carry forward our last observation. Visit 1 is not shown because it is likely a baseline value and because it was not a measure taken while on and as a result of treatment we would probably not want to carry forward these baseline values. So Visits 2-5 are the "eligible" visit values for carrying forward. Now look at the values dataset. Again these only have the "eligible" visits and their values and these will be the values we will be carrying forward where required. Note that the subj/param pairings in the "visits" dataset match those in the values dataset. It makes sense to only have subj/param pairings to match values otherwise you would have nothing to carry forward. But this is an assumption. You should understand the assumptions being made. Sometimes the assumptions will not be approppriate for a study you are working on and you might have to change the logic.

Now consider how we are going to approach the merge of the two datasets. Look at the data for subject AA. We would not want to carry forward values of RBC into the following WBC so obviously we will be merging using matching on the subj/param pairing. We will have to reset our sequence count at the start of each subj/param grouping and increment from there. And when we add the sequence number to the "values" dataset we will have to do the same so that when we merge using subj/param/_seq we will get the correct matching.

First read through the following code until you understand what it is doing. The first two data steps generate the observations as shown above. The next two data steps add the sequence number and the final data step merges the data before printing it. Then copy and paste the code, run it under SAS software and convince yourself that it is producing the correct output:
 
data values; 
  subj='AA'; 
  param='RBC'; 
  visit=2;value=22;output; 
  visit=4;value=44;output; 
  param='WBC'; 
  visit=3;value=33;output; 
  subj='BB'; 
  param='ALB';
  visit=3;value=55;output; 
run; 

data visits; 
  subj='AA'; 
  do param='RBC', 'WBC'; 
    do visit=2 to 5; 
      output; 
    end; 
  end; 
  subj='BB'; 
  param='ALB'; 
  do visit=2 to 5; 
    output; 
  end; 
run; 

data visits; 
  retain _seq 0; 
  merge visits values(in=_val keep=subj param visit); 
  by subj param visit; 
  if first.param then _seq=0; 
  if _val then _seq=_seq+1; 
run; 

data values; 
  retain _seq 0; 
  set values; 
  by subj param; 
  if first.param then _seq=0; 
  _seq=_seq+1; 
run; 

data final; 
  merge values visits; 
  by subj param _seq; 
  drop _seq; 
run; 

options nocenter; 
proc print data=final; 
run; 

The Macro Solution - %locf

The macro I wrote to do this processing uses the method of merging shown above. It will operate on whatever data you feed into the macro. You should only feed it data that is eligible for carrying forward (for example, you would not feed in baseline visit data if baseline values were not eligible for carrying forward). It is your responsibility to prepare the data beforehand if needed. Non-missing for numeric values means not equal to missing value. Non-missing for a character variable means not a blank. If "0" signifies missing for some of your numeric variables then the dataset you feed in to the %locf macro will have to be an amended one with the 0's set to missing values. The same applies to the character variables. You have to do all the data preparation. The %locf macro does none. And if you want only one or two timepoints in the output dataset then you have to make that selection. The %locf macro will not do it for you. This way I can keep the macro as simple as possible and that way it should stay maintainable.

You should regard this macro as a solution rather than the solution. You need to understand what it is doing and what assumptions are being made. It may or may not match with your study requirements. You might end up writing a new macro that fits your site standards better, based on this macro. I can only go so far with it and keep it as generic and as simple as possible. If you want to develop it further then feel free but I will only be able to support the version you find here. To view the macro click on the link below.
locf
 
 

Use the "Back" button of your browser to return to the previous page

contact the author



SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
 
 

What the world says about data and values
HTML uploaded by GoFTP FREE