Combining character variables with different lengths
[This site is not connected with the SAS Institute]
[last updated - 06 September 2003]
Introduction
You know how our job gets complicated and messy sometimes? This page is
about one of those times. It is to do with when you get data sets from
different sources. Sometimes you will find that the lengths of identically-named
character variables differ in your input data sets. You have got to
be very careful in these circumstances. Let us say you are bringing together
a number of data sets in a "set" statement. As you will know, the name
(case-wise), length and other characteristics of the final variable in
the output data sets will use the characteristics in the first data set
in the list. But supposing you were doing an Integrated Safety Summary
(ISS) from a number of studies then it could be that one of these studies
has had to change the standard length of a character field for some reason.
And if you are relying on using the length of that variable from the first
data set then it is possible that important information could be lost when
longer variables of the same name in other data sets are combined together.
%clength
Because this is a potentially serious problem I wrote a macro to get round
it. But you have got to understand how it works and what it does. It is
called clength. You give it the list of data
sets that you are going to combine using a "set" statement (i.e. one after
the other rather than doing a merge) and it checks all the character variables
in those data sets to see if there is a clash of character variable lengths.
If it finds a clash in one or more variables then it will create a "length"
statement for you so that you can include it at the top of your data statement
that defines the data set that contains all the input data sets. It will
pick the maximum length for each character variable if there is a clash.
And because variable names can vary in case, and traditionally you use
what is in the first input data set in the list, then it will keep that
variable name case from the first data set in the list. If you look at
it, you will see that it is a quite complex macro.
Optimizing character variable lengths
It is possible to optimize character variable lengths, but I don't recommend
that you do it in an indiscriminate way. It is better if a character variable
has a length associated with it that is consistent across all data sets
with that variable. But it is maybe useful to know where a character variable
has been assigned a length that is inappropriate with the contents length.
To this end I wrote a macro called optlength.
This generates a "length" statement for trimming the length of character
variables, if you want to do so, down to the maximum length of the contents
found. But please note that I do not recommend you do this. Character variables
should have a fixed length that should be consistent across other data
sets having the same variable. If you change it to something shorter, then
you are maybe causing problems for other programmers who might receive
those data sets and are unaware you have made a change in variable length.
I am so opposed to making changes like this that the optlength macro will
only generate a length statement for you. It will not apply it. You have
to do this yourself in a following data step and should only do so if it
is appropriate. Because some character variables should never have their
lengths changed no matter what then I wrote a macro to "chomp" out a variable
name and its associated information from the macro string created by the
optlength macro and I called it chompw. It "chomps"
words out of the string so you can specify the variable name and also "chomp"
out its associated "$" and length information in the next two words. I've
got lots of macros on my web site that seem weird and inapplicable but
every single one of them is there for a purpose and at some stage you will
find it has a practical use.
Conclusion
In concluding, I will re-iterate my warning that the same-named variable
coming from different data sets may have different lengths and by default
you should use the longest length. I wrote the clength
macro especially for this and I recommend that you use it. I have written
other tools that will allow you to optimize the length of character variables
but would warn against their indiscriminate use. I have deliberately pulled
back the power of these macros so you have got to deliberately use this
power by coding an extra data step yourself.