Tuesday, March 27, 2012

Making coded values in SAS more descriptive with PROC FORMAT

Very often, when dealing with SAS datasets, we are faced with procedure output that contains non-descriptive codes.  When we, for example, proc freq these coded variables, it would be more useful to see helpful human-intelligible descriptions instead of just the raw codes.  Sometimes we are supplied with lookup/dimension tables that contain code descriptions/definitions and sometimes we have to create these ourselves.  In any case, once we have lookup tables for the codes, it can be a good idea to make SAS formats from them and attach the formats to the appropriate variables in the detail-level dataset.

This post will show how to do that using a very simple made-up set of three tables as shown here:



The detail table SALES contains one record per sale.  The variables PRODUCT_ID and SIZE_CD indicate what product and what size product was sold.  Ten sample observations are shown above.  The 'data' library also contains a pair of lookup tables to translate the PRODUCT_ID into a description and SIZE_CD into a description.  The lookup tables are called PRODUCT_LOOKUP and SIZE_LOOKUP.

Here is SAS code to create the three tables.

libname data 'fill in your own physical path here';

data data.sales;
    length product_id $5. size_cd 3.;
    input product_id $ size_cd;
    datalines;
ICE 1
SNO 2
ICE 3
ICE 1
SNO 1
SAND 2
ICE 2
ICE 2
SNO 3
SNO 1
;
run;

data data.product_lookup;
    length product_id $5. description $20.;
    input product_id $ 1-5 description $ 6-25;
    datalines;
ICE  ICE CREAM
SNO  SNOW CONE
SAND ICE CREAM SANDWICH
;
run;

data data.size_lookup;
    length size_cd 3. description $20.;
    input size_cd description $;
    datalines;
1 SMALL
2 REGULAR
3 LARGE
;
run;

If we run a proc freq to crosstab products with sizes, we get this (first code, then output):

proc freq data=data.sales;
    tables product_id * size_cd;
run;

The FREQ Procedure

Table of product_id by size_cd

product_id     size_cd

Frequency|
Percent  |
Row Pct  |
Col Pct  |       1|       2|       3|  Total
---------+--------+--------+--------+
ICE      |      2 |      2 |      1 |      5
         |  20.00 |  20.00 |  10.00 |  50.00
         |  40.00 |  40.00 |  20.00 |
         |  50.00 |  50.00 |  50.00 |
---------+--------+--------+--------+
SAND     |      0 |      1 |      0 |      1
         |   0.00 |  10.00 |   0.00 |  10.00
         |   0.00 | 100.00 |   0.00 |
         |   0.00 |  25.00 |   0.00 |
---------+--------+--------+--------+
SNO      |      2 |      1 |      1 |      4
         |  20.00 |  10.00 |  10.00 |  40.00
         |  50.00 |  25.00 |  25.00 |
         |  50.00 |  25.00 |  50.00 |
---------+--------+--------+--------+
Total           4        4        2       10
            40.00    40.00    20.00   100.00

While we are actively working on the sales data project, we might easily retain in our memory the fact that size 1 is small, size 2 is regular, and size 3 is large, but what if we set aside this project for 6 months or longer and then come back to it?  What if we had hundreds of products and dozens of size options?  What if we need to hand this project off to a new analyst or present findings to other staff not familiar with the data?  It would make all of our lives easier if the proc freq output indicated what sizes 1, 2, and 3 mean, as well as what the product codes mean.  We have the data already to provide helpful descriptions (the two lookup tables), so let's work at incorporating them.

Using proc format with the cntlin parameter gives us the ability to create a format from a dataset.  In this case, the dataset needs only three fields:
  1. START - the code value,
  2. LABEL - the description associated with the code, and
  3. FMTNAME - the name of the format to be created (same value repeated every record).
To avoid needlessly creating a copy of the lookup data, I use proc sql views to create the proc format input dataset.  Views work just as well as datasets in this context; proc format does not care.  A view is simply a stored query.  Only the query definition is stored on disk, not an entire separate copy of the data.

Here's the query to define the proc format input data view for the product format:

proc sql;
    create view for_product_format as
        select product_id as start,
               strip(product_id)||' ('||strip(description)||')'
                   as label,
               '$product' as fmtname
        from data.product_lookup;
quit;

A proc print of the view (proc print data=for_product_format;  run;) looks like this:

     Obs    start    label                        fmtname

       1    ICE      ICE (ICE CREAM)              $product
       2    SNO      SNO (SNOW CONE)              $product
       3    SAND     SAND (ICE CREAM SANDWICH)    $product

Turning this data view into a format is as simple as:

proc format cntlin=for_product_format library=data;  run;

In response to running this code, the SAS log says:

NOTE: Format $PRODUCT has been written to DATA.FORMATS.

In the physical library pointed to by the 'data' libref, a new file called formats.sas7bcat (if using SAS versions 7-9, at least).  The following two lines of code inspect what's in that new format catalog:

proc catalog c=data.formats;  contents;  run;
proc format fmtlib library=data;  select $product;  run;

                Contents of Catalog DATA.FORMATS

#  Name    Type    Create Date      Modified Date    Description
----------------------------------------------------------------
1  PRODUCT FORMATC 26Mar12:15:54:52 26Mar12:15:54:52

----------------------------------------------------------------
| FORMAT NAME: $PRODUCT LENGTH:   25   NUMBER OF VALUES:   3   |
| MIN LENGTH: 1  MAX LENGTH: 40  DEFAULT LENGTH  25  FUZZ: 0   |
|--------------------------------------------------------------|
|START     |END       |LABEL  (VER. V7|V8   26MAR2012:15:54:53)|
|----------+----------+----------------------------------------|
|ICE       |ICE       |ICE (ICE CREAM)                         |
|SAND      |SAND      |SAND (ICE CREAM SANDWICH)               |
|SNO       |SNO       |SNO (SNOW CONE)                         |
----------------------------------------------------------------

The proc catalog output tells us there is one format in the catalog named PRODUCT and it's a character format (so it would be referred to in SAS code as $PRODUCT).  The proc format fmtlib output shows us the individual code and description pairs contained in the format.  As you can see, the constructed labels consist of the code values followed by the descriptions in parentheses.

Code to create a format for SIZE_CD is as follows:

proc sql;
    create view for_size_format as
        select size_cd as start,
         strip(put(size_cd,best.))||' ('||strip(description)||')'
                   as label,
               'size' as fmtname
        from data.size_lookup;
quit;

proc format cntlin=for_size_format library=data;  run;

SIZE_CD is numeric, whereas PRODUCT_ID was character, so SIZE_CD is converted to character with the put function using the built-in best format to make the label, and the name of the format is simply SIZE with no prefixing dollar sign.

Running the proc catalog code again (proc catalog c=data.formats;  contents;  run;) indicates that we do in fact now have two formats.

              Contents of Catalog DATA.FORMATS

#  Name    Type    Create Date      Modified Date   Description
----------------------------------------------------------------
1  SIZE    FORMAT  26Mar12:16:14:27 26Mar12:16:14:27
2  PRODUCT FORMATC 26Mar12:15:54:52 26Mar12:15:54:52


Only one step remains and that is to attach the formats to the variables in the SALES dataset.  That is accomplished with proc datasets as follows:

proc datasets lib=data nolist;
    modify sales;
        format product_id $product. size_cd size.;
quit;

If we run our original proc freq at this point, we will either get an error (if the fmterr option is in effect) or no error and simply the same unformatted values in the freq output.  The reason is that although we created the datasets, created the formats, and associated the formats with the dataset variables, SAS doesn't know where to look for the formats when it comes time to do the proc freq.  We have to give it a hint as follows:

options fmtsearch=(data);

The format search system option tells SAS to look for formats in the 'data' library.  Now when we run our proc freq, we get the desired result.

proc freq data=data.sales;
    tables product_id * size_cd;
run;

The FREQ Procedure

Table of product_id by size_cd

product_id        size_cd

Frequency        |
Percent          |
Row Pct          |
Col Pct          |1 (SMALL|2 (REGUL|3 (LARGE|  Total
                 |)       |AR)     |)       |
-----------------+--------+--------+--------+
ICE (ICE CREAM)  |      2 |      2 |      1 |      5
                 |  20.00 |  20.00 |  10.00 |  50.00
                 |  40.00 |  40.00 |  20.00 |
                 |  50.00 |  50.00 |  50.00 |
-----------------+--------+--------+--------+
SAND (ICE CREAM  |      0 |      1 |      0 |      1
SANDWICH)        |   0.00 |  10.00 |   0.00 |  10.00
                 |   0.00 | 100.00 |   0.00 |
                 |   0.00 |  25.00 |   0.00 |
-----------------+--------+--------+--------+
SNO (SNOW CONE)  |      2 |      1 |      1 |      4
                 |  20.00 |  10.00 |  10.00 |  40.00
                 |  50.00 |  25.00 |  25.00 |
                 |  50.00 |  25.00 |  50.00 |
-----------------+--------+--------+--------+
Total                   4        4        2       10
                    40.00    40.00    20.00   100.00

Since we created permanent datasets, permanent formats, and permanently associated the two, we do not have to have a format statement in the proc freq and this association will persist beyond the current SAS session.  So we can happily shut down SAS, start it up again and issue the following lines without error:

libname data 'fill in your own physical path here';
options fmtsearch=(data); *Remind SAS where the formats are;
proc print data=data.sales;  run;

Obs    product_id                     size_cd

  1    ICE (ICE CREAM)              1 (SMALL)
  2    SNO (SNOW CONE)              2 (REGULAR)
  3    ICE (ICE CREAM)              3 (LARGE)
  4    ICE (ICE CREAM)              1 (SMALL)
  5    SNO (SNOW CONE)              1 (SMALL)
  6    SAND (ICE CREAM SANDWICH)    2 (REGULAR)
  7    ICE (ICE CREAM)              2 (REGULAR)
  8    ICE (ICE CREAM)              2 (REGULAR)
  9    SNO (SNOW CONE)              3 (LARGE)
 10    SNO (SNOW CONE)              1 (SMALL)

One final note.  I find it helpful to tailor the description/label portion of the formats to include the actual code value (e.g., 2) and the descriptive value (e.g., SMALL).  The reason is that while the description is the more --um-- descriptive portion of the format label, the underlying SALES dataset still only really contains the coded values.  Applying a format changes how data is displayed, not the stored values.  Thus, to subset the small product transactions, you would use this where statement where size_cd=1 rather than this where statement where size_cd='1 (SMALL)' and having the 1 right in the label saves us having to look that up what "SMALL" means in terms of coded data values.

No comments:

Post a Comment