Tuesday, March 27, 2012

Making coded values in SAS more descriptive with PROC FORMAT

Very often, when dealing with SAS datasets, we are faced with procedure output that contains non-descriptive codes.  When we, for example, proc freq these coded variables, it would be more useful to see helpful human-intelligible descriptions instead of just the raw codes.  Sometimes we are supplied with lookup/dimension tables that contain code descriptions/definitions and sometimes we have to create these ourselves.  In any case, once we have lookup tables for the codes, it can be a good idea to make SAS formats from them and attach the formats to the appropriate variables in the detail-level dataset.

This post will show how to do that using a very simple made-up set of three tables as shown here:



The detail table SALES contains one record per sale.  The variables PRODUCT_ID and SIZE_CD indicate what product and what size product was sold.  Ten sample observations are shown above.  The 'data' library also contains a pair of lookup tables to translate the PRODUCT_ID into a description and SIZE_CD into a description.  The lookup tables are called PRODUCT_LOOKUP and SIZE_LOOKUP.

Here is SAS code to create the three tables.

libname data 'fill in your own physical path here';

data data.sales;
    length product_id $5. size_cd 3.;
    input product_id $ size_cd;
    datalines;
ICE 1
SNO 2
ICE 3
ICE 1
SNO 1
SAND 2
ICE 2
ICE 2
SNO 3
SNO 1
;
run;

data data.product_lookup;
    length product_id $5. description $20.;
    input product_id $ 1-5 description $ 6-25;
    datalines;
ICE  ICE CREAM
SNO  SNOW CONE
SAND ICE CREAM SANDWICH
;
run;

data data.size_lookup;
    length size_cd 3. description $20.;
    input size_cd description $;
    datalines;
1 SMALL
2 REGULAR
3 LARGE
;
run;

If we run a proc freq to crosstab products with sizes, we get this (first code, then output):

proc freq data=data.sales;
    tables product_id * size_cd;
run;

The FREQ Procedure

Table of product_id by size_cd

product_id     size_cd

Frequency|
Percent  |
Row Pct  |
Col Pct  |       1|       2|       3|  Total
---------+--------+--------+--------+
ICE      |      2 |      2 |      1 |      5
         |  20.00 |  20.00 |  10.00 |  50.00
         |  40.00 |  40.00 |  20.00 |
         |  50.00 |  50.00 |  50.00 |
---------+--------+--------+--------+
SAND     |      0 |      1 |      0 |      1
         |   0.00 |  10.00 |   0.00 |  10.00
         |   0.00 | 100.00 |   0.00 |
         |   0.00 |  25.00 |   0.00 |
---------+--------+--------+--------+
SNO      |      2 |      1 |      1 |      4
         |  20.00 |  10.00 |  10.00 |  40.00
         |  50.00 |  25.00 |  25.00 |
         |  50.00 |  25.00 |  50.00 |
---------+--------+--------+--------+
Total           4        4        2       10
            40.00    40.00    20.00   100.00

While we are actively working on the sales data project, we might easily retain in our memory the fact that size 1 is small, size 2 is regular, and size 3 is large, but what if we set aside this project for 6 months or longer and then come back to it?  What if we had hundreds of products and dozens of size options?  What if we need to hand this project off to a new analyst or present findings to other staff not familiar with the data?  It would make all of our lives easier if the proc freq output indicated what sizes 1, 2, and 3 mean, as well as what the product codes mean.  We have the data already to provide helpful descriptions (the two lookup tables), so let's work at incorporating them.

Using proc format with the cntlin parameter gives us the ability to create a format from a dataset.  In this case, the dataset needs only three fields:
  1. START - the code value,
  2. LABEL - the description associated with the code, and
  3. FMTNAME - the name of the format to be created (same value repeated every record).
To avoid needlessly creating a copy of the lookup data, I use proc sql views to create the proc format input dataset.  Views work just as well as datasets in this context; proc format does not care.  A view is simply a stored query.  Only the query definition is stored on disk, not an entire separate copy of the data.

Here's the query to define the proc format input data view for the product format:

proc sql;
    create view for_product_format as
        select product_id as start,
               strip(product_id)||' ('||strip(description)||')'
                   as label,
               '$product' as fmtname
        from data.product_lookup;
quit;

A proc print of the view (proc print data=for_product_format;  run;) looks like this:

     Obs    start    label                        fmtname

       1    ICE      ICE (ICE CREAM)              $product
       2    SNO      SNO (SNOW CONE)              $product
       3    SAND     SAND (ICE CREAM SANDWICH)    $product

Turning this data view into a format is as simple as:

proc format cntlin=for_product_format library=data;  run;

In response to running this code, the SAS log says:

NOTE: Format $PRODUCT has been written to DATA.FORMATS.

In the physical library pointed to by the 'data' libref, a new file called formats.sas7bcat (if using SAS versions 7-9, at least).  The following two lines of code inspect what's in that new format catalog:

proc catalog c=data.formats;  contents;  run;
proc format fmtlib library=data;  select $product;  run;

                Contents of Catalog DATA.FORMATS

#  Name    Type    Create Date      Modified Date    Description
----------------------------------------------------------------
1  PRODUCT FORMATC 26Mar12:15:54:52 26Mar12:15:54:52

----------------------------------------------------------------
| FORMAT NAME: $PRODUCT LENGTH:   25   NUMBER OF VALUES:   3   |
| MIN LENGTH: 1  MAX LENGTH: 40  DEFAULT LENGTH  25  FUZZ: 0   |
|--------------------------------------------------------------|
|START     |END       |LABEL  (VER. V7|V8   26MAR2012:15:54:53)|
|----------+----------+----------------------------------------|
|ICE       |ICE       |ICE (ICE CREAM)                         |
|SAND      |SAND      |SAND (ICE CREAM SANDWICH)               |
|SNO       |SNO       |SNO (SNOW CONE)                         |
----------------------------------------------------------------

The proc catalog output tells us there is one format in the catalog named PRODUCT and it's a character format (so it would be referred to in SAS code as $PRODUCT).  The proc format fmtlib output shows us the individual code and description pairs contained in the format.  As you can see, the constructed labels consist of the code values followed by the descriptions in parentheses.

Code to create a format for SIZE_CD is as follows:

proc sql;
    create view for_size_format as
        select size_cd as start,
         strip(put(size_cd,best.))||' ('||strip(description)||')'
                   as label,
               'size' as fmtname
        from data.size_lookup;
quit;

proc format cntlin=for_size_format library=data;  run;

SIZE_CD is numeric, whereas PRODUCT_ID was character, so SIZE_CD is converted to character with the put function using the built-in best format to make the label, and the name of the format is simply SIZE with no prefixing dollar sign.

Running the proc catalog code again (proc catalog c=data.formats;  contents;  run;) indicates that we do in fact now have two formats.

              Contents of Catalog DATA.FORMATS

#  Name    Type    Create Date      Modified Date   Description
----------------------------------------------------------------
1  SIZE    FORMAT  26Mar12:16:14:27 26Mar12:16:14:27
2  PRODUCT FORMATC 26Mar12:15:54:52 26Mar12:15:54:52


Only one step remains and that is to attach the formats to the variables in the SALES dataset.  That is accomplished with proc datasets as follows:

proc datasets lib=data nolist;
    modify sales;
        format product_id $product. size_cd size.;
quit;

If we run our original proc freq at this point, we will either get an error (if the fmterr option is in effect) or no error and simply the same unformatted values in the freq output.  The reason is that although we created the datasets, created the formats, and associated the formats with the dataset variables, SAS doesn't know where to look for the formats when it comes time to do the proc freq.  We have to give it a hint as follows:

options fmtsearch=(data);

The format search system option tells SAS to look for formats in the 'data' library.  Now when we run our proc freq, we get the desired result.

proc freq data=data.sales;
    tables product_id * size_cd;
run;

The FREQ Procedure

Table of product_id by size_cd

product_id        size_cd

Frequency        |
Percent          |
Row Pct          |
Col Pct          |1 (SMALL|2 (REGUL|3 (LARGE|  Total
                 |)       |AR)     |)       |
-----------------+--------+--------+--------+
ICE (ICE CREAM)  |      2 |      2 |      1 |      5
                 |  20.00 |  20.00 |  10.00 |  50.00
                 |  40.00 |  40.00 |  20.00 |
                 |  50.00 |  50.00 |  50.00 |
-----------------+--------+--------+--------+
SAND (ICE CREAM  |      0 |      1 |      0 |      1
SANDWICH)        |   0.00 |  10.00 |   0.00 |  10.00
                 |   0.00 | 100.00 |   0.00 |
                 |   0.00 |  25.00 |   0.00 |
-----------------+--------+--------+--------+
SNO (SNOW CONE)  |      2 |      1 |      1 |      4
                 |  20.00 |  10.00 |  10.00 |  40.00
                 |  50.00 |  25.00 |  25.00 |
                 |  50.00 |  25.00 |  50.00 |
-----------------+--------+--------+--------+
Total                   4        4        2       10
                    40.00    40.00    20.00   100.00

Since we created permanent datasets, permanent formats, and permanently associated the two, we do not have to have a format statement in the proc freq and this association will persist beyond the current SAS session.  So we can happily shut down SAS, start it up again and issue the following lines without error:

libname data 'fill in your own physical path here';
options fmtsearch=(data); *Remind SAS where the formats are;
proc print data=data.sales;  run;

Obs    product_id                     size_cd

  1    ICE (ICE CREAM)              1 (SMALL)
  2    SNO (SNOW CONE)              2 (REGULAR)
  3    ICE (ICE CREAM)              3 (LARGE)
  4    ICE (ICE CREAM)              1 (SMALL)
  5    SNO (SNOW CONE)              1 (SMALL)
  6    SAND (ICE CREAM SANDWICH)    2 (REGULAR)
  7    ICE (ICE CREAM)              2 (REGULAR)
  8    ICE (ICE CREAM)              2 (REGULAR)
  9    SNO (SNOW CONE)              3 (LARGE)
 10    SNO (SNOW CONE)              1 (SMALL)

One final note.  I find it helpful to tailor the description/label portion of the formats to include the actual code value (e.g., 2) and the descriptive value (e.g., SMALL).  The reason is that while the description is the more --um-- descriptive portion of the format label, the underlying SALES dataset still only really contains the coded values.  Applying a format changes how data is displayed, not the stored values.  Thus, to subset the small product transactions, you would use this where statement where size_cd=1 rather than this where statement where size_cd='1 (SMALL)' and having the 1 right in the label saves us having to look that up what "SMALL" means in terms of coded data values.

Monday, March 19, 2012

SAS Proc SQL Pass-through Connection Strings

I work in the health care research department of a managed health care organization in the Southeastern U.S.  SAS is a mainstay in the industry, and this is certainly true for my current work environment.  For the analysts in my group, SAS is the data manipulation and analysis tool.  The primary data repository that I manage exists in the form of SAS datasets.

However, much of the source system data from which the research data warehouse is built lives on other platforms, so I do a lot of ETL'ing from Teradata, Oracle, Sybase, and MSSQL into SAS.  For small stuff, like dimension/lookup tables, using libref access to move data into SAS is acceptable, performance-wise.  For example...

libname o oracle user=&oracle_uid password="&oracle_pwd"
                 path='oracle_path' schema='oracle_schema';
proc sql;

    create table sas_copy_of_table as
        select * from o.table;
quit;


But for larger pulls, especially having a where clause, libref access can be slow since SAS may choose to pull the entire data table over the network before applying the where clause.

(Aside: use the SASTRACE and SASTRACELOC system options to see if SAS is optimizing cross-platform database queries behind the scenes.)

The alternative is to use SAS pass-through queries.  These are queries written in the database's flavor of SQL (instead of the SAS flavor of SQL) that are passed off from SAS to the other database engine.  Only the final where-filtered results are returned to SAS.
 
There are two areas in which I've seen lots of heartache writing and debugging SQL pass-through queries in SAS.  First, getting the connection string just right, given how finicky SAS is, can be painful.  That is the focus of the rest of this post.  The other curse-inducing area is getting the right database-compliant (NOT SAS-compliant!) syntax into the query itself (e.g., employing Oracle--not SAS--functions and operators for date comparisons).  I plan to discuss this issue in a future post.

Without further ado, here are examples for Oracle, Sybase, Teradata, MSSQL (via OLEDB), and MSSQL (via ODBC).  Assume the *_uid and *_pwd macro variables (e.g., oracle_uid and oracle_pwd) contain database credentials and were declared in an include file prior to the code shown.  Each proc sql block creates a new work dataset called sample by selecting the top 10 observations from a database table called dbtable.

proc sql;
    connect to oracle (user=&oracle_uid

                       password="&oracle_pwd"
                       path='oracle_path');
    create table sample as select * from connection to oracle (
        select * from the_schema.dbtable where rownum<11);

quit;

proc sql;
    connect to sybase (user=&sybase_uid

                       password="&sybase_pwd"
                       server=syb_server
                       database=syb_db);
    create table sample as select * from connection to sybase (
        select top 10 * from dbtable);
quit;

proc sql;
    connect to teradata (user=&teradata_uid

                         password="&teradata_pwd"
                         tdpid='teradata_server');
    create table sample as select * from connection to teradata (
        select top 10 * from dbtable);
quit;

proc sql;
    connect to oledb (init_string="User ID=&mssql_uid;Password=&mssql_pwd;Provider=SQLOLEDB.1;Data Source=mssql_server;Initial Catalog=mssql_db");    

    create table sample as select * from connection to oledb (
        select top 10 * from the_schema.dbtable); 

quit;

proc sql; *SQL 2008;
    connect to odbc (noprompt="Uid=&mssql_uid;Pwd=&mssql_pwd;Driver={SQL Server Native Client 10.0};Server=mssql_server;Database=mssql_database");    

    create table sample as select * from connection to odbc (
        select top 10 * from the_schema.dbtable);

quit;

The website connectionstrings.com is a great help in figuring out driver names, etc.  Also, if you are unsure what SAS database engines you have installed, run proc setinit; run; and look in the log for entries like "SAS/ACCESS Interface to DB2," "SAS/ACCESS Interface to ORACLE" and so on.

Restarting blogging activity

I've decided to renew blogging about the technology that I use as a data warehouse manager and application developer.  My prior blog is technoyoga.blogspot.com.  I wanted to start fresh here rather than continue that one since there's been such a big break.

The name "obs-depot" is a sort of synonym for data warehouse.  Much of my work is with SAS and SAS calls records/rows "observations."  Depot is simply a synonym for warehouse or repository.  To state the obvious.  ;)

Cheers,
David