Monday, November 12, 2012

SAS options varlenchk=

How many times have you seen this warning in SAS?

WARNING: Multiple lengths were specified for the variable {VARNAME} by input data
set(s). This may cause truncation of data.

This is a common warning when you're manipulating SAS datasets and resizing variables, for example, to meet data dictionary specifications. In general, you want to know about possible data truncation, but there are occasions when you know what the data look like and you purposefully want to shrink the length of a variable. For example, suppose you get a feed from another database system and the variable in question arrives as char(100). You know the longest value actually present is char(20), so you want to shrink the variable to char(30). The input dataset is called source_data.

data my_data;
    length text_field $30.; *char(100) in source_data dataset;
    set source_data;
run;

By itself, this will generate the warning. However, if you surround the data step with options statements like this, the resizing warning is suppressed and your SAS log stays clear of false warnings.

options varlenchk=nowarn; *temporarily turn off variable length checking since purposefully resizing vars;
data my_data;
    length text_field $30.; *char(100) in source_data dataset;
    set source_data;
run;
options varlenchk=warn; *turn variable length checking back on;
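If you want to verify the assumption about the longest value before shrinking, a quick query does the trick. Here is a minimal sketch using the dataset and variable names from the example above:

proc sql;
    *Report the longest value actually present in the incoming data;
    select max(length(text_field)) as max_observed_length
    from source_data;
quit;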

Wednesday, August 8, 2012

C# Extension Method to Reverse a String

Someone recently asked me to write some quick code to write a reversed string to the console.  What I came up with at the time was this:

string s = "abcdefg";
foreach (char c in s.Reverse())
    Console.Write(c);


The string class appears to have a Reverse method already, but that's actually LINQ's Enumerable.Reverse extension on IEnumerable&lt;char&gt;, and it returns an IEnumerable&lt;char&gt;, not the reversed string.

After the fact, I looked at other ways of doing it and figured out a one-liner:

Console.WriteLine(String.Join("", s.ToCharArray().Reverse<char>()));

Going a step further (since I wanted to experiment a little with extension methods), I wrote this little string class extension method to encapsulate that functionality as follows:

public static class StringExtensions{
    public static string ReverseString(this string s)
    {
        return String.Join("", s.ToCharArray().Reverse<char>());
    }
}


Now I can simply write:

string s = "abcdefg";
Console.WriteLine(s.ReverseString());


The whole console program is as follows:

using System;
using System.Linq;


namespace ConsoleApplication1
{
    public static class StringExtensions
    {
        public static string ReverseString(this string s)
        {
            return String.Join("", s.ToCharArray().Reverse<char>());
        }
    }


    class Program
    {
        static void Main(string[] args)
        {
            string s = "abcdefg";
            Console.WriteLine(s.ReverseString());
            Console.ReadLine();
        }
    }
}


And since ReverseString is an extension method extending the Framework string class, it shows up in IntelliSense as if it were an instance method on the string class itself.  The down arrow next to it in the completion list indicates it's an extension method (among many already defined on string).

Tuesday, July 17, 2012

Date and Datetime Stamps in SAS, Perl, C#, and VB.NET

When running periodic tasks (hourly, daily, monthly, or whatever), it's often useful to date- or datetime-stamp the data files or log files coming out of the process.  Here is the simple code I use to generate date and datetime stamps in the languages I use regularly: SAS, Perl, C#, and VB.NET.

I prefer stamps in yyyymmdd and yyyymmdd_hhmmss formats (yyyy=4-digit year, mm=2-digit month, dd=2-digit day, hh=hour on 24-hour clock, mm=minute, ss=second) since these are readily sortable in chronological order.

SAS

*Create a picture format for making datetime stamps;
proc format;
    picture dtstamp low-high='%Y%0m%d_%0H%0M%0S'
                    (datatype=datetime);
run;

*Create a date and a datetime stamp macro variable;
data _null_;
    *Use built in yyyymmdd format;
    call symput('datestamp',put(date(),yymmddn8.));     
    *Use custom picture format;
    call symput('datetimestamp',compress(put(datetime(),dtstamp.))); 
run;

%put FYI: datestamp=&datestamp;
%put FYI: datetimestamp=&datetimestamp;

*Sample usage;
data daily_extract_&datestamp;
    *blah blah blah;
run;
data hourly_extract_&datetimestamp;
    *blah blah blah;
run;

Log says...

FYI: datestamp=20120717
FYI: datetimestamp=20120717_103230

NOTE: The data set WORK.DAILY_EXTRACT_20120717 has 1 observations and 0 variables.
NOTE: The data set WORK.HOURLY_EXTRACT_20120717_103230 has 1 observations and 0 variables.

Obviously bundling that code up in a couple of macros (%datestamp and %datetimestamp) is the way to go.
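For the record, here is one possible sketch of such macros, written function-style so the stamp can be dropped straight into a dataset name.  The %datetimestamp version assumes the dtstamp picture format created above is available; treat the details as an illustration rather than the only way to do it.

*Function-style macros that return the stamps as text;
%macro datestamp;
    %sysfunc(date(), yymmddn8.)
%mend datestamp;

%macro datetimestamp;
    %sysfunc(compress(%sysfunc(datetime(), dtstamp.)))
%mend datetimestamp;

*Sample usage;
data daily_extract_%datestamp;
    *blah blah blah;
run;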

Perl

use POSIX qw(strftime);
my $datestamp = strftime("%Y%m%d", localtime);
my $datetimestamp = strftime("%Y%m%d_%H%M%S", localtime);
print "FYI: datestamp=$datestamp\n";
print "FYI: datetimestamp=$datetimestamp\n";

Output...

FYI: datestamp=20120717
FYI: datetimestamp=20120717_103230

C#

string datestamp = DateTime.Now.ToString("yyyyMMdd");
string datetimestamp = DateTime.Now.ToString("yyyyMMdd_HHmmss");
Console.WriteLine("FYI: datestamp={0}", datestamp);
Console.WriteLine("FYI: datetimestamp={0}", datetimestamp);

Output...

FYI: datestamp=20120717
FYI: datetimestamp=20120717_103230

VB.NET

Dim datestamp As String = DateTime.Now.ToString("yyyyMMdd")
Dim datetimestamp As String = DateTime.Now.ToString("yyyyMMdd_HHmmss")
Console.WriteLine("FYI: datestamp={0}", datestamp)
Console.WriteLine("FYI: datetimestamp={0}", datetimestamp)

Output...

FYI: datestamp=20120717
FYI: datetimestamp=20120717_103230

Friday, June 1, 2012

Scheduling SAS Program Runs on UNIX

Introduction


Scheduling SAS programs to run on a UNIX system can be accomplished in a number of ways.  I developed a Perl script to manage the process of running SAS code, parsing the SAS log, and emailing the results.  Instead of scheduling the SAS executable to run directly, I schedule the Perl script and it takes care of the rest.  Specifically, it does the following:
  1. It checks to make sure that the SAS .sas program file that you want to run actually exists,
  2. It runs the SAS program in batch mode and creates datetime-stamped .log and .lst files in the same directory as the .sas file,
  3. It checks the SAS log file for errors, warnings, uninitialized variable messages, and to see if the program ran all the way to the end, and
  4. It sends an email reporting what happened, along with attached copies of the .log file and the .lst files.

 The three ingredients of the solution are: 
  1. A .sas program (tweaked very slightly),
  2. The Perl script called runsas.run, and
  3. The crontab UNIX scheduling utility. 

 

Ingredient #1: The .sas Program File


Assuming you have a UNIX .sas program file in hand, add the following line to the very end of it.

    %put FINISHED;

This line writes the text FINISHED to the log.  The log parser (discussed later), which looks for errors and warnings, also looks for this text in the .log file to ensure that SAS ran the program all the way to the end and didn't abort partway through.  I also highly recommend using the ERRORABEND system option so your program will abort as soon as it hits an error.  This can save lots of time on program reruns, especially during development.

For illustration purposes, I'll use this trivial program:

*make a dataset that counts from 1 to 10;
data test;
    do x = 1 to 10;
        output;
    end;
run;

With the two suggested edits, it looks like this:

options errorabend;


*make a dataset that counts from 1 to 10;
data test;
    do x = 1 to 10;
        output;
    end;
run;


%put FINISHED;

 

Ingredient #2: runsas.run


I wrote the runsas.run Perl script to coordinate the SAS program execution, log parsing, and emailing process.  The full text of the script is at the end of this post.  I'm going to assume that you have the code of runsas.run (copied from the end of this post) saved in a text file named runsas.run stored in your home directory (i.e., ~/runsas.run) and that you have granted execution rights on that file by running chmod +x ~/runsas.run.

The runsas.run command is how we will execute a scheduled SAS program instead of invoking SAS directly.  When calling runsas.run, you provide it with two pieces of information:
  1. The fully specified name of the program (including which directory it's in -- and P.S. please do NOT have spaces in the directory or file names!) and
  2. your email address (or comma-delimited list of email addresses -- with no spaces).

Here's what runsas.run does: 
  1. It ensures that the program exists,
  2. It runs the SAS program in batch mode and creates datetime stamped .log and .lst files in the same directory as the .sas file,
  3. It checks the SAS log file for errors, warnings, uninitialized variable messages, and to see if the program ran all the way to the end, and
  4. It sends an email to the provided email address(es) indicating how the program ran, along with attached copies of the .log file and the .lst file (if a .lst file was produced by SAS).
To use the runsas.run command manually, which is recommended when first setting up a new scheduled program, log onto UNIX and navigate to your home directory (or wherever you put runsas.run).  Issue the following command (substituting in your program and email):

runsas.run /home/programs/test.sas davide@sample.com

This assumes your program, called test.sas, is in the /home/programs directory.  That command runs the SAS program immediately (via the Perl script), and you will either get an error message if you entered something incorrectly or receive an email when the SAS program finishes.  Do not run lengthy programs manually like this, because runsas.run may time out, in which case you will not get an email when the program finishes.  To send the notification email to multiple recipients, provide a comma-delimited list of email addresses (with no spaces), like: homer.simpson@simpsons.org,bart.simpson@simpsons.org.

 

Ingredient #3: crontab


Crontab is the built-in UNIX job scheduling utility.  There are numerous websites devoted to it such as  http://crontab.org/ and http://www.adminschoice.com/crontab-quick-reference.  Do a web search for "crontab reference" to find others if you wish to learn more beyond what will be explained here.

Scheduled cron jobs are user-specific, so your jobs will not collide with the jobs of other users, and your jobs will run in your security context as if you had logged in and run them manually.

I recommend maintaining your cron schedule in a text file on UNIX.  For the ongoing example in this post, I'll use the file ~/crontab_jobs.txt as the cron schedule file.  That is, the filename is crontab_jobs.txt and it is located in your UNIX home directory.  The content of the file is somewhat hard to explain, so be patient if it takes a while to sink in.  Each line in the file represents a separate job and has information that tells UNIX when to run and what to run.  The when information is broken into the following components: minute of the hour (0-59), hour of the day (0-23), day of the month (1-31), month of the year (1-12), and day of week (0-6, where 0=Sunday).  You enter numbers or asterisks to specify the schedule.  Happily there is a website to help figure out the numbers and asterisks part: http://www.corntab.com/pages/crontab-gui.  Use this site to get help writing the when information.

The what information is simply the runsas.run command line (with the directory of runsas.run specified) including the SAS program name and your email address.

Let's say I want to run the sample program once a week on Sundays at 2am.  The website http://www.corntab.com/pages/crontab-gui tells me this:

0 2 * * 0 /usr/sbin/update-motd

I will replace the /usr/sbin/update-motd part of that with the runsas.run command, so let's focus on the first 5 pieces: 0 (minute 0), 2 (hour 2, i.e., 2:00 AM), * (every day of the month), * (every month), and 0 (Sunday).

Taken together, this translates to running the command every Sunday of every month of the year at 2:00 AM.  Substituting in the correct runsas.run command line, I end up with the following one line (no line breaks) in my text file (beware of line wraps in this post):

0 2 * * 0  ~/runsas.run /home/programs/test.sas davide@sample.com > /dev/null

The > /dev/null bit of code at the end prevents the scheduler from sending an email to your UNIX email account every time the job is run.

At last it's time to actually schedule the job to run. Log onto UNIX and go to your home directory (or wherever the crontab_jobs.txt file is located).  Issue the following UNIX command to have UNIX read your text file into the crontab scheduling software:

crontab crontab_jobs.txt

Whatever was scheduled before is now replaced by the jobs listed in the text file. Run the command:

crontab -l

to see what jobs are currently scheduled.  Remember: the schedule saved in your text file does not take effect until you execute the crontab crontab_jobs.txt command.  Your SAS program is now scheduled to run.  You don't have to be logged in for it to work.

 

Getting Results


After runsas.run runs, you should get an email.  The subject line indicates whether the program had warnings, errors, etc., and the body of the email summarizes that information.  Any errors, etc. that occur are included in the email, and the .log file and .lst file (if there is one) are attached.  Warnings considered to be "false alarms" are listed in the "Ignored error/warning lines" section of the email.  The original .log and .lst files remain on UNIX in the SAS program directory.  The datetime stamp (with time measured on a 24-hour clock) prevents reruns from overwriting prior .log and .lst files.  If your program works but you forgot to put in the %put FINISHED; line at the end, the email will say so as well.  And if your program runs with errors, warnings, and/or uninitialized variable notes, the email will tell you what they were.

 

Full Text of runsas.run 

#!/usr/bin/perl
use strict;
use warnings;
use lib "~/perl/lib"; #Custom add-on libs (contains MIME::Lite)
use MIME::Lite; #Email module found in custom add-on lib/MIME

#See http://search.cpan.org/~rjbs/MIME-Lite-3.028/lib/MIME/Lite.pm

use POSIX qw(strftime); #POSIX strftime format returns time formatted as a string
####################################################################################################
# Purpose: this perl script requires two command line parameters: 1) the full path of the .sas
#          program file that should be executed and 2) the email address(es) to notify.  It runs
#          the SAS program in batch mode and a notification email is sent with feedback on how the
#          SAS program performed once the program has finished.
#####################################################################################################

my $num_args = $#ARGV + 1; #Get # command line parameters into a var

if ($num_args == 2) { #Make sure that 2 command line arguments provided.

    my $start_time = time; #Record the current time (time this process started)
    my $datetimestamp = strftime "%Y%m%d_%H%M%S", localtime; #Create datetime stamp string

    #The first command line argument is the name of the SAS program to be executed
    my $sas_prog_file = $ARGV[0]; #Put first command line arg into a variable
    chomp($sas_prog_file); #Remove trailing newline
    unless (-e $sas_prog_file) { die "SAS program ($sas_prog_file) does not exist!"; } #Make sure program exists
    $sas_prog_file =~ /^(.+\/)(\S+)(\.sas)$/i; #Match with 3-part regex: 1=path, 2=program name (w/o extension), 3=filename extension
    my $prog_path = $1; #First part is the program path
    my $prog_title = $2; #Second part is the program file title, last part is filename extension which isn't used

    #The second command line argument is the email address(es) to whom notification should be sent
    my $email_recipient = $ARGV[1]; #Put second command line arg into a local variable
    chomp($email_recipient); #Remove trailing newline

    #Start a string that will contain notes about how the process went (process log) and will get
    #sent as the body of the notification email
    my $notes = sprintf "Start time: %s\n", scalar(localtime($start_time)); #Put start time first in the notes

    #Derive names of datetime-stamped SAS log and lst files from the SAS program file path and title
    my $sas_log_file = "$prog_path$prog_title.$datetimestamp.log";
    #chomp($sas_log_file); #Necessary for some reason to remove newline character
    my $sas_lst_file = "$prog_path$prog_title.$datetimestamp.lst";
    #chomp($sas_lst_file); #Necessary for some reason to remove newline character

    my $sas_exe = "/bin/sas/sas"; #Physical location of the SAS execution script/file (customize for your box)

    my $command_line = "$sas_exe -RSASUSER -noterminal -sysin $sas_prog_file -log $sas_log_file -print $sas_lst_file"; #Complete batch SAS command

    #Shell out and run sas synchronously with the nohup (no hang up) command and > /dev/null to suppress stdout feedback on the command line
    system("nohup $command_line > /dev/null");

    #Now that SAS finished running, check the results by parsing the SAS log file.
    #Error, warning, uninitialized variables, ignored errors/warnings, and presence of finish flag will be tracked
    my @error_lines = (); #Array of SAS log error lines
    my @warning_lines = (); #Array of SAS log warning lines
    my @uninit_lines = (); #Array of SAS log uninitialized variables lines
    my @ignored_lines = (); #Array of SAS log error/warning lines that are being ignored because they don't constitute "real" problems
    my $finished_flag = 0; #Did the program run until the end (where there is a %put FINISHED SAS statement)?

    #Loop through all lines of the SAS log looking for errors, warnings, and so on...
    my $line_counter = 0;
    open (LOGFILE, $sas_log_file) or die $!;
    while (my $line = <LOGFILE>) { #Loop through every line of the SAS log file
        $line_counter = $line_counter + 1;

        #First check if line is a warning that can be ignored
        if (  $line =~ /^WARNING: Unable to copy SASUSER registry to WORK registry.*$/
            || $line =~ /^WARNING: No preassigned object definitions were found.*$/
            || $line =~ /^WARNING: In-database formatting is not available on the database.*$/
            || $line =~ /^WARNING: Data too long for column.*$/
            || $line =~ /^WARNING: The current setting of the DIRECT_EXE libname option will not allow this SQL statement.*$/
        ) {
            chomp($line);
            push @ignored_lines, "[$line_counter] $line";
        }
        #Check if a real error
        elsif ($line =~ /^ERROR/) {
            chomp($line);
            push @error_lines, "[$line_counter] $line";
        }
        #Check if a warning
        elsif ($line =~ /^WARNING/) {
            chomp($line);
            push @warning_lines, "[$line_counter] $line";
        }
        #Check if an uninitialized variable note
        elsif ($line =~ /uninitialized/) {
            chomp($line);
            push @uninit_lines, "[$line_counter] $line";
        }
        #Check if it is the finished flag
        elsif ($line =~ /FINISHED/) {
            $finished_flag = 1;
        }
    }
    close (LOGFILE); #Finished parsing the SAS log file

    #Add error lines to the notes, if any
    my $i = 0;
    my $tempcount = $#error_lines + 1;
    $notes = $notes . "\n# Error lines: $tempcount\n";
    if ($#error_lines >= 0) {
        for($i=0; $i <= $#error_lines; ++$i) {
            $notes = $notes . $error_lines[$i] . "\n";
        }
    }

    #Add warning lines to the notes, if any
    $tempcount = $#warning_lines + 1;
    $notes = $notes . "\n# Warning lines: $tempcount\n";
    if ($#warning_lines >= 0) {
        for($i=0; $i <= $#warning_lines; ++$i) {
            $notes = $notes . $warning_lines[$i] . "\n";
        }
    }

    #Add uninitialized variables lines to the notes, if any
    $tempcount = $#uninit_lines + 1;
    $notes = $notes . "\n# Uninitialized lines: $tempcount\n";
    if ($#uninit_lines >= 0) {
        for($i=0; $i <= $#uninit_lines; ++$i) {
            $notes = $notes . $uninit_lines[$i] . "\n";
        }
    }

    #Add ignored lines to the notes, if any
    $tempcount = $#ignored_lines + 1;
    $notes = $notes . "\n# Ignored error/warning lines: $tempcount\n";
    if ($#ignored_lines >= 0) {
        for($i=0; $i <= $#ignored_lines; ++$i) {
        $notes = $notes . $ignored_lines[$i] . "\n";
        }
    }

    #Add "finished flag" status to the notes if no errors, warnings, uninits and make email subject line.
    #Starting the email subject line with [SCHEDULED SAS PROGRAM] makes it easy to see these in the inbox.
    my $email_subject_line = "[SCHEDULED SAS PROGRAM] $prog_title ";
    if ($#error_lines == -1 && $#warning_lines == -1 && $#uninit_lines == -1) {
        if ($finished_flag == 1) {
            #All appears to have gone well
            $notes = $notes . "\nJob finished successfully!\n";
            $email_subject_line = $email_subject_line . "finished successfully";
        }
        else {
            #Something weird happened, or the %put FINISHED line is missing
            $notes = $notes . "\nNo errors, warnings, or uninitialized variables, but job did NOT finish!\n";
            $email_subject_line = $email_subject_line . "did NOT finish";
        }
    }
    else {
        if ($#error_lines == -1 && $#uninit_lines == -1 && $#warning_lines >= 0) {
            $email_subject_line = $email_subject_line . "finished with warnings";
        }
        elsif ($#error_lines == -1 && $#uninit_lines >= 0 && $#warning_lines == -1) {
            $email_subject_line = $email_subject_line . "finished with uninitialized variables";
        }
        else {
            $email_subject_line = $email_subject_line . "finished with errors";
        }
    }

    my $end_time = time; #Capture end of process time

    #Calculate human-readable elapsed time string
    my $elapsed_time = $end_time - $start_time; #In seconds
    my $hours = int($elapsed_time / 60 / 60);
    my $minutes = int(($elapsed_time-$hours*3600) / 60);
    my $seconds = $elapsed_time - $hours*3600 - $minutes*60;
    my $human_readable_elapsed_time = sprintf '%dh:%02dm:%02ds', $hours, $minutes, $seconds;

    $notes = $notes . sprintf "\nEnd time: %s\n", scalar(localtime($end_time));
    $notes = $notes . "Elapsed time: $human_readable_elapsed_time\n\n";

    #Send notification email

    #Create a new email
    my $msg = MIME::Lite->new(
        From     => "$email_recipient",
        To       => "$email_recipient",
        Subject  => "$email_subject_line",
        Type     => 'text/plain',
        Encoding => '8bit',
        Data     => "$notes"
    );

    #Attach the log file if it exists
    if (-e $sas_log_file) {
        $msg->attach(
            Type     => 'application/octet-stream',
            Encoding => 'base64',
            Path     => "$sas_log_file",
            Filename => "$prog_title.$datetimestamp.log"
        );
    }

    #Attach the lst file if it exists
    if (-e $sas_lst_file) {
        $msg->attach(
            Type     => 'application/octet-stream',
            Encoding => 'base64',
            Path     => "$sas_lst_file",
            Filename => "$prog_title.$datetimestamp.lst"
        );
    }

    #Send the email
    $msg->send;

    #End of process
}
else {
    #Program was not called with the expected 2 command line parameters
    die "2 command line arguments (1. SAS program, 2. email) expected and not found.\n";
}

Monday, April 23, 2012

Custom date formats in SAS

Let's suppose you have a dataset with a range of dates spread out over a 10-year period.  This data step generates a fake dataset with 1000 obs spread out over 01JAN2000 to near the end of 2009 (a spread of 3,650 days).

data fakedata;
    format date mmddyy10.;
    do i = 1 to 1000;
        date = int(365 * 10 * ranuni(12345)) + '01jan2000'd;
        output;
    end;  drop i;
run;

To view the distribution of values, you would likely do a proc freq like this:

proc freq data=fakedata;
    tables date;
    format date year4.;
run;

                                 Cumulative    Cumulative
DATE    Frequency     Percent     Frequency      Percent
---------------------------------------------------------
2000          86        8.60            86         8.60
2001          89        8.90           175        17.50
2002          83        8.30           258        25.80
2003         103       10.30           361        36.10
2004         118       11.80           479        47.90
2005         102       10.20           581        58.10
2006         115       11.50           696        69.60
2007          94        9.40           790        79.00
2008         108       10.80           898        89.80
2009         102       10.20          1000       100.00

But what if you want a more detailed breakdown of more recent dates?  Let's get fancy and say you are fine with yearly counts for 2000-2005, but you want to see quarterly counts for 2006-2008 and monthly counts for 2009.  Without futzing with the data itself (e.g., creating a new string variable with formatted values of the DATE variable), you can get the job done easily by creating a custom format and applying it in a proc freq statement.  The proc format statement looks like this:

proc format;
    value mydtfmt low          - '31dec2005'd = [year4]
                  '01jan2006'd - '31dec2008'd = [yyq6]
                  '01jan2009'd - high         = [yymms7];
run;

This is saying:
  • for dates on or before 12/31/05, format as 4-digit years (e.g., "2003"), 
  • for dates from 1/1/06 to 12/31/08 format as 4-digit years plus quarter number separated by a letter Q (e.g., "2007Q3"), and 
  • for dates on or after 1/1/09 format as 4-digit years plus 2-digit month separated by a slash (e.g., "2009/07").  
A proc freq statement that uses this format (plus its output) looks like this:

proc freq data=fakedata;
    tables date;
    format date mydtfmt.;
 run;

                                    Cumulative    Cumulative
   DATE    Frequency     Percent     Frequency      Percent
------------------------------------------------------------
2000             86        8.60            86         8.60
2001             89        8.90           175        17.50
2002             83        8.30           258        25.80
2003            103       10.30           361        36.10
2004            118       11.80           479        47.90
2005            102       10.20           581        58.10
2006Q1           24        2.40           605        60.50
2006Q2           34        3.40           639        63.90
2006Q3           21        2.10           660        66.00
2006Q4           36        3.60           696        69.60
2007Q1           21        2.10           717        71.70
2007Q2           24        2.40           741        74.10
2007Q3           18        1.80           759        75.90
2007Q4           31        3.10           790        79.00
2008Q1           28        2.80           818        81.80
2008Q2           40        4.00           858        85.80
2008Q3           26        2.60           884        88.40
2008Q4           14        1.40           898        89.80
2009/01           6        0.60           904        90.40
2009/02          10        1.00           914        91.40
2009/03           5        0.50           919        91.90
2009/04          12        1.20           931        93.10
2009/05          11        1.10           942        94.20
2009/06          11        1.10           953        95.30
2009/07           2        0.20           955        95.50
2009/08           8        0.80           963        96.30
2009/09           8        0.80           971        97.10
2009/10           5        0.50           976        97.60
2009/11          12        1.20           988        98.80
2009/12          12        1.20          1000       100.00

Tuesday, March 27, 2012

Making coded values in SAS more descriptive with PROC FORMAT

Very often, when dealing with SAS datasets, we are faced with procedure output that contains non-descriptive codes.  When we, for example, proc freq these coded variables, it would be more useful to see helpful human-intelligible descriptions instead of just the raw codes.  Sometimes we are supplied with lookup/dimension tables that contain code descriptions/definitions and sometimes we have to create these ourselves.  In any case, once we have lookup tables for the codes, it can be a good idea to make SAS formats from them and attach the formats to the appropriate variables in the detail-level dataset.

This post will show how to do that using a very simple made-up set of three tables.

The detail table SALES contains one record per sale.  The variables PRODUCT_ID and SIZE_CD indicate what product and what size of product was sold; ten sample observations appear in the code below.  The 'data' library also contains a pair of lookup tables to translate PRODUCT_ID and SIZE_CD into descriptions.  The lookup tables are called PRODUCT_LOOKUP and SIZE_LOOKUP.

Here is SAS code to create the three tables.

libname data 'fill in your own physical path here';

data data.sales;
    length product_id $5. size_cd 3.;
    input product_id $ size_cd;
    datalines;
ICE 1
SNO 2
ICE 3
ICE 1
SNO 1
SAND 2
ICE 2
ICE 2
SNO 3
SNO 1
;
run;

data data.product_lookup;
    length product_id $5. description $20.;
    input product_id $ 1-5 description $ 6-25;
    datalines;
ICE  ICE CREAM
SNO  SNOW CONE
SAND ICE CREAM SANDWICH
;
run;

data data.size_lookup;
    length size_cd 3. description $20.;
    input size_cd description $;
    datalines;
1 SMALL
2 REGULAR
3 LARGE
;
run;

If we run a proc freq to crosstab products with sizes, we get this (first code, then output):

proc freq data=data.sales;
    tables product_id * size_cd;
run;

The FREQ Procedure

Table of product_id by size_cd

product_id     size_cd

Frequency|
Percent  |
Row Pct  |
Col Pct  |       1|       2|       3|  Total
---------+--------+--------+--------+
ICE      |      2 |      2 |      1 |      5
         |  20.00 |  20.00 |  10.00 |  50.00
         |  40.00 |  40.00 |  20.00 |
         |  50.00 |  50.00 |  50.00 |
---------+--------+--------+--------+
SAND     |      0 |      1 |      0 |      1
         |   0.00 |  10.00 |   0.00 |  10.00
         |   0.00 | 100.00 |   0.00 |
         |   0.00 |  25.00 |   0.00 |
---------+--------+--------+--------+
SNO      |      2 |      1 |      1 |      4
         |  20.00 |  10.00 |  10.00 |  40.00
         |  50.00 |  25.00 |  25.00 |
         |  50.00 |  25.00 |  50.00 |
---------+--------+--------+--------+
Total           4        4        2       10
            40.00    40.00    20.00   100.00

While we are actively working on the sales data project, we might easily retain in our memory the fact that size 1 is small, size 2 is regular, and size 3 is large, but what if we set aside this project for 6 months or longer and then come back to it?  What if we had hundreds of products and dozens of size options?  What if we need to hand this project off to a new analyst or present findings to other staff not familiar with the data?  It would make all of our lives easier if the proc freq output indicated what sizes 1, 2, and 3 mean, as well as what the product codes mean.  We have the data already to provide helpful descriptions (the two lookup tables), so let's work at incorporating them.

Using proc format with the cntlin parameter gives us the ability to create a format from a dataset.  In this case, the dataset needs only three fields:
  1. START - the code value,
  2. LABEL - the description associated with the code, and
  3. FMTNAME - the name of the format to be created (same value repeated every record).
To avoid needlessly creating a copy of the lookup data, I use proc sql views to create the proc format input dataset.  Views work just as well as datasets in this context; proc format does not care.  A view is simply a stored query.  Only the query definition is stored on disk, not an entire separate copy of the data.

Here's the query to define the proc format input data view for the product format:

proc sql;
    create view for_product_format as
        select product_id as start,
               strip(product_id)||' ('||strip(description)||')'
                   as label,
               '$product' as fmtname
        from data.product_lookup;
quit;

A proc print of the view (proc print data=for_product_format;  run;) looks like this:

     Obs    start    label                        fmtname

       1    ICE      ICE (ICE CREAM)              $product
       2    SNO      SNO (SNOW CONE)              $product
       3    SAND     SAND (ICE CREAM SANDWICH)    $product

Turning this data view into a format is as simple as:

proc format cntlin=for_product_format library=data;  run;

In response to running this code, the SAS log says:

NOTE: Format $PRODUCT has been written to DATA.FORMATS.

In the physical library pointed to by the 'data' libref, a new file called formats.sas7bcat appears (if you are using SAS version 7-9, at least).  The following two lines of code inspect what's in that new format catalog:

proc catalog c=data.formats;  contents;  run;
proc format fmtlib library=data;  select $product;  run;

                Contents of Catalog DATA.FORMATS

#  Name    Type    Create Date      Modified Date    Description
----------------------------------------------------------------
1  PRODUCT FORMATC 26Mar12:15:54:52 26Mar12:15:54:52

----------------------------------------------------------------
| FORMAT NAME: $PRODUCT LENGTH:   25   NUMBER OF VALUES:   3   |
| MIN LENGTH: 1  MAX LENGTH: 40  DEFAULT LENGTH  25  FUZZ: 0   |
|--------------------------------------------------------------|
|START     |END       |LABEL  (VER. V7|V8   26MAR2012:15:54:53)|
|----------+----------+----------------------------------------|
|ICE       |ICE       |ICE (ICE CREAM)                         |
|SAND      |SAND      |SAND (ICE CREAM SANDWICH)               |
|SNO       |SNO       |SNO (SNOW CONE)                         |
----------------------------------------------------------------

The proc catalog output tells us there is one format in the catalog named PRODUCT and it's a character format (so it would be referred to in SAS code as $PRODUCT).  The proc format fmtlib output shows us the individual code and description pairs contained in the format.  As you can see, the constructed labels consist of the code values followed by the descriptions in parentheses.

Code to create a format for SIZE_CD is as follows:

proc sql;
    create view for_size_format as
        select size_cd as start,
         strip(put(size_cd,best.))||' ('||strip(description)||')'
                   as label,
               'size' as fmtname
        from data.size_lookup;
quit;

proc format cntlin=for_size_format library=data;  run;

SIZE_CD is numeric, whereas PRODUCT_ID was character, so SIZE_CD is converted to character with the put function using the built-in best format to make the label, and the name of the format is simply SIZE with no prefixing dollar sign.

Running the proc catalog code again (proc catalog c=data.formats;  contents;  run;) indicates that we do in fact now have two formats.

              Contents of Catalog DATA.FORMATS

#  Name    Type    Create Date      Modified Date   Description
----------------------------------------------------------------
1  SIZE    FORMAT  26Mar12:16:14:27 26Mar12:16:14:27
2  PRODUCT FORMATC 26Mar12:15:54:52 26Mar12:15:54:52


Only one step remains and that is to attach the formats to the variables in the SALES dataset.  That is accomplished with proc datasets as follows:

proc datasets lib=data nolist;
    modify sales;
        format product_id $product. size_cd size.;
quit;

If we run our original proc freq at this point, we will either get an error (if the fmterr option is in effect) or no error and simply the same unformatted values in the freq output.  The reason is that although we created the datasets, created the formats, and associated the formats with the dataset variables, SAS doesn't know where to look for the formats when it comes time to do the proc freq.  We have to give it a hint as follows:

options fmtsearch=(data);

The format search system option tells SAS to look for formats in the 'data' library.  Now when we run our proc freq, we get the desired result.

proc freq data=data.sales;
    tables product_id * size_cd;
run;

The FREQ Procedure

Table of product_id by size_cd

product_id        size_cd

Frequency        |
Percent          |
Row Pct          |
Col Pct          |1 (SMALL|2 (REGUL|3 (LARGE|  Total
                 |)       |AR)     |)       |
-----------------+--------+--------+--------+
ICE (ICE CREAM)  |      2 |      2 |      1 |      5
                 |  20.00 |  20.00 |  10.00 |  50.00
                 |  40.00 |  40.00 |  20.00 |
                 |  50.00 |  50.00 |  50.00 |
-----------------+--------+--------+--------+
SAND (ICE CREAM  |      0 |      1 |      0 |      1
SANDWICH)        |   0.00 |  10.00 |   0.00 |  10.00
                 |   0.00 | 100.00 |   0.00 |
                 |   0.00 |  25.00 |   0.00 |
-----------------+--------+--------+--------+
SNO (SNOW CONE)  |      2 |      1 |      1 |      4
                 |  20.00 |  10.00 |  10.00 |  40.00
                 |  50.00 |  25.00 |  25.00 |
                 |  50.00 |  25.00 |  50.00 |
-----------------+--------+--------+--------+
Total                   4        4        2       10
                    40.00    40.00    20.00   100.00

Since we created permanent datasets, permanent formats, and permanently associated the two, we do not have to have a format statement in the proc freq and this association will persist beyond the current SAS session.  So we can happily shut down SAS, start it up again and issue the following lines without error:

libname data 'fill in your own physical path here';
options fmtsearch=(data); *Remind SAS where the formats are;
proc print data=data.sales;  run;

Obs    product_id                     size_cd

  1    ICE (ICE CREAM)              1 (SMALL)
  2    SNO (SNOW CONE)              2 (REGULAR)
  3    ICE (ICE CREAM)              3 (LARGE)
  4    ICE (ICE CREAM)              1 (SMALL)
  5    SNO (SNOW CONE)              1 (SMALL)
  6    SAND (ICE CREAM SANDWICH)    2 (REGULAR)
  7    ICE (ICE CREAM)              2 (REGULAR)
  8    ICE (ICE CREAM)              2 (REGULAR)
  9    SNO (SNOW CONE)              3 (LARGE)
 10    SNO (SNOW CONE)              1 (SMALL)

One final note.  I find it helpful to tailor the description/label portion of the formats to include both the actual code value (e.g., 1) and the descriptive value (e.g., SMALL).  The reason is that while the description is the more --um-- descriptive portion of the format label, the underlying SALES dataset still only contains the coded values.  Applying a format changes how data is displayed, not the stored values.  Thus, to subset the small-size transactions, you would use the where statement where size_cd=1 rather than where size_cd='1 (SMALL)', and having the 1 right in the label saves us from having to look up what "SMALL" means in terms of coded data values.
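For example, to pull just the small-size sales you filter on the stored code, not on the formatted label.  A small sketch using the datasets above:

data small_sales;
    set data.sales;
    where size_cd = 1;  *stored value, not the displayed '1 (SMALL)' text;
run;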

Monday, March 19, 2012

SAS Proc SQL Pass-through Connection Strings

I work in the health care research department of a managed health care organization in the Southeastern U.S.  SAS is a mainstay in the industry, and this is certainly true for my current work environment.  For the analysts in my group, SAS is the data manipulation and analysis tool.  The primary data repository that I manage exists in the form of SAS datasets.

However, much of the source system data from which the research data warehouse is built lives on other platforms, so I do a lot of ETL'ing from Teradata, Oracle, Sybase, and MSSQL into SAS.  For small stuff, like dimension/lookup tables, using libref access to move data into SAS is acceptable, performance-wise.  For example...

libname o oracle user=&oracle_uid password="&oracle_pwd"
                 path='oracle_path' schema='oracle_schema';
proc sql;
    create table sas_copy_of_table as
        select * from o.table;
quit;


But for larger pulls, especially those with a where clause, libref access can be slow, since SAS may pull the entire table over the network before applying the where clause.

(Aside: use the SASTRACE and SASTRACELOC system options to see if SAS is optimizing cross-platform database queries behind the scenes.)
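For reference, a common way to turn that tracing on looks like the following; the exact trace string to use for your engine is described in the SAS/ACCESS documentation, so treat this as a sketch.

*Write the SQL that SAS generates and passes to the database to the SAS log;
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

*...run the libname-based query of interest here, then return sastrace to its default...;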

The alternative is to use SAS pass-through queries.  These are queries written in the database's flavor of SQL (instead of the SAS flavor of SQL) that are passed off from SAS to the other database engine.  Only the final where-filtered results are returned to SAS.
 
There are two areas in which I've seen lots of heartache writing and debugging SQL pass-through queries in SAS.  First, getting the connection string just right, given how finicky SAS is, can be painful.  That is the focus of the rest of this post.  The other curse-inducing area is getting the right database-compliant (NOT SAS-compliant!) syntax into the query itself (e.g., employing Oracle--not SAS--functions and operators for date comparisons).  I plan to discuss this issue in a future post.

Without further ado, here are examples for Oracle, Sybase, Teradata, MSSQL (via OLEDB), and MSSQL (via ODBC).  Assume the *_uid and *_pwd macro variables (e.g., oracle_uid and oracle_pwd) contain database credentials and were declared in an include file prior to the code shown.  Each proc sql block creates a new work dataset called sample by selecting the top 10 observations from a database table called dbtable.

proc sql;
    connect to oracle (user=&oracle_uid
                       password="&oracle_pwd"
                       path='oracle_path');
    create table sample as select * from connection to oracle (
        select * from the_schema.dbtable where rownum<11);
quit;

proc sql;
    connect to sybase (user=&sybase_uid
                       password="&sybase_pwd"
                       server=syb_server
                       database=syb_db);
    create table sample as select * from connection to sybase (
        select top 10 * from dbtable);
quit;

proc sql;
    connect to teradata (user=&teradata_uid
                         password="&teradata_pwd"
                         tdpid='teradata_server');
    create table sample as select * from connection to teradata (
        select top 10 * from dbtable);
quit;

proc sql;
    connect to oledb (init_string="User ID=&mssql_uid;Password=&mssql_pwd;Provider=SQLOLEDB.1;Data Source=mssql_server;Initial Catalog=mssql_db");
    create table sample as select * from connection to oledb (
        select top 10 * from the_schema.dbtable);
quit;

proc sql; *SQL 2008;
    connect to odbc (noprompt="Uid=&mssql_uid;Pwd=&mssql_pwd;Driver={SQL Server Native Client 10.0};Server=mssql_server;Database=mssql_database");
    create table sample as select * from connection to odbc (
        select top 10 * from the_schema.dbtable);
quit;

The website connectionstrings.com is a great help in figuring out driver names, etc.  Also, if you are unsure what SAS database engines you have installed, run proc setinit; run; and look in the log for entries like "SAS/ACCESS Interface to DB2," "SAS/ACCESS Interface to ORACLE" and so on.

Restarting blogging activity

I've decided to renew blogging about the technology that I use as a data warehouse manager and application developer.  My prior blog is technoyoga.blogspot.com.  I wanted to start fresh here rather than continue that one since there's been such a big break.

The name "obs-depot" is a sort of synonym for data warehouse.  Much of my work is with SAS and SAS calls records/rows "observations."  Depot is simply a synonym for warehouse or repository.  To state the obvious.  ;)

Cheers,
David