Obs Depot: 2013

Friday, July 26, 2013

Avoid hard-coding database passwords in your SAS code

It's pretty common to connect to other databases within SAS code. In my current job, I connect to Oracle, Sybase, Teradata, and MS SQL databases on a regular basis. The credentials (userid + password) are different on each system and I have to pass them as part of a LIBNAME statement or a PROC SQL connect statement (before pass-through querying) to use those databases.

There are a couple of things you can do to secure your credentials so they're not hard-coded in your SAS files:

At the very least, encrypt your password with PROC PWENCODE. Others can still use your credentials in SAS code if they happen to come across them, but they won't actually be able to see what your password actually is.
Store your credentials as macro variables in a separate file that lives in a personal directory that nobody else can access and %INCLUDE that file in your main program. Now nobody can see your password but you.

To encrypt a password ("mysecret") using PROC PWENCODE, run this simple statement:

PROC PWENCODE IN="mysecret"; run;

This outputs the encrypted version of the password to the log as follows.

{sas002}75F37A290F99C066181D56E908E72A6239A5E508

Copy that to the clipboard. Next create a simple text file with a .sas extension on a personal drive, let's call it the X drive. In this .sas file, there will be two macro variable declarations, one for the database userid and one for the database password. For example, the file X:\sas\credentials\oracle_database_credentials.sas file contains the following two lines:

%let oracle_uid=bubba;
%let oracle_pwd={sas002}75F37A290F99C066181D56E908E72A6239A5E508;

I only have to do the PWENDCODE and oracle_database_credentials.sas steps once, and then again each time the password changes.

Each program that I subsequently write to access the Oracle database will have code that looks like the following:

%include 'X:\sas\credentials\oracle_database_credentials.sas';

proc sql;
    connect to oracle (user=&oracle_uid password="&oracle_pwd" path='oracleserver');
    create table sample as
        select * from connection to oracle (
            select * from stuff.sampledata
        );
    disconnect from oracle;
quit;

With the SYMBOLGEN option turned on, the value of oracle_pwd will be echoed to the SAS log, so it's a good idea to do the PWENCODE step in addition to the X-drive %include step. You don't want to go through the trouble of X-drive protecting your password only to have it appear in .log files for all to see.

Wednesday, July 24, 2013

Binary search aha! moment

Although I've used database table indexes for eons, and understand how and when to use them, I recently had a real computer science-y aha! moment that gave me a new appreciation for indexes.

I was re-reading Kochan's Programming in C, and wrote a little program to calculate binary search worst scenario step counts for collections of various sizes. So, for instance, given a collection of N sorted linked list items, empirically calculate the maximum number of steps needed to search the list to find an arbitrarily selected item.

I was floored when I saw that binary searching can find an item in a collection as large as one million items in 21 steps or less! I have a newfound appreciation for how much more desirable index seeks are in comparison to (sequential) table scans!

In computer science lingo, the worst case binary search has a time complexity of O(log n) which means the worst case scales up as the log of the collection size. This is a good thing since the log of a number x increases much slower than x increases. O(log n) time vs. O(n) time (a.k.a. linear time). The output of the program (shown below) below clearly shows this to be the case.

#include <stdio.h>
#include <math.h>

/*
Calculate binary search worst case step counts for a range of collection
sizes and compare to worst case step counts with log2(collection size).
Do this for a wide range of node sizes. It's remarkable that in only 21
steps (or fewer, obviously), any entry can be found in a collection of a
million sorted items. Especially compared to a sequential search of the
entire collection.
*/

unsigned long int worst_case(unsigned long int collectionSize)
{
/*
This pretends to step through a binary search process that assumes
the target item is the last item in the collection which requires
the maximum number of steps. Binary search starts with an upper and
lower bound and sees if the target item is above or below the midpoint
and adjusts the lower and upper bounds of the search's next step to
be the half of the distribution containing the target item.  Halving
of the next step's search area continues until the item is found or
the item is determined not to be present in the collection. Again, this
does not actually do a binary search, it just counts the number of steps
required to actually do one.
*/
unsigned long int low=0;
unsigned long int high=collectionSize-1;
unsigned long int mid;
unsigned long int steps=0;
unsigned long int target=collectionSize-1; //Last number in virtual collection
while(low<=high)
{
mid=(low+high)/2; //Midpoint of current range of focus
steps++;
if (mid==target)
break; //Simulates item found, so exit loop
else if (target<mid) //Simulates target is in lower half of range
high=mid;
else //Target>mid -- Simulates target is in upper half of range
low=mid+1;
}
return steps;
}

int main()
{
unsigned long int maxCollectionSize = pow(2,24);
int i;
unsigned long int collectionSize;
float log2ofSize;
unsigned long int wc;

printf("Collection Size Worse Case Log2(Collection Size)\n");
printf("--------------- ---------- ---------------------\n");

for(i=0; pow(2,i)<=maxCollectionSize; i++)
{
collectionSize=pow(2,i);
log2ofSize = log2(collectionSize);
wc = worst_case(collectionSize);
printf("%15i %10i %21.2f\n", collectionSize, wc, log2ofSize);
}

return 0;
}

/*
Output:

Collection Size Worse Case Log2(Collection Size)
--------------- ---------- ---------------------
1 1 0.00
2 2 1.00
4 3 2.00
8 4 3.00
16 5 4.00
32 6 5.00
64 7 6.00
128 8 7.00
256 9 8.00
512 10 9.00
1024 11 10.00
2048 12 11.00
4096 13 12.00
8192 14 13.00
16384 15 14.00
32768 16 15.00
65536 17 16.00
131072 18 17.00
262144 19 18.00
524288 20 19.00
1048576 21 20.00
2097152 22 21.00
4194304 23 22.00
8388608 24 23.00
16777216 25 24.00
*/

Monday, July 8, 2013

Turn off ERRORABEND for an optional, potentially unstable part of a SAS program

Usually I want my production SAS processes to terminate immediately when they encounter an unhandled error. This is to prevent corrupting existing data structures with bad updates and the like. To this end, I set the ERRORABEND option at the beginning of these programs since it will abend/terminate a program when an error occurs.

Some of my programs %INCLUDE code downloaded from a live FTP site. I always want the latest code or I could cache a local copy. Unfortunately the FTP site is sometimes unexpectedly down for maintenance when my program calls out to it and causes an error. Given that I have ERRORABEND set, this crashes the program run before it even really gets going. I decided that the FTP-related code is not absolutely essential for a successful program run, so I needed a way to keep going through the rest of the program even if the FTP part fails.

The approach I've taken to "ignore" an FTP-related error is to set the NOERRORABEND system option right before the FTP download and then set it back to ERRORABEND (or whatever it was set to) right after the FTP code gets executed.

The key part is the DICTIONARY.OPTIONS table that stores current option settings. You can think of this table as storing a list of key-value pairs, with the key being the option name and the value being the option's current setting. The key variable is called OPTNAME and the value variable is called SETTING.

Here's a sample of what you might find in the table.

proc sql;
select optname, setting
from dictionary.options
where upcase(optname) like 'ER%';
quit;

Option Name Option Setting
--------------------------------------
ERRORABEND NOERRORABEND
ERRORBYABEND NOERRORBYABEND
ERRORCHECK NORMAL
ERRORS 20

From this, you can tell my current setting for ERRORABEND is NOERRORABEND.

The unstable FTP handling code uses this information and executes the following steps: (1) grab the current value of the ERRORABEND option from DICTIONARY.OPTIONS, (2) stash the value in a macro variable for later use, (3) set the NOERRORABEND option, (4) run the FTP-related code, and (5) revert back to whatever (NO)ERRORABEND setting was in force before (the value stored in the macro variable during step 1). Here is the code.

*Capture current ERRORABEND setting in a macro var;
proc sql noprint;
select setting into :errorabend_setting
from dictionary.options
where upcase(optname)='ERRORABEND';
quit;

*Turn off errorabend with NOERRORABEND setting;
options NOERRORABEND;

*Download and run the code file hosted on FTP;
filename runme ftp "file_to_download.sas"
host = "made_up_domain_name.org"
cd = "/formatcode"
pass = "secret"
user = "myloginname";
%include runme;

*Return errorabend setting to what it was prior to the FTP steps;
options &errorabend_setting;

A google search for "%opt2mvar" illustrates how this functionality can be packaged up in to a reusable macro.

Obs Depot