This post will show how to do that using a very simple made-up set of three tables as shown here:
The detail table SALES contains one record per sale. The variables PRODUCT_ID and SIZE_CD indicate what product and what size product was sold. Ten sample observations are shown above. The 'data' library also contains a pair of lookup tables to translate the PRODUCT_ID into a description and SIZE_CD into a description. The lookup tables are called PRODUCT_LOOKUP and SIZE_LOOKUP.
Here is SAS code to create the three tables.
libname data 'fill in your own physical path here';
data data.sales;
length product_id $5. size_cd 3.;
input product_id $ size_cd;
datalines;
ICE 1
SNO 2
ICE 3
ICE 1
SNO 1
SAND 2
ICE 2
ICE 2
SNO 3
SNO 1
;
run;
data data.product_lookup;
length product_id $5. description $20.;
input product_id $ 1-5 description $ 6-25;
datalines;
ICE ICE CREAM
SNO SNOW CONE
SAND ICE CREAM SANDWICH
;
run;
data data.size_lookup;
length size_cd 3. description $20.;
input size_cd description $;
datalines;
1 SMALL
2 REGULAR
3 LARGE
;
run;
If we run a proc freq to crosstab products with sizes, we get this (first code, then output):
proc freq data=data.sales;
tables product_id * size_cd;
run;
tables product_id * size_cd;
run;
Table of product_id by size_cd
product_id size_cd
Frequency|
Percent |
Row Pct |
Col Pct | 1| 2| 3| Total
---------+--------+--------+--------+
ICE | 2 | 2 | 1 | 5
| 20.00 | 20.00 | 10.00 | 50.00
| 40.00 | 40.00 | 20.00 |
| 50.00 | 50.00 | 50.00 |
---------+--------+--------+--------+
SAND | 0 | 1 | 0 | 1
| 0.00 | 10.00 | 0.00 | 10.00
| 0.00 | 100.00 | 0.00 |
| 0.00 | 25.00 | 0.00 |
---------+--------+--------+--------+
SNO | 2 | 1 | 1 | 4
| 20.00 | 10.00 | 10.00 | 40.00
| 50.00 | 25.00 | 25.00 |
| 50.00 | 25.00 | 50.00 |
---------+--------+--------+--------+
Total 4 4 2 10
40.00 40.00 20.00 100.00
While we are actively working on the sales data project, we might easily retain in our memory the fact that size 1 is small, size 2 is regular, and size 3 is large, but what if we set aside this project for 6 months or longer and then come back to it? What if we had hundreds of products and dozens of size options? What if we need to hand this project off to a new analyst or present findings to other staff not familiar with the data? It would make all of our lives easier if the proc freq output indicated what sizes 1, 2, and 3 mean, as well as what the product codes mean. We have the data already to provide helpful descriptions (the two lookup tables), so let's work at incorporating them.
Using proc format with the cntlin parameter gives us the ability to create a format from a dataset. In this case, the dataset needs only three fields:
- START - the code value,
- LABEL - the description associated with the code, and
- FMTNAME - the name of the format to be created (same value repeated every record).
Here's the query to define the proc format input data view for the product format:
proc sql;
create view for_product_format as
select product_id as start,
strip(product_id)||' ('||strip(description)||')'
as label,
'$product' as fmtname
from data.product_lookup;
quit;
create view for_product_format as
select product_id as start,
strip(product_id)||' ('||strip(description)||')'
as label,
'$product' as fmtname
from data.product_lookup;
quit;
A proc print of the view (proc print data=for_product_format; run;) looks like this:
Obs start label fmtname
1 ICE ICE (ICE CREAM) $product
2 SNO SNO (SNOW CONE) $product
3 SAND SAND (ICE CREAM SANDWICH) $product
Turning this data view into a format is as simple as:
proc format cntlin=for_product_format library=data; run;
In response to running this code, the SAS log says:
NOTE: Format $PRODUCT has been written to DATA.FORMATS.
In the physical library pointed to by the 'data' libref, a new file called formats.sas7bcat (if using SAS versions 7-9, at least). The following two lines of code inspect what's in that new format catalog:
proc catalog c=data.formats; contents; run;
proc format fmtlib library=data; select $product; run;
Contents of Catalog DATA.FORMATS
# Name Type Create Date Modified Date Description
----------------------------------------------------------------
1 PRODUCT FORMATC 26Mar12:15:54:52 26Mar12:15:54:52
----------------------------------------------------------------
| FORMAT NAME: $PRODUCT LENGTH: 25 NUMBER OF VALUES: 3 |
| MIN LENGTH: 1 MAX LENGTH: 40 DEFAULT LENGTH 25 FUZZ: 0 |
|--------------------------------------------------------------|
|START |END |LABEL (VER. V7|V8 26MAR2012:15:54:53)|
|----------+----------+----------------------------------------|
|ICE |ICE |ICE (ICE CREAM) |
|SAND |SAND |SAND (ICE CREAM SANDWICH) |
|SNO |SNO |SNO (SNOW CONE) |
----------------------------------------------------------------
The proc catalog output tells us there is one format in the catalog named PRODUCT and it's a character format (so it would be referred to in SAS code as $PRODUCT). The proc format fmtlib output shows us the individual code and description pairs contained in the format. As you can see, the constructed labels consist of the code values followed by the descriptions in parentheses.
Code to create a format for SIZE_CD is as follows:
proc sql;
create view for_size_format as
select size_cd as start,
strip(put(size_cd,best.))||' ('||strip(description)||')'
as label,
'size' as fmtname
from data.size_lookup;
quit;
create view for_size_format as
select size_cd as start,
strip(put(size_cd,best.))||' ('||strip(description)||')'
as label,
'size' as fmtname
from data.size_lookup;
quit;
proc format cntlin=for_size_format library=data; run;
SIZE_CD is numeric, whereas PRODUCT_ID was character, so SIZE_CD is converted to character with the put function using the built-in best format to make the label, and the name of the format is simply SIZE with no prefixing dollar sign.
Running the proc catalog code again (proc catalog c=data.formats; contents; run;) indicates that we do in fact now have two formats.
Contents of Catalog DATA.FORMATS
# Name Type Create Date Modified Date Description
----------------------------------------------------------------
1 SIZE FORMAT 26Mar12:16:14:27 26Mar12:16:14:27
2 PRODUCT FORMATC 26Mar12:15:54:52 26Mar12:15:54:52
Only one step remains and that is to attach the formats to the variables in the SALES dataset. That is accomplished with proc datasets as follows:
proc datasets lib=data nolist;
modify sales;
format product_id $product. size_cd size.;
quit;
modify sales;
format product_id $product. size_cd size.;
quit;
If we run our original proc freq at this point, we will either get an error (if the fmterr option is in effect) or no error and simply the same unformatted values in the freq output. The reason is that although we created the datasets, created the formats, and associated the formats with the dataset variables, SAS doesn't know where to look for the formats when it comes time to do the proc freq. We have to give it a hint as follows:
options fmtsearch=(data);
The format search system option tells SAS to look for formats in the 'data' library. Now when we run our proc freq, we get the desired result.
proc freq data=data.sales;
tables product_id * size_cd;
run;
tables product_id * size_cd;
run;
The FREQ Procedure
Table of product_id by size_cd
product_id size_cd
Frequency |
Percent |
Row Pct |
Col Pct |1 (SMALL|2 (REGUL|3 (LARGE| Total
|) |AR) |) |
-----------------+--------+--------+--------+
ICE (ICE CREAM) | 2 | 2 | 1 | 5
| 20.00 | 20.00 | 10.00 | 50.00
| 40.00 | 40.00 | 20.00 |
| 50.00 | 50.00 | 50.00 |
-----------------+--------+--------+--------+
SAND (ICE CREAM | 0 | 1 | 0 | 1
SANDWICH) | 0.00 | 10.00 | 0.00 | 10.00
| 0.00 | 100.00 | 0.00 |
| 0.00 | 25.00 | 0.00 |
-----------------+--------+--------+--------+
SNO (SNOW CONE) | 2 | 1 | 1 | 4
| 20.00 | 10.00 | 10.00 | 40.00
| 50.00 | 25.00 | 25.00 |
| 50.00 | 25.00 | 50.00 |
-----------------+--------+--------+--------+
Total 4 4 2 10
40.00 40.00 20.00 100.00
Since we created permanent datasets, permanent formats, and permanently associated the two, we do not have to have a format statement in the proc freq and this association will persist beyond the current SAS session. So we can happily shut down SAS, start it up again and issue the following lines without error:
libname data 'fill in your own physical path here';
options fmtsearch=(data); *Remind SAS where the formats are;
proc print data=data.sales; run;
options fmtsearch=(data); *Remind SAS where the formats are;
proc print data=data.sales; run;
Obs product_id size_cd
1 ICE (ICE CREAM) 1 (SMALL)
2 SNO (SNOW CONE) 2 (REGULAR)
3 ICE (ICE CREAM) 3 (LARGE)
4 ICE (ICE CREAM) 1 (SMALL)
5 SNO (SNOW CONE) 1 (SMALL)
6 SAND (ICE CREAM SANDWICH) 2 (REGULAR)
7 ICE (ICE CREAM) 2 (REGULAR)
8 ICE (ICE CREAM) 2 (REGULAR)
9 SNO (SNOW CONE) 3 (LARGE)
10 SNO (SNOW CONE) 1 (SMALL)
One final note. I find it helpful to tailor the description/label portion of the formats to include the actual code value (e.g., 2) and the descriptive value (e.g., SMALL). The reason is that while the description is the more --um-- descriptive portion of the format label, the underlying SALES dataset still only really contains the coded values. Applying a format changes how data is displayed, not the stored values. Thus, to subset the small product transactions, you would use this where statement where size_cd=1 rather than this where statement where size_cd='1 (SMALL)' and having the 1 right in the label saves us having to look that up what "SMALL" means in terms of coded data values.
No comments:
Post a Comment