Glorified Proc Contents for Secondary Source Data

Bruce Thomas, VAMC/REAP, Providence RI

ABSTRACT

Researchers often want to know as much as possible about their data as early as possible. This is particularly important in studies that use administrative, claims and utilization data from multiple sources. In these studies, data usually need to be combined and judgments made about data quality and relevance. SAS® users are typically interested in data dictionaries, but most stop at PROC CONTENTS printed in a monospace font. Borrowing from the ‘glorified proc contents’ approach employed in drug and device regulatory submissions, this paper describes a method for documenting the contents of one or more datasets or format catalogs; in addition, it provides an approach to creating desktop guides in PDF format that describe the distribution of the variables in those datasets.

A BRIEF HISTORY OF META DATA

The term ‘metadata’ is often defined as ‘Data about Data.’ An alliterative abstraction that can confuse non-programmers, the term has been employed since the 1960's to describe the documentation codebooks that ran on IBM mainframe systems. The term has since been trademarked (note i). The National Information Standards Organization makes a distinction between ‘Structural’ and ‘Descriptive’ metadata in Understanding Metadata (NISO, 2004) (note ii). Structural metadata are ‘data about containers of data’, and Descriptive metadata are ‘data about data contents’ used to help guide users to information.

Several common applications on the PC desktop employ Meta data to help people search for information. Anyone who uses a browser has already worked with Meta data, because it is widely employed in search engines. Early HTML web designers tried to populate the <META> tags with buzz words to attract seekers to their web pages. RSS feeds available on most browsers today use Extensible Markup Language (XML) markup, which offers a way to organize content. FDA regulators use Define.XML as a backbone to review clinical trial applications. Microsoft has adopted XML as a basis for its Office documents (ever see the *.docx extension?), and SAS® has developed a full suite of markup tools that deal with XML schemas, which are, in effect, Meta data. All are tools to facilitate organizing and retrieving information.

SAS datasets and SAS views contain parts that describe the data, called data descriptors; SAS datasets also contain the data themselves. This descriptor portion is made available in turn to the CONTENTS and DATASETS procedures. The earliest Base SAS® procedure I know of to provide this functionality to end users is the CONTENTS Procedure, a.k.a. PROC CONTENTS. Its basic utility is twofold: (1) it generates printed content and (2) it generates a dataset. This functionality is also present in the CONTENTS statement of PROC DATASETS, and some additional features of the CONTENTS statement permit creating additional output that describes indexes and integrity constraints.

Pharma & Healthcare, NESUG 2011

The interactive display manager also provides a view contents window as well as the DIR and VAR windows, and when opening a dataset in the view table window one is able to examine variable attributes by clicking on the column header. SAS® added the SQL Dictionary tables in Version 6.12; these are special read-only tables, paired with the numerous SASHELP views, that describe many of the objects in the SAS environment, including datasets, variables, formats and even macros. The advent of ODS allowed us to differentiate the parts of PROC CONTENTS output: the dataset attributes and the variable attributes. The self-documenting feature of SAS datasets also helps with organizing and maintaining large, complex databases, and we now see the Metadata Server as part of the SAS® Business Intelligence Platform (note iii).
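As a minimal sketch of the two routes just mentioned, the following assumes that the WORK.FMTALL dataset shown later in Fig. 1 exists in the session. The output dataset names (COLS, DSATTR, VARATTR) are placeholders chosen for illustration and are not part of the application described in this paper.

```sas
/* Route 1: query the SQL dictionary tables for variable-level metadata. */
/* LIBNAME and MEMNAME values are stored in upper case.                  */
proc sql;
  create table cols as
  select varnum, name, type, length, label, format
  from dictionary.columns
  where libname = 'WORK' and memname = 'FMTALL'
  order by varnum;
quit;

/* Route 2: let ODS split PROC CONTENTS output into its parts:   */
/* dataset attributes in one dataset, variable attributes in the */
/* other.                                                        */
ods output Attributes=dsattr Variables=varattr;
proc contents data=work.fmtall;
run;
```

Either route yields a dataset rather than a printout, which is what makes the metadata available to downstream code.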

Despite this evolution, widespread use of ‘data about data’ in day-to-day work in the SAS® community has yet to occur, and it remains a specialty area that is difficult to integrate into the workflow. One reason may lie in its abstract nature and in the static, one-off nature of the results. Once you’ve documented the dataset, there it is. There may well be experienced SAS programmers who never progress beyond a basic PROC CONTENTS, but their ability to develop reusable software solutions may be limited by this.

SAS® DATASET CONTENTS

The CONTENTS Procedure, a.k.a. PROC CONTENTS, is widely used. It is simple to code in a few lines. It provides access to descriptive information about datasets in printed form. It is even able to generate a dataset that contains both structural and descriptive Meta data to help guide us through the datasets in a study. Structural metadata are available in the familiar SAS® PROC CONTENTS output. As shown in Fig. 1, ‘Label’, ‘Observations’ and ‘Sorted’ display basic information about the dataset as a whole. In this example, one can tell that the dataset was created during a session in the WORK library, that it is not labeled (a typical omission with temporary datasets) and that it is not sorted. A recent enhancement provides information about the sort order in the display results after the ‘Alphabetic List of Variables and Attributes.’ Descriptive information about the dataset’s contents is more useful, and ‘Variable’, ‘Type’, ‘Len’ and ‘Label’ provide information about the attributes of the variables contained in the data set. ‘#’ in the output display reflects the order in which the variable appears in the Program Data Vector. The VARNUM option on PROC CONTENTS can be used to order the display by ‘#’, but the default is alphabetical by the ‘Variable’ column. All SAS® datasets have Meta data that can be surfaced this way.

Figure 1: Proc Contents Structural Information

Data Set Name        WORK.FMTALL                    Observations           17839
Member Type          DATA                           Variables              21
Engine               V9                             Indexes                0
Created              Sunday, May 29, 2011 10:07:    Observation Length     200
Last Modified        Sunday, May 29, 2011 10:07:    Deleted Observations   0
Protection                                          Compressed             NO
Data Set Type                                       Sorted                 NO
Label
Data Representation  WINDOWS_64
Encoding             wlatin1 Western (Windows)

Engine/Host Dependent Information

Data Set Page Size          16384
Number of Data Set Pages    221
First Data Page             1
Max Obs per Page            81
Obs in First Data Page      64
Number of Data Set Repairs  0
Filename                    fmtall.sas7bdat

DATA ABOUT DATA

This printed output is helpful, but running the CONTENTS Procedure for every dataset in a big research project can be time consuming and error prone. The OUT= option on the PROC CONTENTS statement permits creating a dataset that picks up the attributes at run time.

Output ‘CONTENTS’ datasets always describe what is in each dataset of interest, and the user can specify both the library and the dataset(s) in the library with the DATA= option. Each dataset of interest has its own variables, and the variable descriptions will appear in the contents dataset. SAS 101 tells us that variables have attributes, and here they are: a Name, a Label, a Length, a position in the dataset, a Type (Character or Numeric) and a Format. This is true for ‘regular’ datasets and even for the datasets produced by the PROC CONTENTS OUT= option.
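The following sketch shows one way to collect these attributes for a whole library in a single step. It assumes the WORK library holds at least one dataset; the names WORK.META, WORK.DICT and TYPEC are hypothetical and chosen only for this example.

```sas
/* Collect variable-level metadata for every dataset in a library. */
/* NOPRINT suppresses the printed report; OUT= names the dataset.  */
proc contents data=work._all_ out=work.meta noprint;
run;

/* Keep the attributes most useful for a data dictionary. */
data work.dict;
  set work.meta(keep=libname memname name type length varnum
                     label format nobs);
  length typec $4;
  /* In the OUT= dataset TYPE is numeric: 1 = numeric, 2 = character. */
  typec = ifc(type = 1, 'Num', 'Char');
run;

/* Order by dataset and by position of the variable in the dataset. */
proc sort data=work.dict;
  by memname varnum;
run;
```

Because the OUT= dataset always has the same columns, the same downstream code can be reused regardless of which study dataset is being described.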

^ IN( Kansas.Anymore)

To successfully use PROC CONTENTS datasets to understand Meta data, it helps to remember that variable attributes are variables themselves. As the PROC CONTENTS display of a PROC CONTENTS dataset (Fig. 4) shows, there are several Meta data variables arranged in order of appearance. The first thing to note is the order: general dataset information, followed by general variable information, then by more structural meta information such as NOBS, SORTED (yes/no) and SORTEDBY (the variable's position in the sort order). In the PROC CONTENTS output dataset in Version 9.2, there are currently 40 common variables for each dataset of interest. That seems like a lot of metadata.

The most useful variables for the data dictionary are highlighted: NAME (‘Variable Name’) can be as long as 32 characters now. TYPE is coded 1 for numeric or 2 for character variables, and FORMAT describes the format assigned to the variable. Formats can be either user defined (e.g. $AGEFMT) or SAS formats (e.g. BEST11.). When the dataset is printed, the variable LIBNAME here might show ‘WORK’ in upper case under the Library Name column for each observation.

Alphabetic List of Variables and Attributes

 #  Variable  Type  Len  Label
20  DATATYPE  Char    8  Date/time/datetime?
 3  END       Char   16  Ending value for format
 1  FMTNAME   Char   32  Format name
17  HLO       Char   11  Additional information
 4  LABEL     Char   68  Format value label
 8  LENGTH    Num     3  Format length
11  MULT      Num     8  Multiplier
13  NOEDIT    Num     3  Is picture string noedit?
10  PREFIX    Char    2  Prefix characters

Figure 2 Proc Contents Variable Descriptions

/* Write the variable attributes of FMTALL to dataset TEMP */
PROC CONTENTS DATA=FMTALL OUT=TEMP;
RUN;
/* Describe the structure of the contents dataset itself */
PROC CONTENTS DATA=TEMP;
RUN;

Figure 3 Contents of Proc Contents Output

Similarly, variable #2, MEMNAME (the ‘Library Member Name’ column), would show the text ‘FMTALL’, the name of the dataset we’re dealing with in our DATA= option. Not much so far: in this particular case, we are looking at the FMTALL dataset, and we knew that anyway. This may be useful for a bigger data dictionary for a whole library, but our basic design is to create one for each dataset.

There are metadata that are not made available to SAS Users via PROC CONTENTS or the methods outlined so far. In the data I work with, variables are typically continuous (age in years), discrete (gender) or somewhere in between (zip codes). Most importantly, these variables all have some sort of distribution and varying degrees of ‘missingness.’ In some situations, the official documentation can be a bit dated and new variables sometimes might appear with no supporting documentation. The meanings of these variables are sometimes elusive, so their distributions can provide a clue. These aspects of metadata must be ascribed to the data properly.

DEFINE.PDF TO THE RESCUE

A principal driver for exploiting meta data with SAS® since the early part of the 21st century has been the desire by the FDA to work with the industry to standardize the information that the clinical trials research community submits for regulatory review. As a result, a growing body of SAS® programmers has gained familiarity with computer-assisted new drug applications (CANDA), Electronic Submissions (ESUB), and the SDTM and ADaM data models promulgated through the Clinical Data Interchange Standards Consortium (CDISC). PC desktop products have evolved as well in those settings, and many users may have firsthand experience working with PDF documents that integrate information about dataset structure and variable descriptions with the data collection instruments themselves (note iv). Pharmaceutical and device submissions have evolved beyond the days of this ‘DEFINE.PDF’ to a newer version based on XML, driven primarily by the industry and regulators in search of true public domain software. The model for define.pdf is one where a table of dataset contents is hyperlinked to a rich body of objects of interest to reviewers: variable descriptions, the actual datasets, and for each variable, links to the format code lists and even to the page of the actual data collection instrument (referred to as 'blankcrf.pdf').

The define.pdf is an inventory of the variables that contains links to the actual datasets, the format catalogs and annotated case report forms; in effect, it is a code book for the study. I did not really need the extensive hyperlinking used in define.pdf, but hyperlinking between PROC CONTENTS output and the variable distributions or the contents of a format catalog was important, as was a way to get back to the table of contents from anywhere in the document. A PROC REPORT solution with hyperlinking seemed like a good idea for the table of contents (note v).

The CONTENTS Procedure
Variables in Creation Order

 #  Variable  Type  Len  Format  Label
 1  LIBNAME   Char    8          Library Name
 2  MEMNAME   Char   32          Library Member Name
 3  MEMLABEL  Char  256          Data Set Label
 5  NAME      Char   32          Variable Name
 6  TYPE      Num     8          Variable Type
 7  LENGTH    Num     8          Variable Length
 8  VARNUM    Num     8          Variable Number
 9  LABEL     Char  256          Variable Label
10  FORMAT    Char   32          Variable Format
……
18  NOBS      Num     8          Observations in Data Set
24  MEMTYPE   Char    8          Library Member Type
30  SORTED    Num     8          Sorted and/or

Figure 4 The Structure of Proc Contents Output

The technology behind this hyperlinking is based on the notion of PDF named destinations, which are low-level objects in the Portable Document Format architecture. Unlike HTML anchors, named destinations in PDF are separate from textual data (note vi). A fully linked define.pdf document is a work of art; unfortunately, it is a work of art that is frequently created at the end of the clinical trials process rather than at the beginning. As such, it is expensive art.

DESIGN CONSIDERATIONS

Data dictionary end users (including me and other programmers as well as investigators) needed to be able to describe one or more variables in one or more datasets and associate those with some tabular information describing what each variable contains. The application needed to iterate through each dataset’s variable list, select the important ones and manage the way they are displayed and summarized. We know that a mean zip code is (ahem) meaningless, just as a frequency table of ages is voluminous and not very interesting to look at; however, the range of values, including missing values, can be very useful.

Once we have a way to iterate through a list of variables, we need to find out the type of each variable, a label that describes the variable for us (We should all label our variables), and some information about the variable’s format (name). PROC CONTENTS offers an easy-maintenance solution, particularly since we can use it to dynamically generate an output dataset to populate the list of variable names for our table of contents and links. The dataset could vary, but the proc contents would always give us the same variables (here, the NAME variable) to work with to help us name the destinations.

Each destination would consist of output from a SAS procedure and would be based on the value of this NAME variable. The SAS procedure would in turn be based on the number of levels in the variable as well as the variable’s ascribed type (long discrete numeric, protected health information, continuous, discrete). To generate distribution information, we can construct standard routines to iterate through variables and run frequency or descriptive statistics procedures. To get the linking we need between a variable name and its distribution, or between a format name and its contents, ODS PDF ANCHOR and ODS PDF TEXT offered some good ways to define the destinations. To build the hyperlinks to the destinations, the CALL DEFINE statement in PROC REPORT would need some special attention and some research into SUGI lore (note vii). For our format catalog tool, we can get format information into a dataset with the CNTLOUT= option on the PROC FORMAT statement. With this set of basic ingredients, I was prepared to assemble a glorified PROC CONTENTS.
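To illustrate the format catalog ingredient, here is a small sketch of getting format definitions into a dataset. The AGEFMT format and the WORK.FMTDEMO dataset are invented for this example and are not taken from the study data.

```sas
/* A small user-defined format, so there is something to document. */
proc format library=work;
  value agefmt
    low -< 18  = 'Under 18'
    18  -< 65  = '18 to 64'
    65 - high  = '65 and over';
run;

/* CNTLOUT= writes one row per format range to a dataset. */
proc format library=work cntlout=work.fmtdemo;
run;

/* FMTNAME, START, END, LABEL, TYPE and HLO are standard columns */
/* of a CNTLOUT= dataset.                                        */
proc print data=work.fmtdemo noobs;
  var fmtname start end label type hlo;
run;
```

The resulting dataset has the same general shape as the FMTALL dataset described in Fig. 2, so it can feed a table of contents in the same way a PROC CONTENTS dataset does.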

GLORIFIED PROC CONTENTS

In research programming, there is a recurring need, throughout the data management process, for accessible information about the way variables are defined and values are distributed in the data. This is particularly true in projects where a large amount of secondary source administrative data has to be screened and assimilated at various stages of analytic file development. Faced with the task of integrating data from at least 5 different sources spanning a period of several years, I chose to borrow from the define.pdf ‘model’ to construct a simple, standard dictionary application. This would consist of a table of contents and hyperlinks to PDF destinations containing either format descriptions or information about variable distributions. In the former case, users would be able to generate a readable, linked description of what is in one or more format catalogs. In the latter case, end users would have a chance to look at the basic spread of values in variables found in datasets. Unlike define.pdf, I would not integrate the formats and the distributional information; this is a possibility for future improvements. The approach would have to be generalized so I wouldn’t have to rewrite too much code. SAS® ODS PDF would be useful for three reasons: Version 9.2’s enhancements bring both new layout features and improved stability with graphics, the output would be almost universally readable across PC and UNIX platforms, and it would handle hyperlinking (note viii).

PROC CONTENTS DATA=WORK.FMTALL;
RUN;

Fig. 5 Un-glorified PROC CONTENTS

Front End

In the first design iteration, we needed a simple interface where users could select variables and put them into different kinds of lists: (1) exclusions, (2) continuous variables, (3) discrete variables. Exclusions were easy: researchers really don’t need or want to know anything about certain kinds of private health information, and some data are simply not of any interest from the outset. Continuous variables were interesting, because they could be described using PROC MEANS, the summary proc of choice, as well as simple distribution plots. The box plot available through SGPLOT in SAS® 9.2 offered some interesting features to help us understand continuous variables.
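A continuous-variable summary of the kind described here might look like the sketch below. It uses the SASHELP.CLASS sample dataset and its HEIGHT variable purely for illustration; the statistics chosen are an assumption, not the set used by the application.

```sas
/* Descriptive statistics, including the count of missing values. */
proc means data=sashelp.class n nmiss min p25 median p75 max mean std
           maxdec=2;
  var height;
run;

/* Horizontal box plot of the same variable (SGPLOT, SAS 9.2 and later). */
proc sgplot data=sashelp.class;
  hbox height;
run;
```

Pairing the table with the plot gives a reader both the range of values and a quick visual check for outliers.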

These processes are easily generalized across data types, but to make that possible, we needed a way to handle the particular instances of Meta data that live in a particular study dataset. The solution is a wrapper program (BuildDictionary.sas). The particular dataset under consideration in the wrapper program provides the dataset model we need to build the different variable lists.

Once we had a way to handle the particular dataset, we could then pass its metadata to a controller program (DataDictionary.sas, fig. 6) that would actually create the different variable lists and generate a table of contents for each list, with hyperlinks to PDF destinations. The controller would also call the appropriate frequency or distribution routine. The controller program is invoked in the wrapper program by setting macro parameters. For example, PHI (for protected health information) removes personal identifiers from processing altogether. DataDictionary.sas in turn uses these parameters to generate temporary datasets containing the variables of each data type and routes these to the appropriate summary routine.

%DataDictionary( dslib=RAW
                ,dsname=XMBASE
                ,title=This is a title
                ,phi=scrssn
                ,category=
                ,cutoff=25
                ,Longdiscretevars= bornday bornyear disday distime admitday adtime
                                   homecnty visn zip statyp scper homepsa
                                   dxf11 dxf12 dxf13 dxf2 dxf3 dxf4 dxf5 dxf6
                                   dxf7 dxf8 dxf9 dxf10 updatday sta3n homstate drg
                );

Figure 1 Data Dictionary Launch macro


Originally, I put the burden on the user (me) to correctly put the variables into the different buckets. While a result was achievable, this involved some tedious trial and error, forcing me to test each run with obs=100 for accuracy and then with obs=1000 or more to make sure the distributions were coming out as expected. A subsequent iteration opted to use a threshold (note ix) to help make the decision about how to handle the discrete variables. Now, if a variable has more values than the threshold and it is NOT a continuous variable, a ‘top/bottom 5’ approach is used; otherwise a frequency table is generated. At this time, the only list that requires populating is the list of numeric variables to exclude from continuous variable processing (e.g. Zip Code). That list populates a parameter named ‘LongDiscreteVars.’ This call is coded in the BuildDictionary.sas program (see fig. 6).
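The paper does not spell out how ‘top/bottom 5’ is computed, so the sketch below rests on an assumption: that it means the five most frequent and five least frequent values of a variable. The SASHELP.CARS sample dataset, its MAKE variable and the dataset names COUNTS and TOPBOTTOM5 are all illustrative.

```sas
/* Count every distinct value, including missing, without printing. */
proc freq data=sashelp.cars noprint;
  tables make / out=work.counts missing;
run;

/* Order the values from most to least frequent. */
proc sort data=work.counts;
  by descending count;
run;

/* Keep the first five and the last five rows of the ordered list. */
data work.topbottom5;
  set work.counts nobs=n;
  if _n_ <= 5 or _n_ > n - 5;
run;

proc print data=work.topbottom5 noobs;
run;
```

If ‘top/bottom 5’ instead means the five lowest and highest values, sorting by the variable itself rather than by COUNT would give that variant.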

Back End

The DataDictionary macro builds a single ordered list of variables, including basic variable metadata, for the report table of contents in the REPORT Procedure; at the same time, it adds links to PDF destinations via hyperlinks created in a CALL DEFINE block. The macro then uses this list of variables to identify the ascribed data type, places each variable name in an input dataset and routes each dataset to the proper summary (DD) macro. Each DD macro in turn applies PROC SQL to generate an ordered list and iterates through it: this starts with frequency tables, followed by ‘top/bottom5’ processing and PROC MEANS summarization for continuous variables. At the start of processing for each variable, the DD<name> macro defines a unique PDF destination using the ODS PDF ANCHOR statement with the value of the variable’s NAME (from PROC CONTENTS). This creates a PDF document ‘destination’ that matches the link used in the table of contents. After completing the procedure, DataDictionary applies a ‘Return to contents’ footnote with a link and then proceeds to the next variable. The output is ordered as follows: frequencies, top/bottom5, continuous variables; the table of contents, however, is alphabetical.
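The linking pattern described above can be sketched in a few steps. This is a simplified illustration, not the DataDictionary macro itself: it uses the SASHELP.CLASS sample dataset, a hypothetical output file name, and a single hard-coded destination (‘Age’) where the real application loops over every variable.

```sas
ods escapechar='^';
ods pdf file='dictionary_sketch.pdf';   /* hypothetical file name */

/* Variable-level metadata drives the table of contents. */
proc contents data=sashelp.class out=work.meta noprint;
run;

/* Table of contents: each variable name links to '#<NAME>'. */
ods pdf anchor='contents';
proc report data=work.meta nowd;
  column name label length format;
  define name   / display 'Variable';
  define label  / display 'Label';
  define length / display 'Length';
  define format / display 'Format';
  compute name;
    call define(_col_, 'URL', '#' || strip(name));
    call define(_col_, 'STYLE', 'style=[foreground=blue]');
  endcomp;
run;

/* One destination per variable, named with the value of NAME. */
ods pdf anchor='Age' startpage=now;
proc means data=sashelp.class n nmiss min mean max;
  var age;
run;

/* Link back to the table of contents. */
ods pdf text="^S={just=c url='#contents'}Return to Contents";

ods pdf close;
```

The key design point is that the same NAME value is used twice, once to build the URL in the CALL DEFINE block and once to name the anchor, so the two cannot drift apart.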

The statistical procedures are provided in a series of ‘DD’ macros that need to be made available to the SAS session, either through %include or an autocall library. I developed small single-purpose macros in a library (www.github/bhthomas) to handle the three different data types. This is a small, reusable code library that could be invoked without the dictionary front end. One important limitation of these 3 macros is that they are all currently ODS- and PDF-specific, containing expressions like:

ODS PDF ANCHOR="&&NME&I" STARTPAGE=NOW;

This could be made conditional at some point, but PDF linking is kind of nice and PDF viewers are everywhere now. In this design, the data types are processed as a group and routed as a group to the appropriate SAS ® procedures.

Hyperlinking was the most complex part of the application because there were two parts that needed to be processed in the same order. The table of contents used a URL format to map variable numbers (from the PROC CONTENTS VARNUM attribute) to internal locations in the PDF, and each location had to refer to an anchor defined with the variable’s NAME attribute. To make the table of contents less terse, I opted to use the variable’s label if it was available.

Figure 2 How many levels in the data?

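/* Route the NLevels table (one row per variable) to dataset NUMBR */
ods output nlevels=numbr;
/* NLEVELS counts distinct values; NOPRINT suppresses the frequency tables */
proc freq data=IN_ nlevels;
  tables _all_ / noprint;
run;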
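As a hedged sketch of how the level counts could feed the threshold decision, the step below compares each variable's NLEVELS value with a cutoff. The cutoff of 25 mirrors the cutoff= value in the launch macro call; the SASHELP.CARS sample dataset, the BUCKETS dataset and the bucket labels are assumptions made for this example.

```sas
%let cutoff = 25;

/* One row per variable, with its number of distinct values. */
ods output nlevels=numbr;
proc freq data=sashelp.cars nlevels;
  tables _all_ / noprint;
run;

/* Assign each variable to a summary routine based on the cutoff. */
data work.buckets;
  set numbr;
  length bucket $12;
  if nlevels > &cutoff then bucket = 'TOPBOTTOM5';
  else bucket = 'FREQ';
run;

proc print data=work.buckets noobs;
  var tablevar nlevels bucket;
run;
```

Continuous variables would still need to be separated out first, since a count of levels alone cannot distinguish a continuous measure from a long discrete code such as a zip code.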


OUTPUT

The report consists of a title page created with PROC GSLIDE using the title parameter. The next section is a table of contents generated by PROC REPORT using a PROC CONTENTS dataset to display the variable’s name, label, length and format. In that table, the variable name is hyperlinked using an approach found in Carpenter (2007). The address for the table of contents is defined using ODS PDF ANCHOR='contents';

ODS PDF output in SAS has several default settings that may confuse readers and a few issues needed to be addressed to make it readable. These issues include:

• The table of contents Bookmarks are exploded by default. This is handled in SAS 9.2 with ODS PDF statement options, where only the first level of the Table of contents is displayed: bookmarklist=show pdftoc=1

• The primary bookmarks themselves are not very descriptive (e.g. ‘The PRINT Procedure’, ‘Data Set WORK.HKZIP’). This is handled through ODS PROCLABEL statements that pick up the name or label of the variable currently being summarized from the list.

• The list of bookmarks is long and sometimes unwieldy, particularly if no variable labels are available. The table of contents exists to provide a path into the dataset anyway, so in SAS 9.2 I was able to insert ODS PDF NOBOOKMARKGEN before the various summary procedures, which removed their bookmarks altogether.
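The first two settings in the list above can be sketched as follows. The file name and the PROCLABEL text are placeholders, and SASHELP.CLASS stands in for a study dataset.

```sas
/* Show the bookmark pane, expanded to the first level only. */
ods pdf file='dictionary_sketch.pdf' bookmarklist=show pdftoc=1;

/* Replace the default 'The MEANS Procedure' bookmark text. */
ods proclabel 'Age: distribution';
proc means data=sashelp.class n nmiss min mean max;
  var age;
run;

ods pdf close;
```

In the application itself the PROCLABEL text would come from the variable's name or label in the PROC CONTENTS dataset rather than from a literal string.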

These options together yielded the relatively attractive bookmark list to the right. Each takes the user to a section of the table of contents where variable names and labels are listed.

While ODS PDF bookmarks are now easily manipulated to reduce clutter, thornier visual problems were encountered, including:

• At low magnification, PDF anchor Text appears truncated at the end of the string.

• Hyperlinks have a blue border all around the link. Users have grown to expect browser-style links.

At the completion of each variable summarization (one per page), the expression

ODS PDF TEXT="^S={JUST=C URL='#CONTENTS' LINKCOLOR=WHITE}Return to ^S={COLOR=BLUE}Contents .";

points the user back to the table of contents. The LINKCOLOR= style setting is designed to get rid of an annoying blue box that is painted by default around each hyperlink. Note also the extra spaces after the text ‘Contents’ and the closing period: at around 56% magnification in Adobe Reader, the text would start to appear whited out at the end, so I added some text padding. This problem goes away at 100% magnification, but the period is still there. So in the body of the report, we have a blue link that takes us back to the contents destination in the document.

In the table of contents, variables are grouped by variable type (Continuous-Range-Discrete) and listed alphabetically by variable name within each group. Clicking the hyperlink at the variable name takes the user to a page that shows the corresponding SAS® output. Output examples are shown in the Appendix.

CONCLUSION

This approach has been used in 12 different locations with raw as well as analysis datasets and serves as a valuable reference documentation and data exploration tool. This simple table of contents approach is easily adapted to documenting format catalogs, and such reference tools are blooming in the file system. I would argue that this Glorified Proc Contents approach is becoming almost idiomatic in SAS and that if SAS programmers take the time to understand what ‘data about data’ really means, they can add real value to their work. If I can develop something useful like this, much credit is due to the patience and intelligence of a veritable army of SAS users, SUGI authors and excellent programmers who have taken the time to show me what code reuse and metadata-driven programming are really about.

ACKNOWLEDGMENTS

I would like to thank William Qubeck for coining the term used in the title and for introducing me to define.pdf in the first place.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Bruce Thomas VAMC/ Providence Research Building 32 Providence, RI [email protected]

Code for this paper should be considered public domain:


via github : https://github.com/bhthomas/nesug

REFERENCES/ NOTES

i The term "metadata" is copyrighted and trademarked by Metadata, LLC of Nashville, TN. (www.metadata.com). Formerly known as Metadata Information Partners, it is a software firm specializing in data management products, consulting, and custom information systems to the health care industry. Although the term "metadata", spelled the same way, is widely used to refer to "data about data", Metadata LLC trademarked the name in 1986 and was granted "incontestable" status in 1991. So there is always the threat that if we publicly use the term "metadata" the company could pursue their trademark enforcement. The most common substitutions for "metadata" we see today are "meta-data" or "meta data".

From Correct Terminology: Do We Say "Metadata", "Meta-Data", or "Meta Data" ? metadataforums.com, Stu Carty, Published 01/7/2007

ii Understanding Metadata Copyright © 2004 National Information Standards Organization ISBN: 1-880124-62- http://www.niso.org/publications/press/UnderstandingMetadata.pdf

iii Cynthia Zender, SAS®, VASUG presentation on Stored Processes, May 2011

iv See PharmaSUG 2011, Paper TU01, Creating Hyperlinked PDF Graphical Patient Profiles with PROC REPORT, William Conover, Advanced Clinical, Bannockburn, IL, http://www.pharmasug.org/proceedings/2011/TU/PharmaSUG-2011-TU01.pdf

v Art Carpenter discusses proc report hyperlinking in http://www.lexjansen.com/wuss/2009/how/HOW-Carpenter.pdf.

vi “How to put PDF Named Destinations work for you”, Sven-Olav Paavel, 2011, http://www.mindtheflex.com/?p=86#more-86

vii Art Carpenter (note v)

viii It has. For a discussion of hyperlinking in ODS PDF: http://support.sas.com/resources/papers/proceedings10/035-2010.pdf

ix Continuous or Not: How One Can Tell, Vatsala Karwe, Mathematica Policy Research, http://www2.sas.com/proceedings/sugi28/088-28.pdf


Figure 3 -- Proc Gslide Title Page

Figure 4- Table of contents (hyperlinks in blue)

Figure 5 -- Top/Bottom5 display

Figure 6 - Continuous Variable Display

Figure 7 -- Frequency Distribution
