sas is a registered trademark or trademark of sas institute inc. in the usa and other countries. ®...
Post on 21-Dec-2015
216 views
TRANSCRIPT
SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or Trademarks of their respective companies.
Using XML Mapper and XMLMAP to Read Data Documented by Data Documentation Initiative (DDI) Files Larry Hoyle
Policy Research Institute
University of Kansas
Overview
A SAS program reads an XML metadata file and writes a SAS program to read the raw data file described by the metadata file.
makeReader.sas
2825.xml DDI.map
Read2825.sas
Read2825.sas
Da2825.txt
Work.ICPSR2825Household
Work.ICPSR2825Family
Work.ICPSR2825Person
DDI
“an international effort to establish a standard for technical documentation describing social science data” - http://www.icpsr.umich.edu/DDI/index.html
DDI Files
XML• DTD - http://www.icpsr.umich.edu/DDI/users/dtd/index.html
Metadata about:• The DDI file itself
• The study that collected the data
• The data file
• Variables within the data file
• Other Material
The Minimal DDI File
<?xml version="1.0"?>
<codeBook>
<stdyDscr>
<citation>
<titlStmt>
<titl>Howdy World: Valid but Useless Metadata</titl>
</titlStmt>
</citation>
</stdyDscr>
</codeBook>
Real Example: ICPSR 6084 Raw Data File
100001 161132146211115555299 9991199 99911219
200001 49992
30000102534 000325222641942 3834101202
100002 12213112421111212112222221121 2122 12
200002 12221
30000202574 000756221622052 4261103202
ICPSR 6084 – About The Study
<citation>
<titlStmt>
<titl>CBS News Monthly Poll #2, August 1992</titl>
ICPSR 6084 – About The File
<dimensns>
<caseQnty>1,546</caseQnty>
<varQnty>70</varQnty>
<logRecL>80</logRecL>
<recPrCas>3</recPrCas>
ICPSR 6084 – Reading The File With SAS
infile 'C:\DDRIVE\data\icpsr\data\6084\da6084.txt' LRECL=80 PAD;
<dimensns> <caseQnty>1,546</caseQnty> <varQnty>70</varQnty> <logRecL>80</logRecL> <recPrCas>3</recPrCas> <recNumTot>4,638</recNumTot> </dimensns>
More From ICPSR 6084 – first variable
More From ICPSR 6084 – Reading the first variable
input
#1 cardno 1-1
More From ICPSR 6084 – another variable
#3 respno 2-6
The Tasks
Pull the necessary information from a hierarchical xml file into SAS as tables• Use XML libname engine with an XMLMAP file
Use that information in SAS to read the raw data file
Making the XMLMAP File – SAS XML Mapper
Defining Tables –What Defines Rows
Drag the element that defines rows to the root of the XMLMAP structure
Defining Tables –Row Defined
Defining Tables – What Defines Columns
Drag an element that defines a column to the root of the table
Defining Tables – Column Defined
Viewing The XMLMap File
Viewing The XMLMap File – Row Path
Viewing The XMLMap File - Column Path
Viewing Sample SAS Code
Previewing the Table
XMLMapper Limitations
Not every XML file will have all the elements of any possible XML file of that type.• Use XML Schema instead of XML file
An XML Schema file may not work• XML file type defined by DTD
• XML Schema too complex for XML Mapper
What then
You can use XMLMapper to start and then hand edit the XML MAP file.
Lots of Tables From DDI Mostly for Comments
DATADSCR_VARDATADSCR_VARGRPDATADSCR_VAR_CATGRYDATADSCR_VAR_INVALRNGDATADSCR_VAR_INVALRNG_ITEMDATADSCR_VAR_INVALRNG_RANGEDATADSCR_VAR_VALRNG_ITEMDATADSCR_VAR_VALRNG_RANGEDOCDSCR_CITATION__AUTHENTYDOCDSCR_CITATION__COPYRIGHTDOCDSCR_CITATION__IDNODOCDSCR_CITATION__OTHIDDOCDSCR_CITATION__PRODDATEDOCDSCR_CITATION__PRODUCERDOCDSCR_CITATION__TITLFILEDSCR_FILETXTFILEDSCR_FILETXT_RECGRPSTDYDSCR_CITATION_BIBLCITSTDYDSCR_CITATION_TITLSTMTSTDYDSCR_CITATION_VERSTMTSTDYDSCR_CITATION__AUTHENTYSTDYDSCR_CITATION__COPYRIGHTSTDYDSCR_CITATION__DISTRBTR
STDYDSCR_CITATION__FUNDAGSTDYDSCR_CITATION__GRANTNOSTDYDSCR_CITATION__PRODDATESTDYDSCR_CITATION__PRODUCERSTDYDSCR_CITATION__SOFTWARESTDYDSCR_METHOD__COLLMODESTDYDSCR_METHOD__DATACOLLECTORSTDYDSCR_METHOD__FREQUENCSTDYDSCR_METHOD__RESINSTRUSTDYDSCR_METHOD__SAMPPROCSTDYDSCR_METHOD__TIMEMETHSTDYDSCR_METHOD__WEIGHTSTDYDSCR_STDYINFO_ABSTRACTSTDYDSCR_STDYINFO__ANLYUNITSTDYDSCR_STDYINFO__COLLDATESTDYDSCR_STDYINFO__DATAKINDSTDYDSCR_STDYINFO__GEOGCOVERSTDYDSCR_STDYINFO__KEYWORDSTDYDSCR_STDYINFO__NATIONSTDYDSCR_STDYINFO__TIMEPRDSTDYDSCR_STDYINFO__TOPCCLASSTDYDSCR_STDYINFO__UNIVERSE
data _null_; file reader lrecl=1024 ; length vEdited $ 2000; set DDIfile.stdyDscr_citation_titlStmt; if _n_=1 then put '/*' / ' SAS program to read ' agency ' ' IDNo ;
stdyDscrTitl= compbl(tranwrd(translate(stdyDscrTitl, ' ', '09'x), '*/', '*_/')); put 'Study Title' _n_ ': ' stdyDscrTitl;
altTitl=compbl(tranwrd(translate(altTitl, ' ','09'x),'*/','*_/')); put ' ' altTitl;
Write a SAS Program – Metadata Comment
/* SAS program to read ICPSR 6084Study Title1 : CBS News Monthly Poll #2, August 1992 August National Poll II, Republican National Convention
Cntlin file for Formatsdata makeTheFormats; input fmtname $ 1-7 type $ 9-9 start $ 11-26 default 28-35 / label :&$512.;datalines;V00006f N 1 1YesV00006f N 3 1Converted Refusal
;run;proc format cntlin=makeTheFormats;run;
Input
Logic to produce different input statements for:• Fixed column data
• Delimited data
Output Multiple datasets if different record types
• Separate keep dataset options
• Logic to output to appropriate dataset
if left(_RecordSetIdentifier) eq left("2 ") then DO;
output ICPSR2825FAMILY ;
END;
if "&fileStructureType" eq "hierarchical" then do; put 'if left(_RecordSetIdentifier) eq left("' catValu '") then DO;' / " output " safeAgency +(-1) safeIDNo +(-1) rectype ';' / 'END;' //; end;
Some of the Limitations
DDI can describe nCubes, geographic coverage, variable groups• Current makeReader.sas can’t handle these
DDI definition includes recursive elements• E.g. recGrps within recGrps
• Current makeReader,sas would not find nested elements
Questions?
About the Speaker
Larry Hoyle
Associate Scientist
Policy Research Institute,
University of Kansas
1541 Lilac Lane
Lawrence, KS 66044-3177