Shortcuts to DDI: Markup automation tools and methods that will save you time and effort – and are fun to use!

Post on 19-Dec-2015


Page 1

Shortcuts to DDI

Markup automation tools and methods that will save you time and effort – and are fun to use!

Page 2

One rule of thumb:

Select and combine conversion strategies appropriate for your available sources / study documentation.

Page 3

Different sources will give you different parts of the DDI.

[Diagram: sources (spss, sas, stata; pdf; text codebook; XML; html; database; Excel; delimited text; osiris, marc, …) supply parts of the DDI (study info, categories, question text, locations, frequencies, variable groups)]

Process the different sources and assemble/merge the result.

Page 4

Most common study documentation “combo”:

Statistical package file(s)
Machine-readable codebook and/or questionnaire: ASCII or PDF

Example: ICPSR study no. 3356

Page 5

Step one: Convert statistical package file(s)

Programs:
1) XCONVERT – free to download at http://sda.berkeley.edu:7502/ddi/tools/, created by the SDA Project, CSM Program, UC Berkeley.
2) Nesstar’s Publisher – commercial software, see http://www.nesstar.com/products/publisher
3) Currently, SPSS and SAS are working on tools to export directly to DDI.

Page 6

XCONVERT converts to DDI (dds = data definition statements):

SPSS dds (syntax)
SAS dds (syntax)
Stata dds (.do + dictionary files)

Resulting DDI markup has no frequencies. Frequencies may be obtained only when converting to DDI with the SDATOXML program, available to SDA subscribers.
XCONVERT does NOT convert dds for hierarchical data files.

Page 7

Exercise 1: Convert Stata dds to DDI using XCONVERT

Download XCONVERT to the same folder where you have your Stata dds files.
In a text editor, combine the two Stata dds files (.do and .dct) into a single file and save it as .txt.
Conversion command (run in DOS):

xconvert -x stata -i inputfile -o outputfile.xml
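The "combine the two files" step can also be scripted. The following is a minimal sketch, not part of the original exercise: the function names are hypothetical, the order (.do first, then .dct) is an assumption the slide does not state, and the argument list simply mirrors the command shown above.

```python
def combine_dds(do_text, dct_text):
    """Concatenate the .do and .dct contents into one text blob.

    The .do-then-.dct order is an assumption; check what XCONVERT expects.
    """
    parts = [do_text, dct_text]
    # Make sure each part ends with a newline so the files do not run together.
    return "".join(p if p.endswith("\n") else p + "\n" for p in parts)


def xconvert_args(input_file, output_file):
    """Build the argument list for the conversion command from the slide."""
    return ["xconvert", "-x", "stata", "-i", input_file, "-o", output_file]
```

If xconvert is on the PATH, the list returned by xconvert_args could be handed to subprocess.run; that step is left out here because it depends on the local installation.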

Page 8

Nesstar Publisher converts to DDI:

SPSS dds (syntax) (merge in raw data file to obtain frequencies)
SPSS portable/export
SPSS system
Stata system (ex.: ICPSR study no. 3740)

DDI obtained from system/portable files will have no column locations.
Nesstar Publisher does NOT import dds for hierarchical data files.

Page 9

Exercise 2: Convert SPSS dds to DDI using Nesstar Publisher

Edit your SPSS dds: delete the comment box and any other additional lines down to DATA LIST.
Make your first line read: DATA LIST/
Remove the “comment out” star from the missing values section.
Save as .sps
Import into Nesstar Publisher using the “File-import” command.
Import the ASCII data file using the “Data-Insert data matrix from fixed format set” command.
Export DDI, or save in .NSDstat format for further additions.

Page 10

Step two: Convert PDF documentation to text format

Use xpdf (available from http://www.foolabs.com/)

Command type:

pdftotext -layout infilename outfilename

(Preservation of formatting is NOT guaranteed)

Page 11

Exercise 3: Convert PDF codebook to text format

Download the xpdf program to the same folder as your PDF codebook.
Conversion command (run in DOS):

pdftotext -layout infilename outfilename

(The -layout option increases the chances of preserving regular text format)

Page 12

Step three: Extract from codebook, and tag in DDI, question text and other relevant variable-level information

For codebooks with regular format, apply text-processing techniques – like macros or regular expression syntax – in a powerful text editor, like TextPad or emacs.
Make sure your final product is well-formed XML and DDI compliant!!!
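The well-formedness half of that check is easy to automate. A minimal sketch using only the Python standard library follows; the function name is hypothetical, and it does not test DDI compliance, which requires validating against the DDI DTD with a validating parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

A file that fails this check will also fail in XSLT processors, so it is worth running after every round of search-and-replace tagging.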

Page 13

TextPad

TextPad is a powerful plain text editor available from http://www.textpad.com
Cost: $16 - $29, depending on volume
Includes regular expressions search and replace and other nice features

Page 14

Regular Expressions

Regular expressions are a special syntax that describes patterns in a text. They appear as strings of ordinary characters which take on special meanings.

Page 15

Regular expressions: examples

.            any single character
[^a]         any character, except “a”
[0-9]        any single digit
[0-9]{2,4}   any sequence of min. 2 and max. 4 digits
^            beginning of line
$            end of line
+            one or more of the preceding character or expression
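The patterns above can be tried out in any regex engine. A short sketch in Python's re module follows; the sample codebook line is made up for illustration, and TextPad's and sed's dialects differ in small ways from Python's.

```python
import re

# Hypothetical codebook line: variable name, question text, column location.
line = "V12   Q7. How old are you?   cols 101-104"

# [0-9]{2,4}: runs of two to four digits (the lone "7" is too short).
digit_runs = re.findall(r"[0-9]{2,4}", line)

# ^ and +: a "V" at the beginning of the line followed by one or more digits.
var_name = re.match(r"^V[0-9]+", line).group(0)

# $: one or more digits at the end of the line.
last_number = re.search(r"[0-9]+$", line).group(0)

# [^a]: delete every character that is not "a".
only_a = re.sub(r"[^a]", "", "banana")

# . : "V" followed by any single character.
pairs = re.findall(r"V.", "V1 V2 VX")

print(digit_runs, var_name, last_number, only_a, pairs)
```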

Page 16

Exercise 4: Create DDI file containing variable names and question text

Open your .txt codebook in TextPad
Use regular expressions-based commands, and other TextPad special features, to:
- Delete unnecessary text
- Attach DDI tags to the appropriate sections of text
(Instructions provided)
Insert codebook beginning- and end-tags to create valid DDI.
Save as .xml
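The same delete-and-tag steps can be sketched as a script. This is an illustration under stated assumptions, not the exercise's instructions: the codebook layout (a variable name such as V1 at the start of a line, followed by the question text) is hypothetical, and the element names (codeBook, dataDscr, var, qstn, qstnLit) follow DDI 2.x as far as I know. The result is a well-formed fragment; a DTD-valid codebook needs further sections such as the study description.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed layout: "V<number>", whitespace, then the question text.
LINE = re.compile(r"^(V[0-9]+)\s+(.+)$")


def codebook_to_ddi(text):
    """Wrap each recognised codebook line in DDI variable markup."""
    out = ["<codeBook>", "<dataDscr>"]
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # "delete unnecessary text": skip anything unrecognised
        name, question = match.groups()
        out.append(
            f"<var name={quoteattr(name)}>"
            f"<qstn><qstnLit>{escape(question)}</qstnLit></qstn></var>"
        )
    out += ["</dataDscr>", "</codeBook>"]
    return "\n".join(out)
```

Escaping the question text matters: a stray "<" or "&" in a codebook is enough to make the output ill-formed.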

Page 17

Step three (continued): Create variable groups

Use Nesstar Publisher’s “Variable Groups” feature.
OR,
Use SDA’s VARGROUP script to produce DDI markup.
(A word of warning! If using SDA’s VARGROUP, replace commas with spaces in the DDI output file, as commas are not allowed in attributes!)

Page 18

Exercise 5: Create DDI markup for variable groups using SDA’s VARGROUP

Open your .txt codebook in TextPad.
Use regular expressions-based commands, and other special TextPad features, to produce the input file for the VARGROUP script (instructions provided).
Download the VARGROUP program to the same folder as your input file.
Conversion command (run in DOS): vargroup -i inputfile
In TextPad, replace commas with spaces in the DDI output file.
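The comma clean-up can be limited to the attribute values, so that commas in labels and other text are left alone. A minimal sketch follows; it assumes the variable list sits in a var="..." attribute on the group element, which is my reading of DDI 2.x rather than something the slides state, and the function name is hypothetical.

```python
import re

# Matches a var="..." attribute; "<var name=...>" elements do not match.
VAR_ATTR = re.compile(r'\bvar="([^"]*)"')


def commas_to_spaces(ddi_text):
    """Rewrite comma-separated variable lists as space-separated lists."""
    def fix(match):
        ids = re.split(r"[,\s]+", match.group(1).strip())
        return 'var="' + " ".join(i for i in ids if i) + '"'
    return VAR_ATTR.sub(fix, ddi_text)
```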

Page 19

Step four: Merge or combine DDI files to generate information-rich codebook

To combine (attach new sections): Use XML- or text-editing software to insert new sections in the appropriate sequence (but beware of producing invalid documents!).
To merge: Use Nesstar Publisher or XSLT.

Page 20

Nesstar Publisher’s merge feature

Will merge in:
Entire sections of the DDI.
Individual fields within each section.

Using this feature will enable you to write in newly added tags or overlay tags that already have content.

Key for merges is <var name="">
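To make the idea of a merge keyed on the variable name concrete, here is a small sketch with Python's standard library. It is not Nesstar Publisher's implementation; the function name and overlay rule (a child element from the second file replaces a same-named child in the first) are assumptions chosen to mirror the "write in or overlay" behaviour described above.

```python
import xml.etree.ElementTree as ET


def merge_vars(base_xml, extra_xml):
    """Merge child elements of matching <var name="..."> entries."""
    base = ET.fromstring(base_xml)
    extra = ET.fromstring(extra_xml)
    by_name = {v.get("name"): v for v in base.iter("var")}
    for new_var in extra.iter("var"):
        target = by_name.get(new_var.get("name"))
        if target is None:
            continue  # no variable with this name in the base file
        for child in list(new_var):
            existing = target.find(child.tag)
            if existing is not None:
                target.remove(existing)  # overlay a tag that already has content
            target.append(child)         # write in the new or replacement tag
    return ET.tostring(base, encoding="unicode")
```

Appending at the end ignores the element order a DTD may require, so the merged file should still be validated.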

Page 21

Exercise 6: Use Nesstar Publisher to merge DDI files documenting different parts of the same study

In Nesstar Publisher, open the saved .NSDstat file (reimporting the DDI will result in loss of frequencies).
Use the “Documentation – Import from DDI” command to merge in the Question Text file.
Use the same command to merge in an ICPSR catalog record covering Sections 2 (Study Description) and 3 (File Description) of the DDI.

Page 22

Review

Regular expressions are very powerful and worth your time to learn
XCONVERT can extract DDI variables and categories (but not frequencies)
Nesstar can work directly with statistical data files to extract frequencies
Nesstar can merge DDI information from different sources.

Page 23

Automation

Page 24

Automation: Approaches to Automation

– PROGRAMMING: Use a programming language such as java, C#, VB, perl, PHP, ColdFusion
– COCOON: Use an XML publishing framework such as Apache Cocoon (PLUG)
– UNIX: Adapt/reuse existing scripts using UNIX (Linux, Mac OS X)-based tools

Page 25

Automation Recommendations

Use UNIX to glue existing scripts together
Use XSLT
Use Cocoon or scripts to process XML
Code new functionality as necessary, with command-line wrappers

[Diagram labels: IN, OUT, DDI, Scripts, UNIX, XSLT, Cocoon]

Page 26

Survey of DDI and XML Tools

Tool | Platforms | Sources | Results | License*
SDA’s XCONVERT, VARGROUP | UNIX, Windows | Stat package files (SPSS, SAS, Stata) | DDI (no frequencies) | free
Oracle XML Developer’s Kit (XDK) | UNIX, Windows | XML, XSLT | any | free
DDI_DTD.cif | Blaise | Blaise | “xml” | free
MSXML 4.0 | Windows | XML, XSLT | any | free
GESIS spssoms2ddi | XSLT | SPSS OMS XML | DDI | GNU
HTML Tidy | UNIX, Windows | Badly formed html | xhtml | open

* Check licensing terms

Page 27

How do I use XSLT stylesheets?

Browser (IE and Mozilla)
Programming language (many libraries and APIs)
Server (Xalan, Xerces, xt, Saxon)
Apache Cocoon
Command line (Oracle XDK or MSXML 4.0)
Windows shortcut

Page 28

Automation Exercise 1

Apply an xslt stylesheet in various ways
Open the folder “xslt” and follow the instructions in “oraxsl lesson.txt”

Page 29

XSLT advantages

When the source is XML, XSLT can output to XML, text, pdf, even jpeg
This might be done directly, or possibly via an intermediate format and a conversion tool/library such as html2pdf, fop
Cocoon has a large number of such libraries built in
XSLT stylesheets can be reused in java, C#, perl, PHP, ColdFusion.
XSLT stylesheets are easier to modify if the xml changes or needs to be parsed differently

Page 30

XSLT drawbacks

Not in typical skillset: functional programming is different from OO and procedural
Memory hog: the entire document is loaded into memory and expanded
– Doc size/content ratio = 20+
– Solutions:
  Preprocess using streaming parser
  Allot more memory
  – java -Xms<min_size> -Xmx<max_size>
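The "preprocess using streaming parser" option can be sketched with Python's standard-library iterparse, which hands over elements as they are read. The function name and the choice of extracting variable names are illustrative assumptions; the point is only that each var element is discarded once it has been used, so memory use does not grow with every variable's content.

```python
import io
import xml.etree.ElementTree as ET


def extract_var_names(source):
    """Stream through a DDI file and collect the name of every <var>."""
    names = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "var":
            names.append(elem.get("name"))
            elem.clear()  # drop the element's children, text, and attributes
    return names


sample = io.BytesIO(
    b'<dataDscr><var name="V1"><labl>Age</labl></var><var name="V2"/></dataDscr>'
)
print(extract_var_names(sample))
```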

Page 31

A Survey of UNIX Tools

UNIX Text Processing Tools
– sed, awk, tr, cut, head, …
Pipes
– Allow the results of one command to be sent to another
UNIX batch commands
– ls, grep, xargs
UNIX scheduling
– cron

Page 32

Introduction to sed

Sed performs line-by-line substitutions using regular expressions

sed -f commandsfile sourcefile > destinationfile
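For readers who do not have sed at hand, the processing model (a list of substitution rules, applied in order to each line independently) can be imitated in a few lines of Python. This is only an illustration of the model, not a replacement for sed; the two rules and the "VAR" line layout are hypothetical.

```python
import re

# Each rule is (pattern, replacement), like one s/pattern/replacement/ command.
RULES = [
    (re.compile(r"[ \t]+$"), ""),                          # strip trailing blanks
    (re.compile(r"^VAR ([A-Z0-9]+)"), r'<var name="\1">'),  # tag variable names
]


def apply_rules(lines):
    """Apply every rule, in order, to each line separately."""
    result = []
    for line in lines:
        for pattern, replacement in RULES:
            line = pattern.sub(replacement, line)
        result.append(line)
    return result
```

Because each line is handled on its own, a pattern can never match text that spans a line break, which is the same limitation sed has by default.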

Page 33

Automation Exercise 2

We’ll use sed to duplicate the functionality of a textpad macro we created previously
Open the folder “sed” and follow the instructions in “sed lesson.txt”
WARNING 1: sed’s regular expressions are slightly different from textpad’s
WARNING 2: sed by default processes line-by-line

Sed is available on all unix systems. See “README_download_instructions” for windows machines

Page 34

Sources Review

[Diagram: sources (spss, sas, stata; pdf; text codebook; XML; html; database/Excel; delimited text; CAI, Blaise; osiris, marc, …) leading to DDI; tool shown: textpad]

The functionality of textpad on windows can be replaced by sed or awk on UNIX

Automation: Translating manual steps to automated steps

Page 35

Sources Review

[Diagram: sources (spss, sas, stata; pdf; text codebook; XML; html; database/Excel; delimited text; CAI, Blaise; osiris, marc, …) leading to DDI; tools shown: pdf2text, Textpad/sed, xconvert]

The functionality of textpad on windows can be replaced by sed or awk on UNIX

Page 36

Automation Exercise 3

Hooking things together with pipes (or files)
Open the folder “automate” and follow the instructions in “automate lesson.txt”
Batch processing with ls, sed, grep, and xargs

Page 37

Advice for Batch Processing

Use a consistent naming convention
Identify the driving files
Schedule using cron
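A consistent naming convention pays off because every other file name can then be derived from the driving file. The sketch below assumes a hypothetical convention, "<study>-codebook.pdf" as the driving file, with text and DDI outputs named after the same study number; none of these names come from the slides.

```python
SUFFIX = "-codebook.pdf"  # hypothetical convention for the driving files


def plan_jobs(filenames):
    """Derive the per-study input and output names from the driving files."""
    jobs = []
    for name in sorted(filenames):
        if not name.endswith(SUFFIX):
            continue  # not a driving file
        study = name[: -len(SUFFIX)]
        jobs.append({
            "study": study,
            "pdf": name,
            "text": f"{study}-codebook.txt",
            "ddi": f"{study}-ddi.xml",
        })
    return jobs
```

A scheduled job can then loop over the returned list and run the conversion steps for each study in turn.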

Page 38

Sources for Automation

Not every process is suited for automation
A process may be partially automated
Sources which are formatted in a regular manner are ideal for automation
– Database output
– Excel spreadsheets
– Delimited text
– Machine-generated output

Page 39

Make use of intermediate formats

Converting to an intermediate, regular format that already has scripts/tools written for it can simplify your work.
Candidates:
– Delimited text
– Xml
– Html
– Proprietary format (SDA’s DDL, SPSS’s __)

Page 40

Using the Intermediate Format Strategy: Example 1

GESIS spssoms2ddi is an example of using the intermediate format strategy

[Diagram: SPSS file → (study_oms.spss) → SPSS OMS XML → (spssoms2ddi stylesheet) → DDI]

This is an example of doing it the right way: SPSS outputs proper XML according to a schema

Page 41

Using the Intermediate Format Strategy: Example 2

XCONVERT does not output frequencies
SAS ODS command wrapper displays output as (badly formed) html tables

[Diagram: SAS → (ODS) → HTML frequencies → (HTML tidy) → xhtml → (xslt) → DDI; further labels: xslt, delimited text, sqlldr, Oracle]

Page 42

SAS ODS

SAS ODS is able to output its results as html instead of a .lst or .rtf file
Just wrap your run statement:

ods html file="result.htm";
/* your sas code … */
proc print data=new; run;
ods html close;

Page 43

SAS ODS HTML output

bad html – verbose, mismatched nesting
Show example
Xslt cannot be applied directly to this output
Use HTML tidy (open source) to clean this bad html before applying xslt style sheets

tidy options sourcefile > resultfile

HTML tidy is built into Apache Cocoon

Page 44

Automation Exercise 4

HTML Tidy allows you to deal with the badly formed xml/html that naturally occurs in the real world
Open the folder “tidy” and follow the instructions in “tidy lesson.txt”

Page 45

Sources

[Diagram: sources (spss, sas, stata; pdf; text codebook; XML; html; database/Excel; delimited text; CAI, Blaise; osiris, marc, …) leading to DDI; tools shown: pdf2text, sed, xconvert, oraxsl + stylesheet, ODS, HTML tidy]

Page 46

Database sources

Use intermediate formats such as xml or html
Some databases can output directly to “xml” or “html”, but delimited text is fine
Usually, the “xml” output needs to be cleaned by HTML tidy
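Delimited text is often the easiest database export to turn into markup. The sketch below assumes a tab-delimited export with a header row and two hypothetical columns, name and label; the var and labl element names follow DDI 2.x as far as I know.

```python
import csv
import io
from xml.sax.saxutils import escape, quoteattr


def delimited_to_vars(text, delimiter="\t"):
    """Turn rows with 'name' and 'label' columns into DDI <var> elements."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return "\n".join(
        f"<var name={quoteattr(row['name'])}>"
        f"<labl>{escape(row['label'])}</labl></var>"
        for row in reader
    )
```

Generating the markup with proper escaping avoids the clean-up pass that hand-built or database-generated "xml" usually needs.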

Page 47

Excel as an editing/automation tool

Excel can read/write delimited text
Excel can read html
Excel has macros
Excel rowset demo/exercise

Page 48

Sources

[Diagram: sources (spss, sas, stata; pdf; text codebook; XML; html; database/Excel; delimited text; CAI, Blaise; osiris, marc, …) leading to DDI; tools shown: pdf2text, sed, xconvert, oraxsl + stylesheet, ODS, HTML tidy]

Page 49

Sources & Destinations

[Diagram: sources (spss, sas, stata; pdf; text codebook; XML; html; database/Excel; delimited text; CAI, Blaise; osiris, marc, …) → DDI → XSLT → destinations (spss, sas, stata; pdf; text codebook; XML; html; database; Excel; delimited text; osiris, marc, …)]

Page 50

DDI to MARC

Sometimes, XSLT will only get you 99% of the way
MARC output requires control characters which are illegal in XML/XSLT
Strategy 1: output substitute characters, and then use tr or sed to replace them with the control characters

For a single study:

oraxsl 06084.xml 00.xsl temp1.xml
oraxsl temp1.xml 00.xsl temp2.xml
oraxsl temp2.xml 00.xsl temp3.xml
oraxsl temp3.xml 00.xsl temp4.txt
sed -f restoreIllChars.sed temp4.txt > 06084.marc

As a script, with the study number passed as $1:

oraxsl $1.xml 00.xsl temp1.xml
oraxsl temp1.xml 00.xsl temp2.xml
oraxsl temp2.xml 00.xsl temp3.xml
oraxsl temp3.xml 00.xsl temp4.txt
sed -f restoreIllChars.sed temp4.txt > $1.marc
rm -f temp?.xml temp4.txt
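The final substitution step can be pictured as a simple character translation. The sketch below is not the contents of restoreIllChars.sed, which the slides do not show; the three placeholder characters are hypothetical. The target values are the MARC 21 subfield delimiter (hex 1F), field terminator (hex 1E), and record terminator (hex 1D).

```python
# Hypothetical placeholders emitted by the stylesheet, mapped to MARC controls.
RESTORE = str.maketrans({
    "\u00a4": "\x1f",  # placeholder -> subfield delimiter
    "\u00b6": "\x1e",  # placeholder -> field terminator
    "\u00a7": "\x1d",  # placeholder -> record terminator
})


def restore_control_chars(text):
    """Replace the placeholder characters with the MARC control characters."""
    return text.translate(RESTORE)
```

The placeholders must be characters that never occur in the real data, otherwise legitimate text would be turned into delimiters.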

Page 51

DDI to MARC

Revised strategy: after working with MARC for a while, we decided that we could make use of existing utilities
– 1. convert DDI to marcxml (with xslt stylesheet written at icpsr) using oraxsl
– 2. convert marcxml to marc21 using marc4j
Marc4j and other marc utilities are available at http://www.loc.gov/marc/marctools.html

Page 52

Contact info

Sanda Ionescu
– [email protected] (icpsr.umich.edu)
I-Lin Kuo (until Aug 18)
– [email protected] (icpsr.umich.edu)