dental council of india
TRANSCRIPT
-
8/13/2019 Dental Council of India
1/8
Data format translation and migration
Future possibilities
Alasdair Crockett, Data Standards Manager
UK Data Archive
-
Past problems and future solutions
Past/existing problems: skeletons in the back catalogue. The UKDA and other long-standing archives have old studies in column binary or other legacy formats that are difficult, time-consuming and occasionally practically impossible to process/migrate.
Future solutions: to ensure that we don't store up similar problems (with vastly increased amounts of data) in 20 or 30 years' time.
This talk covers future solutions.
-
When does data format translation occur?
To enable data processing (validation, etc.)
From ingest format to processing format (this being SPSS in the case of the UK Data Archive)
To ensure long-term preservation
From processing format to preservation format(s), these being SPSS portable and tab-delimited text (with data dictionary) in the case of the UK Data Archive
To achieve user-friendly dissemination
From preservation or processing format to the dissemination format of the user's choice, e.g. STATA, SAS or EXCEL, in addition to the ubiquitous SPSS
Migration: when previously mainstream formats become obscure or new formats are requested by users
-
What are the potential problems of data format conversion?
At time of processing:
Rounding/truncation of numeric data
Truncation of textual data
Differences in handling internal metadata (differential label lengths, missing value handling, etc.)
Corruption of specially formatted variables (especially date/time variables)
Embedded special characters (line feeds, carriage returns, tabs, etc.)
Migration:
All the above, plus added problems with out-of-date, unfamiliar and/or inaccessible formats (e.g. column binary)
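Several of the processing-time problems listed above can be caught with a simple round-trip check. The following is a minimal Python sketch, not the Archive's actual procedure: the function names, the `%.2f` numeric format and the optional text width are all assumptions made for illustration. It flags text values containing tabs, line feeds or carriage returns, then writes the rows to tab-delimited text, reads them back and reports any value that changed.

```python
import csv
import io

SPECIAL_CHARS = ("\t", "\n", "\r")


def find_embedded_specials(rows):
    """Return (row_index, column_name) pairs whose text holds a tab, LF or CR."""
    hits = []
    for i, row in enumerate(rows):
        for name, value in row.items():
            if isinstance(value, str) and any(c in value for c in SPECIAL_CHARS):
                hits.append((i, name))
    return hits


def round_trip_mismatches(rows, fieldnames, float_format="%.2f", max_text=None):
    """Export rows to tab-delimited text, re-import, and list changed values.

    float_format and max_text stand in for the numeric precision and string
    width limits of a hypothetical target format.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(fieldnames)
    for row in rows:
        out = []
        for name in fieldnames:
            value = row[name]
            if isinstance(value, float):
                out.append(float_format % value)  # possible rounding
            else:
                text = str(value)
                out.append(text if max_text is None else text[:max_text])
        writer.writerow(out)

    buf.seek(0)
    reader = csv.DictReader(buf, delimiter="\t")
    mismatches = []
    for i, (original_row, restored_row) in enumerate(zip(rows, reader)):
        for name in fieldnames:
            original = original_row[name]
            restored = restored_row[name]
            if isinstance(original, float):
                same = float(restored) == original
            else:
                same = restored == str(original)
            if not same:
                mismatches.append((i, name, original, restored))
    return mismatches
```

Running `round_trip_mismatches` on a dataset before accepting a converted copy gives a list of the affected cells, which is easier to review than comparing files by eye. Date/time variables and label metadata would need their own checks, which this sketch does not attempt.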
-
The Data Curation Initiative: An XML standard and conversion utilities for survey data
The Data Curation Initiative (DCI) consists of:
XML Standard:
Open standard for sharing and preserving datasets
Implemented as an XML Schema
Stores all attributes of a survey dataset: labels, missing value definitions, variable level notes, etc.
Conversion software:
From proprietary formats to DCI (with no data loss)
From DCI to proprietary formats (text file + command file)
File and variable level metadata import/export to DDI XML schema
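To make the idea of storing variable-level attributes in XML concrete, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. The element and attribute names (`variable`, `label`, `valueLabel`, `missingValue`, `note`) are hypothetical and chosen only for illustration; they are not taken from the DCI schema, which this transcript does not show.

```python
import xml.etree.ElementTree as ET


def variable_to_xml(name, label, value_labels, missing_values, note=None):
    """Build an XML element for one variable (hypothetical element names)."""
    var = ET.Element("variable", {"name": name})
    ET.SubElement(var, "label").text = label
    for code, text in value_labels.items():
        ET.SubElement(var, "valueLabel", {"code": str(code)}).text = text
    for code in missing_values:
        ET.SubElement(var, "missingValue", {"code": str(code)})
    if note is not None:
        ET.SubElement(var, "note").text = note
    return var


def xml_to_variable(var):
    """Read the same attributes back into a plain dictionary.

    Codes come back as strings, because XML attributes are untyped text.
    """
    return {
        "name": var.get("name"),
        "label": var.findtext("label"),
        "value_labels": {v.get("code"): v.text for v in var.findall("valueLabel")},
        "missing_values": [m.get("code") for m in var.findall("missingValue")],
        "note": var.findtext("note"),
    }


# Usage: serialise one variable and parse it back.
element = variable_to_xml(
    "sex", "Sex of respondent", {1: "Male", 2: "Female"}, [9], note="Asked of all"
)
restored = xml_to_variable(ET.fromstring(ET.tostring(element)))
```

Because labels and notes are stored as element text of unrestricted length, a representation along these lines avoids the label-length limits that differ between statistical packages.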
-
Migration strategy
An approach such as the Data Curation Initiative allows either:
Traditional migration strategies: systematic migration of the whole collection on the preservation server
Migration on request: the preservation version remains the same, but on-the-fly export utilities are updated to cater for new versions/formats as they become popular
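The "migration on request" option can be pictured as a registry of export functions that sits in front of an unchanging preservation copy. The Python sketch below is an illustration under assumptions, not the DCI software: the dataset layout (a dict with `variables` and `rows`), the `exporter` decorator and the two example formats are all invented for the example.

```python
import csv
import io

# Registry of on-the-fly exporters, keyed by format name (hypothetical design).
EXPORTERS = {}


def exporter(format_name):
    """Decorator that registers an export function under format_name."""
    def register(func):
        EXPORTERS[format_name] = func
        return func
    return register


@exporter("tab")
def to_tab(dataset):
    """Export as tab-delimited text with a header row."""
    header = "\t".join(dataset["variables"])
    lines = ["\t".join(str(v) for v in row) for row in dataset["rows"]]
    return "\n".join([header] + lines)


@exporter("csv")
def to_csv(dataset):
    """Export as comma-separated text using the csv module for quoting."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(dataset["variables"])
    writer.writerows(dataset["rows"])
    return buf.getvalue()


def export_on_request(dataset, format_name):
    """Produce a dissemination copy; the preservation dataset is not modified."""
    try:
        func = EXPORTERS[format_name]
    except KeyError:
        raise ValueError(f"no exporter registered for {format_name!r}")
    return func(dataset)
```

Supporting a newly popular format then means registering one more function, while the stored preservation version stays as it is. A systematic migration, by contrast, would rewrite every stored dataset.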
-
Doesn't the DDI do this?
Not so far
Could build onto the DDI; in any case the DCI will populate the variable level of the DDI
Some advantages to keeping data and metadata separate:
Single XML file could become enormous and slow to parse
Allows communities who don't use the DDI to use the DCI (and vice versa)