The Challenge of Managing Access to New
and Novel Forms of Data: An Application of
UDC
Suzanne Barbalet
Discovery Team
UK Data Service, University of Essex
Nathan Cunningham
Associate Director Big Data,
UK Data Service, University of Essex
International UDC Seminar 2017,
London, 14-15 September.
St Peters Square, Rome.
A working definition of Big Data
Defining New and Novel Forms of Data
OECD (2013) report, New Data for Understanding the
Human Condition:
Category A: Data stemming from the transactions of government, for example, tax and social
security systems.
Category B: Data describing official registration or licensing requirements.
Category C: Commercial transactions made by individuals and organisations.
Category D: Internet data, deriving from search and social networking activities.
Category E: Tracking data, monitoring the movement of individuals or physical objects subject to
movement by humans.
Category F: Image data, particularly aerial and satellite images but including land-based video
images.
Challenge 1: Managing NNfD
• The literature currently suggests that some of this data is
unmanageable.
OECD Global Science Forum (2013) New Data for Understanding the Human Condition
Challenge 2: Topic Searches on the Web
• This is an acknowledged challenge in the literature, and no
less so for the discovery of data.
Tay, A. (2016) Managing Volume in Discovery Systems
Jacob, E. K. (2004) Classification and Categorization: A Difference that Makes a Difference
• Our solution is interactive topic access
Our approach
• The ambiguity in Big Data
(NNfD) has challenged us to
change our approach for the
nascent and unstructured
nature of NNfD.
• We can’t do rigorous
classification but we can
determine its topics
• We want to look at NNfD
through the same lens as
we use for our curated
collection.
What is Different about Metadata for Data?
An index describes the content of an information source
but when the information source:
• is a measurement, not an idea, i.e. in its raw state it has
numeric form
• can be dynamic, i.e. social science research captures
attitudes and behaviour in a time frame
then the construction of an index is a complex task…
Data is a Special Type of Digital Resource
• the content of ‘studies’ will not necessarily relate
semantically to the title of the study without knowledge of
the research indicators and research question – not easy
to index
• an indexer asks what is being measured + what might
be a useful measurement i.e. both variables and
research concepts are indexed
BUT
studies are not difficult to classify… clear subject
information is the general rule … for example…
For example:
Title: Understanding the Importance of Work Histories in Determining Poverty
in Old Age: Variables Derived from the English Longitudinal Study of Ageing,
2002-2007
Abstract: This study found that for the most part life-course events as
measured here are not strongly associated with the chances of being on a low
income in retirement.
Topics covered include length of time spent in paid work and in
marriage, the timing of retirement, the number of children and timing
of childbirth, and whether ill-health as an adult or as a child had been
experienced. The modelling looks at the influence of these factors
alongside a range of other characteristics such as social class and
educational attainment.
Subject
Variable concepts are complex
Where does UDC fit in?
• HASSET Thesaurus keywords + a classification code for
each study in our collection allow us to address both
problems:
• known item retrieval
• topic search [See Tay, A. (2016) Managing Volume in Discovery
Systems]
• With legacy classification complete we can now create
bespoke subject categories interactively
AND
• Retrieve specific studies to create automatically
generated subject metadata
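A minimal sketch of how a bespoke subject category might be assembled from a classification code plus thesaurus keywords. The `Study` record, the `bespoke_category` helper, the study identifiers and the UDC notations shown are illustrative assumptions, not the UK Data Service implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Study:
    study_id: str
    title: str
    udc: str                                   # illustrative UDC notation
    keywords: set = field(default_factory=set)  # thesaurus keywords


def bespoke_category(studies, udc_prefixes, any_keywords):
    """Return studies whose UDC code starts with one of the prefixes
    and which carry at least one of the requested keywords."""
    wanted = {k.upper() for k in any_keywords}
    prefixes = tuple(udc_prefixes)
    return [
        s for s in studies
        if s.udc.startswith(prefixes) and (s.keywords & wanted)
    ]


# Hypothetical records, for illustration only.
studies = [
    Study("S1", "Diet survey", "613.2", {"FOOD", "NUTRIENTS", "INCOME"}),
    Study("S2", "Labour force survey", "331.5", {"EMPLOYMENT", "INCOME"}),
    Study("S3", "Household food purchases", "641", {"FOOD", "CONSUMPTION"}),
]

diet = bespoke_category(studies, ["613.2", "641"], ["food"])
print([s.study_id for s in diet])
```

The classification code narrows the collection to a subject area; the keyword filter then picks out the studies relevant to the topic being explored.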
Transactional or Dynamic Data – can it be
managed?
• How we curate data for the UK Data Service:
• We follow a policy of:
• authenticity
• reliability
• logical integrity
• NNfD has to meet our rigorous standards
THUS
• NNfD does require linking with other data sources to ensure the
provenance, reliability and integrity of the data
• Interactive topic access bridges the divide between NNfD and
other forms of social and economic data in our collection.
NNfD can provide valuable research data
• It is important that NNfD transactional data is
accessed by the same set of thesaurus keywords as our
traditional data.
• Thesaurus keywords allocated to studies in the bespoke
category are mined to derive metadata for a particular
NNfD
• Our algorithm here will exclude:
• Keywords for demographic variables (age, gender etc) and
geography keywords
• Outliers are removed
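The mining step above can be sketched roughly as follows. The slides do not define what counts as an outlier, so this sketch assumes an outlier is a keyword attached to fewer than `min_studies` studies in the bespoke category. The exclusion lists, the `derive_keywords` name and the sample data are likewise hypothetical.

```python
from collections import Counter

# Assumed exclusion lists; the real lists are not given in the slides.
DEMOGRAPHIC = {"AGE", "GENDER"}
GEOGRAPHY = {"ENGLAND", "SCOTLAND"}


def derive_keywords(study_keywords, min_studies=2):
    """Count how many studies carry each keyword, skipping demographic
    and geography keywords, then drop rarely used keywords (assumed
    here to be the outliers)."""
    counts = Counter()
    for keywords in study_keywords.values():
        # set() ensures a keyword is counted once per study
        counts.update(set(keywords) - DEMOGRAPHIC - GEOGRAPHY)
    kept = {k: n for k, n in counts.items() if n >= min_studies}
    # Most frequent first, ties broken alphabetically
    return sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical bespoke category: study id -> thesaurus keywords.
category = {
    "S1": ["FOOD", "FRUIT", "AGE"],
    "S2": ["FOOD", "FRUIT", "ENGLAND"],
    "S3": ["FOOD", "IRON", "GENDER"],
}

derived = derive_keywords(category)
print(derived)
```

The resulting keyword list is what would be attached to a particular NNfD source as its subject metadata.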
NNfD and Social Science Research
• NNfD need not mean “The End of Theory”
Anderson, C. (2008) “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”
• NNfD offers researchers data that is not subject to:
• the bias of survey data:
• interviewer bias
• sampling bias
• response bias
Case Study: Diet and Nutrition
• Transactional data will include such topics as dietary
requirements.
• It is an important topic with considerable policy and
research implications.
• We need to be able to link our metadata across to these
NNfD from our aggregated and survey data to verify and
supplement the primary (NNfD transactional) source.
Case Study: Diet and Nutrition
• Collection classified by UDC and indexed with HASSET
keywords.
• This allows us to apply interactive topic access.
• Because both share the same underpinning RDF SKOS (graph)
methodology, we can employ big data technologies
such as NoSQL tools and graph-based technologies.
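As a rough illustration of why a shared SKOS graph helps, the sketch below expands a topic to all of its narrower concepts by walking `skos:narrower` links. It uses plain Python tuples rather than a graph store or RDF library, and the small hierarchy is a hypothetical example, not the actual HASSET structure.

```python
from collections import defaultdict, deque

NARROWER = "skos:narrower"

# Hypothetical (subject, predicate, object) triples.
triples = [
    ("FOOD", NARROWER, "DAIRY PRODUCTS"),
    ("FOOD", NARROWER, "FRUIT"),
    ("DAIRY PRODUCTS", NARROWER, "MILK"),
    ("DAIRY PRODUCTS", NARROWER, "CHEESE"),
    ("FRUIT", NARROWER, "SOFT FRUIT"),
]


def expand_topic(triples, concept):
    """Return the concept plus every concept reachable from it
    through skos:narrower links (breadth-first traversal)."""
    children = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == NARROWER:
            children[subject].append(obj)
    seen = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


topic_terms = expand_topic(triples, "DAIRY PRODUCTS")
print(sorted(topic_terms))
```

The same expanded term set could then be matched against keywords on both curated studies and NNfD sources.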
Case Study: Diet and Nutrition
Row Labels / Sum of Survey Count
ARTIFICIAL SWEETENERS 8
BEVERAGES 17
BODY CIRCUMFERENCE MEASUREMENTS 2
BUTTER 4
CARBOHYDRATES 8
CEREAL PRODUCTS 15
CEREALS 7
CHEESE 6
CHILD NUTRITION 15
CHILD OBESITY 3
CLINICAL TESTS AND MEASUREMENTS 9
COFFEE (BEVERAGE) 8
CONFECTIONERY 17
CONSUMERS 9
CONSUMPTION 14
COOKING 23
DAIRY PRODUCTS 15
DIET (LIFESTYLE) 5
DIET AND EXERCISE 26
DIETARY FIBRE 6
EATING DISORDERS 1
ECONOMIC ACTIVITY 20
EDIBLE FATS 19
EMPLOYMENT 17
ETHNIC GROUPS 19
FISH (AS FOOD) 13
FOOD 41
FOOD ADDITIVES 2
FOOD SUPPLEMENTS 12
FROZEN FOODS 6
FRUIT 23
HEALTH 24
HEALTH FOODS 10
HEIGHT (PHYSIOLOGY) 24
INCOME 13
IRON 2
LEGUMES 7
MALNUTRITION 2
MEALS 17
MEAT 17
MEDICAL DIETS 2
MILK 20
MINERALS 4
NUTRIENTS 15
NUTS 8
OBESITY 4
ORGANIC FOODS 9
PACKETED FOODS 3
POTATOES 5
POULTRY 2
PRESERVED FOODS 5
PROTEINS 8
SALT 13
SAVOURY SNACKS 8
SLIMMING DIETS 5
SOFT DRINKS 7
SOFT FRUIT 1
SPECIAL DIETS 6
SUGAR 19
TEA 8
TINNED FOODS 10
VEGETABLE OILS 4
VEGETABLES 22
VEGETARIANISM 10
VITAMINS 14
WEIGHT (PHYSIOLOGY) 21
WEIGHT CONTROL 4
Case Study: Diet and Nutrition
• Transactional Data Sources include:
• Apps for fitness
• Diet diaries online
• Telemetry analytics from phones indicating activity such
as walking, cycling, flying etc.
• Supermarket loyalty card
• Credit/Debit card administration records
• Dietary preference from airline booking, shopping habits,
conference booking etc
• Subscription transaction to dieting clubs, societies etc
Graph Store Approach
Workflow for NNfD
Reference Architecture of DSaaP
• Open source because we can have meaningful common
conversations with the community
• Hadoop is…..
Implementation Architecture of DSaaP
[Diagram: Deposit, Discovery, Information, Access, Semantic, Data and Preservation Platforms; Services; Repository; Security; Consumers and Producers; Support and Maintenance]
Five Safes for Securing Data
DSaaP Hybrid Service Instances
[Diagram: Common Service Authentication (Kerberos); one AWS instance; two on-premise instances]
Operationalising 5 Safes at Scale
Conclusions
• NNfD challenges us to generate metadata innovatively
and each form of NNfD poses separate management
challenges
• Data and metadata are intrinsically related, and this
method allows us to use that relationship through a
common RDF graph approach.
• Transactional or dynamic data calls for a dynamic
approach to the generation of metadata
…And the future
• provide a simple but powerful approach to
adding structured metadata to NNfD so
that we can scale up as our collection
grows.
• support a research community which
includes the international scientific
community, commercial users of big data
and citizens
• be a trusted source of information on the use
of new and novel forms of data in
developing impactful research
Thank you!
Questions
Suzanne Barbalet
Discovery Team, UK Data Service
Nathan Cunningham
http://ukdataservice.ac.uk/about-us/our-rd/big-data-network-support