The Challenge of Managing Access to New and Novel Forms of Data: An Application of UDC Suzanne Barbalet Discovery Team UK Data Service, University of Essex Nathan Cunningham Associate Director Big Data, UK Data Service, University of Essex International UDC Seminar 2017, London, 14-15 September.


Page 1 (source slides: seminar.udcc.org/2017/files/SBarbalet_NCunningham_UDCSeminar2017_slides.pdf)

The Challenge of Managing Access to New and Novel Forms of Data: An Application of UDC

Suzanne Barbalet

Discovery Team

UK Data Service, University of Essex

Nathan Cunningham

Associate Director Big Data,

UK Data Service, University of Essex

International UDC Seminar 2017,

London, 14-15 September.

Page 2

St Peter's Square, Rome.

Page 3

A working definition of Big Data

Page 4

Defining New and Novel Forms of Data

(2013) OECD report on New Data for Understanding the Human Condition:

Category A: Data stemming from the transactions of government, for example, tax and social security systems.

Category B: Data describing official registration or licensing requirements.

Category C: Commercial transactions made by individuals and organisations.

Category D: Internet data, deriving from search and social networking activities.

Category E: Tracking data, monitoring the movement of individuals or physical objects subject to movement by humans.

Category F: Image data, particularly aerial and satellite images but including land-based video images.

Page 5

Challenge 1: Managing NNfD

• The literature currently suggests that some of this data is unmanageable.

OECD Global Science Forum (2013) New Data for Understanding the Human Condition

http://www.biready.com.au/index_htm_files/44553.jpg

Page 6

Challenge 2: Topic Searches on the Web

• Topic search is an acknowledged challenge in the literature, and no less so for the discovery of data.

Tay, A. (2016) Managing Volume in Discovery Systems

Jacob, E. K. (2004) Classification and Categorization: A Difference that Makes a Difference

• Our solution is interactive topic access

Page 7

Our approach

• The ambiguity of Big Data (NNfD) has challenged us to adapt our approach to the nascent and unstructured nature of NNfD.

• We cannot classify NNfD rigorously, but we can determine its topics.

• We want to look at these NNfD through the same lens that we use for our curated collection.

Page 8

What is Different about Metadata for Data?

An index describes the content of an information source, but when the information source:

• is a measurement, not an idea, i.e. in its raw state it has numeric form

• can be dynamic, i.e. social science research captures attitudes and behaviour in a time frame

then the construction of an index is a complex task…

Page 9

Data is a Special Type of Digital Resource

• the content of ‘studies’ will not necessarily relate semantically to the title of the study without knowledge of the research indicators and research question, so studies are not easy to index

• an indexer asks what is being measured + what might be a useful measurement, i.e. both variables and research concepts are indexed

BUT

studies are not difficult to classify… clear subject information is the general rule… for example…

Page 10

For example:

Title: Understanding the Importance of Work Histories in Determining Poverty in Old Age: Variables Derived from the English Longitudinal Study of Ageing, 2002-2007

Abstract: This study found that for the most part life-course events as measured here are not strongly associated with the chances of being on a low income in retirement.

Topics covered include length of time spent in paid work and in marriage, the timing of retirement, the number of children and timing of childbirth, and whether ill-health as an adult or as a child had been experienced. The modelling looks at the influence of these factors alongside a range of other characteristics such as social class and educational attainment.

Subject

Variable concepts are complex

Page 11

Where does UDC fit in?

• HASSET Thesaurus keywords + a classification code for each study in our collection allow us to address both problems:

    • known item retrieval

    • topic search [see Tay, A. (2016) Managing Volume in Discovery Systems]

• With legacy classification complete we can now create bespoke subject categories interactively

AND

• retrieve specific studies to create automatically generated subject metadata
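
As a rough illustration of how a classification code plus thesaurus keywords can serve both known item retrieval and topic search, the following minimal sketch builds a bespoke category from a tiny in-memory catalogue. Everything in it is an assumption made for illustration: the `Study` record, the function names, the study identifiers, the UDC-style notations and the keyword assignments are hypothetical and are not taken from the UK Data Service catalogue or its software.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Study:
    """Hypothetical catalogue record: one classification code plus keywords."""
    study_id: str
    udc: str  # UDC-style notation (illustrative value only)
    keywords: frozenset = field(default_factory=frozenset)


def known_item(studies, study_id):
    """Known item retrieval: return the study with this identifier, or None."""
    for study in studies:
        if study.study_id == study_id:
            return study
    return None


def bespoke_category(studies, udc_prefix=None, any_keywords=()):
    """Topic search: keep studies whose code starts with udc_prefix and
    which carry at least one of the requested keywords."""
    wanted = {kw.upper() for kw in any_keywords}
    selected = []
    for study in studies:
        if udc_prefix is not None and not study.udc.startswith(udc_prefix):
            continue
        if wanted and not (wanted & study.keywords):
            continue
        selected.append(study)
    return selected


# Illustrative catalogue; identifiers, codes and keywords are made up.
CATALOGUE = [
    Study("S1", "613.2", frozenset({"FOOD", "NUTRIENTS"})),
    Study("S2", "331.5", frozenset({"EMPLOYMENT", "INCOME"})),
    Study("S3", "613.71", frozenset({"DIET AND EXERCISE", "HEALTH"})),
    Study("S4", "613.2", frozenset({"VITAMINS"})),
]

diet_category = bespoke_category(
    CATALOGUE, udc_prefix="613", any_keywords=["food", "diet and exercise"]
)
```

The class prefix narrows the collection to a broad subject area and the keywords refine it, so the same two pieces of metadata can be recombined interactively for each new category.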

Page 12

Transactional or Dynamic Data – can it be managed?

• How we curate data for the UK Data Service:

    • We follow a policy of:

        • authenticity

        • reliability

        • logical integrity

    • NNfD has to meet our rigorous standards

THUS

• NNfD does require linking with other data sources to ensure the provenance, reliability and integrity of the data

• Interactive topic access bridges the divide between NNfD and other forms of social and economic data in our collection.

Page 13

NNfD can provide valuable research data

• It is important that this transactional NNfD is accessed by the same set of thesaurus keywords as our traditional data.

• Thesaurus keywords allocated to studies in the bespoke category are mined to derive metadata for a particular NNfD

• Our algorithm here will exclude:

    • keywords for demographic variables (age, gender etc.) and geography keywords

    • outliers
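
The mining step described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' algorithm: the slide does not say how outliers are defined, so the sketch assumes an outlier is a keyword attached to fewer than `min_count` studies, and the two stop-lists are hypothetical stand-ins for the real demographic and geography keyword sets.

```python
from collections import Counter

# Hypothetical stop-lists; the real demographic and geography
# keyword sets are not given in the slides.
DEMOGRAPHIC_KEYWORDS = frozenset({"AGE", "GENDER"})
GEOGRAPHY_KEYWORDS = frozenset({"ENGLAND", "LONDON"})


def derive_keywords(study_keywords, min_count=2):
    """Count how many studies carry each keyword, then drop excluded
    keywords and (assumed definition) outliers below min_count."""
    counts = Counter()
    for keywords in study_keywords:
        counts.update(set(keywords))  # count each keyword once per study
    excluded = DEMOGRAPHIC_KEYWORDS | GEOGRAPHY_KEYWORDS
    return {
        keyword: n
        for keyword, n in counts.items()
        if keyword not in excluded and n >= min_count
    }


# Keywords of three made-up studies in a bespoke category.
category = [
    ["FOOD", "AGE", "FRUIT"],
    ["FOOD", "ENGLAND", "FRUIT"],
    ["FOOD", "GENDER", "SOFT FRUIT"],
]
derived = derive_keywords(category)
```

The surviving keywords and their counts are the candidate subject metadata for the NNfD; a different outlier rule (for example, a statistical cut-off) would slot into the same final filter.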

Page 14

NNfD and Social Science Research

• NNfD need not mean “The End of Theory”

Anderson, C. (2008) “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”

• NNfD offers researchers data that is not subject to:

    • the bias of survey data:

        • interviewer bias

        • sampling bias

        • response bias

Page 15

Case Study: Diet and Nutrition

• Transactional data will include such topics as dietary requirements.

• It is an important topic with considerable policy and research implications.

• We need to be able to link our metadata across to these NNfD from our aggregated and survey data to verify and supplement the primary (NNfD transactional) source.

Page 16

Case Study: Diet and Nutrition

• The collection is classified by UDC and indexed with HASSET keywords.

• This allows us to apply interactive topic access.

• Because both are underpinned by the same RDF SKOS (graph) methodology, we can employ big data technologies such as NoSQL tools and graph-based technologies.
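
To make the shared graph idea concrete, here is a small sketch that holds SKOS-style triples in a plain Python set and walks the `skos:broader` links to find every resource indexed under a topic. It is an illustration only: the `http://example.org/` URIs, the concept hierarchy and the two resources are hypothetical, and a real deployment would presumably use a dedicated RDF or graph store rather than an in-memory set.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
DCT = "http://purl.org/dc/terms/"
EX = "http://example.org/"  # hypothetical namespace for this sketch

# (subject, predicate, object) triples; concepts and resources are made up.
TRIPLES = {
    (EX + "concept/food", SKOS + "prefLabel", "FOOD"),
    (EX + "concept/fruit", SKOS + "prefLabel", "FRUIT"),
    (EX + "concept/fruit", SKOS + "broader", EX + "concept/food"),
    (EX + "study/survey-1", DCT + "subject", EX + "concept/fruit"),
    (EX + "nnfd/loyalty-card-1", DCT + "subject", EX + "concept/food"),
}


def narrower_closure(graph, concept):
    """Return the concept plus every concept below it via skos:broader."""
    found = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if p == SKOS + "broader" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def resources_about(graph, concept):
    """List resources whose dcterms:subject is the concept or a narrower one."""
    concepts = narrower_closure(graph, concept)
    return sorted(s for s, p, o in graph if p == DCT + "subject" and o in concepts)


food_resources = resources_about(TRIPLES, EX + "concept/food")
```

Because a curated survey and a transactional source point at concepts in the same graph, one traversal returns both, which is the bridging effect that interactive topic access relies on.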

Page 17

Case Study: Diet and Nutrition

Row Labels                         Sum of Survey Count
ARTIFICIAL SWEETENERS              8
BEVERAGES                          17
BODY CIRCUMFERENCE MEASUREMENTS    2
BUTTER                             4
CARBOHYDRATES                      8
CEREAL PRODUCTS                    15
CEREALS                            7
CHEESE                             6
CHILD NUTRITION                    15
CHILD OBESITY                      3
CLINICAL TESTS AND MEASUREMENTS    9
COFFEE (BEVERAGE)                  8
CONFECTIONERY                      17
CONSUMERS                          9
CONSUMPTION                        14
COOKING                            23
DAIRY PRODUCTS                     15
DIET (LIFESTYLE)                   5
DIET AND EXERCISE                  26
DIETARY FIBRE                      6
EATING DISORDERS                   1
ECONOMIC ACTIVITY                  20
EDIBLE FATS                        19
EMPLOYMENT                         17
ETHNIC GROUPS                      19
FISH (AS FOOD)                     13
FOOD                               41
FOOD ADDITIVES                     2
FOOD SUPPLEMENTS                   12
FROZEN FOODS                       6
FRUIT                              23
HEALTH                             24
HEALTH FOODS                       10
HEIGHT (PHYSIOLOGY)                24
INCOME                             13
IRON                               2
LEGUMES                            7
MALNUTRITION                       2
MEALS                              17
MEAT                               17
MEDICAL DIETS                      2
MILK                               20
MINERALS                           4
NUTRIENTS                          15
NUTS                               8
OBESITY                            4
ORGANIC FOODS                      9
PACKETED FOODS                     3
POTATOES                           5
POULTRY                            2
PRESERVED FOODS                    5
PROTEINS                           8
SALT                               13
SAVOURY SNACKS                     8
SLIMMING DIETS                     5
SOFT DRINKS                        7
SOFT FRUIT                         1
SPECIAL DIETS                      6
SUGAR                              19
TEA                                8
TINNED FOODS                       10
VEGETABLE OILS                     4
VEGETABLES                         22
VEGETARIANISM                      10
VITAMINS                           14
WEIGHT (PHYSIOLOGY)                21
WEIGHT CONTROL                     4

Page 18

Case Study: Diet and Nutrition

Page 19

Case Study: Diet and Nutrition

• Transactional data sources include:

    • Apps for fitness

    • Diet diaries online

    • Telemetry analytics from phones indicating activity such as walking, cycling, flying etc.

    • Supermarket loyalty cards

    • Credit/debit card administration records

    • Dietary preferences from airline bookings, shopping habits, conference bookings etc.

    • Subscription transactions to dieting clubs, societies etc.

Page 20

Graph Store Approach

Page 21

Page 22

Workflow for NNfD

Page 23

Reference Architecture of DSaaP

• Open source because we can have meaningful common conversations with the community

• Hadoop is…..

Page 24

Implementation Architecture of DSaaP

[Diagram labels: Preservation Platform; Deposit Platform; Discovery Platform; Information Platform; Access Platform; Semantic Platform; Data Platform; Services; Repository; Security; Consumers and Producers; Support and Maintenance]

Page 25

Five Safes for Securing Data

Page 26

DSaaP Hybrid Service Instances

[Diagram labels: Common Service Authentication (Kerberos); AWS Instance; On premise Instance; On premise Instance]

Operationalising 5 Safes at Scale

Page 27

Conclusions

• NNfD challenges us to generate metadata innovatively, and each form of NNfD poses separate management challenges.

• Data and metadata are intrinsically related, and this method lets us exploit that relationship through a common RDF graph approach.

• Transactional or dynamic data calls for a dynamic approach to the generation of metadata.

Page 28

…And the future

• provide a simple but powerful approach to adding structured metadata to NNfD so that we can scale up as our collection grows

• support a research community which includes the international scientific community, commercial users of big data and citizens

• be a trusted source of information on the use of new and novel forms of data in developing impactful research

Page 29

Thank you!

Page 30

Questions

Suzanne Barbalet

[email protected]

Discovery Team, UK Data Service

Nathan Cunningham

[email protected]

http://ukdataservice.ac.uk/about-us/our-rd/big-data-network-support