Session 27: Resources for Data Management and Handling Social Science Data

Page 1: Session 27 : Resources for Data Management and Handling Social Science Data

NCRM, Session 27, 1 July 2008

Session 27: Resources for Data Management and Handling Social Science Data

3rd ESRC Research Methods Festival, Oxford, 1 July 2008

Workshop organised by the ‘Data Management through e-Social Science’ (DAMES) research Node of the National Centre for e-Social Science

www.dames.org.uk / www.ncess.ac.uk

Page 2: Session 27 : Resources for Data Management and Handling Social Science Data

Resources for Data Management and Handling Social Science Data

1400-1430 Key issues, concerns, and the relevance of e-Science (Paul Lambert, Univ. Stirling)

1430-1500 Metadata, say what? (Jesse Blum, Univ. Stirling)

1500-1530 Software for Data Management: The Contribution of Stata (Karen Robson, Geary Inst., Univ. College Dublin)

1600-1630 Helping users see the wood for the trees: ESDS resources for managing and analysing data (Beate Lichtwardt, Univ. Essex)

1630-1700 Social Care Data: Exploring Issues (Alison Dawson & Alison Bowes, Univ. Stirling)

1700-1730 Handling data on occupations, educational qualifications, and ethnicity (Paul Lambert, Univ. Stirling)

Page 3: Session 27 : Resources for Data Management and Handling Social Science Data

Data management & handling social science data: Key issues, concerns, & the relevance of e-Science

1) The nature of data management

2) Key issues and concerns
• good habits and principles
• challenges

3) The contributions of…
• e-Social Science
• the DAMES Node (www.dames.org.uk)

Page 4: Session 27 : Resources for Data Management and Handling Social Science Data

‘Data management’ means… ‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’ […DAMES research Node..]

Usually performed by social scientists themselves

Most overt in quantitative survey data analysis
• ‘variable constructions’, ‘data manipulations’
• navigating abundance of data – thousands of variables

Usually a substantial component of the work process

Page 5: Session 27 : Resources for Data Management and Handling Social Science Data

Some components…

Manipulating data
• Recoding categories / ‘operationalising’ variables

Linking data
• Linking related data (e.g. longitudinal studies)
• Combining / enhancing data (e.g. linking micro- and macro-data)

Secure access to data
• Linking data with different levels of access permission
• Detailed access to micro-data cf. access restrictions

Harmonisation standards
• Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
• Recommendations on particular ‘variable constructions’

Cleaning data
• ‘missing values’; implausible responses; extreme values
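As a rough illustration of the ‘cleaning data’ component, the sketch below treats negative codes as missing and flags implausible values for review rather than silently dropping them. The codes -9 and -7 follow the value labels used elsewhere in these slides; the sample ages, the plausibility range and the function name are invented for illustration only.

```python
# Minimal sketch, assuming -9 and -7 are the survey's missing-value codes.
MISSING_CODES = {-9, -7}


def clean(values, low, high):
    """Return (cleaned, flagged).

    Missing-value codes become None; values outside [low, high] are kept
    but their positions are flagged so an analyst can inspect them.
    """
    cleaned, flagged = [], []
    for position, value in enumerate(values):
        if value in MISSING_CODES:
            cleaned.append(None)
            continue
        cleaned.append(value)
        if not (low <= value <= high):
            flagged.append(position)
    return cleaned, flagged


# Hypothetical ages from an adult survey (invented data).
ages = [34, -9, 51, 214, -7, 19]
cleaned_ages, to_review = clean(ages, low=16, high=110)
print(cleaned_ages)  # [34, None, 51, 214, None, 19]
print(to_review)     # [3]
```

Keeping the flagged value in place and recording its position leaves the decision about an extreme or implausible response as an explicit, documented step.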

Page 6: Session 27 : Resources for Data Management and Handling Social Science Data

Example – recoding data

Count (rows: Highest educational qualification; columns: educ4)

Highest educational qualification   -9.00   1.00   2.00   3.00   4.00  Total
-9 Missing or wild                    323      0      0      0      0    323
-7 Proxy respondent                   982      0      0      0      0    982
1 Higher Degree                         0    425      0      0      0    425
2 First Degree                          0   1597      0      0      0   1597
3 Teaching QF                           0      0    340      0      0    340
4 Other Higher QF                       0      0   3434      0      0   3434
5 Nursing QF                            0      0    161      0      0    161
6 GCE A Levels                          0      0      0   1811      0   1811
7 GCE O Levels or Equiv                 0      0      0      0   2518   2518
8 Commercial QF, No O Levels            0      0      0    331      0    331
9 CSE Grade 2-5, Scot Grade 4-5         0      0      0      0    421    421
10 Apprenticeship                       0      0      0    257      0    257
11 Other QF                           102      0      0      0      0    102
12 No QF                                0      0      0      0   2787   2787
13 Still At School No QF              138      0      0      0      0    138
Total                                1545   2022   3935   2399   5726  15627

educ4 column labels: 1.00 Degree; 2.00 Diploma; 3.00 Higher school or vocational; 4.00 School level or below
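The recode on this slide can be written down as an explicit mapping, which is one way of keeping it reproducible. The sketch below is in Python rather than the SPSS or Stata syntax the workshop uses, and the mapping is read off the cross-tabulation counts (it is inferred from the table, not taken from the author's syntax). It rebuilds the educ4 column totals from the row totals as a check.

```python
from collections import Counter

# Original qualification code -> educ4 code, inferred from the cross-tabulation.
EDUC4 = {
    -9: -9, -7: -9, 11: -9, 13: -9,  # left unclassified (-9)
    1: 1, 2: 1,                      # 1 Degree
    3: 2, 4: 2, 5: 2,                # 2 Diploma
    6: 3, 8: 3, 10: 3,               # 3 Higher school or vocational
    7: 4, 9: 4, 12: 4,               # 4 School level or below
}

# Row totals from the slide: original code -> number of cases.
COUNTS = {
    -9: 323, -7: 982, 1: 425, 2: 1597, 3: 340, 4: 3434, 5: 161, 6: 1811,
    7: 2518, 8: 331, 9: 421, 10: 257, 11: 102, 12: 2787, 13: 138,
}


def recode(code):
    """Map an original code to educ4; an unexpected code raises KeyError."""
    return EDUC4[code]


educ4_totals = Counter()
for code, n_cases in COUNTS.items():
    educ4_totals[recode(code)] += n_cases

print(dict(sorted(educ4_totals.items())))
# {-9: 1545, 1: 2022, 2: 3935, 3: 2399, 4: 5726}
```

Raising an error on an unmapped code, instead of passing it through, is a deliberate choice here: a new category in a later data release then shows up immediately rather than being miscounted.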

Page 7: Session 27 : Resources for Data Management and Handling Social Science Data

Example – Linking data

Linking via ‘ojbsoc00’: c1-5 = original data / c6 = derived from data / c7 = derived from www.camsis.stir.ac.uk
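A linkage of this kind amounts to looking up each case's occupational code in an external table and attaching the matching value as a new column. The following Python sketch assumes invented records and invented scale scores; only the variable name ‘ojbsoc00’ comes from the slide, and the numbers are not real CAMSIS values.

```python
# Hypothetical survey records; 'ojbsoc00' holds an occupational unit code.
survey = [
    {"pid": 1, "ojbsoc00": 2311},
    {"pid": 2, "ojbsoc00": 9233},
    {"pid": 3, "ojbsoc00": 2311},
    {"pid": 4, "ojbsoc00": None},  # no occupation recorded
]

# Hypothetical external lookup: occupational code -> scale score (invented).
scale_lookup = {2311: 70.0, 9233: 30.0}


def link(records, lookup, key, new_var):
    """Return copies of records with new_var added from lookup via key.

    Cases whose key has no entry in the lookup receive None.
    """
    linked = []
    for record in records:
        enriched = dict(record)  # copy, so the original data stay untouched
        enriched[new_var] = lookup.get(record[key])
        linked.append(enriched)
    return linked


linked = link(survey, scale_lookup, key="ojbsoc00", new_var="scale_score")
unmatched = [record["pid"] for record in linked if record["scale_score"] is None]
print(linked[0])   # {'pid': 1, 'ojbsoc00': 2311, 'scale_score': 70.0}
print(unmatched)   # [4]
```

Listing the unmatched cases is worth doing routinely, since a failed link is otherwise indistinguishable from a genuinely missing value.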

Page 8: Session 27 : Resources for Data Management and Handling Social Science Data

A bit of focus…

I tend to emphasise two data management activities:

1) Variable constructions
o Coding and re-coding values

2) Linking datasets
o Internal and external linkages

Page 9: Session 27 : Resources for Data Management and Handling Social Science Data

So why this workshop?

1. DM is a big part of the research process
..but receives limited methodological attention

2. Poor practice in soc. sci. DM is easily observed
• Not keeping adequate records
• Not linking relevant data
• Not trying out relevant variable operationalisations

3. Even though..
• There are plenty of existing resources and standards relevant to data management activities
• There are suitable software and internet facilities
• People are working on DM support (e.g. DAMES)

Page 10: Session 27 : Resources for Data Management and Handling Social Science Data

DAMES research Node

Social researchers often spend more time on data management than any other part of the research process

[Diagram: stages of the research process and associated resources]

Stages: Data access / collection; Data Management; Data Analysis

Resources shown: UK Data Archive; Qualidata; Flagship social surveys; Office for National Statistics; Administrative data; Specialist academic outputs; DAMES; ONS support; ESDS support; NCRM workshops; Essex summer school; ESRC RDI initiatives; CQeSS

Page 11: Session 27 : Resources for Data Management and Handling Social Science Data

DM: Some further considerations

DM as stumbling block in research conduct
• UK has ample data, ample analytical resources, but low levels of exploitation (esp. of complex data)
• Capacity building aims in DAMES

Lots of previous work in this field ..See below..

‘Data management’ also sometimes means..
• Data distributors supplying and monitoring use of particular datasets (e.g. UK Data Archive DM guides)

Page 12: Session 27 : Resources for Data Management and Handling Social Science Data

2. Key issues and concerns

Four good habits and principles (2.1 to 2.4)

Three challenges (2.5 to 2.7)

..Not solely about survey research..

Page 13: Session 27 : Resources for Data Management and Handling Social Science Data

(2.1) Good habit: Keep clear records of your DM activities

• Reproducible (for self)
• Replicable (for all)
• Paper trail for whole lifecycle
• Cf. Dale 2006; Freese 2007

In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)

Syntax Examples: www.longitudinal.stir.ac.uk

Page 14: Session 27 : Resources for Data Management and Handling Social Science Data

Stata syntax example (‘do file’)

Page 15: Session 27 : Resources for Data Management and Handling Social Science Data

Some comments on survey analysis software..

“A program like SPSS .. has two main components: the statistical routines, .. and the data management facilities. Perhaps surprisingly, it was the latter that really revolutionised quantitative social research” [Procter, 2001: 253]

“Socio-economic processes require comprehensive approaches as they are very complex (‘everything depends on everything else’). The data and computing power needed to disentangle the multiple mechanisms at work have only just become available.” [Crouchley and Fligelstone 2004]

Page 16: Session 27 : Resources for Data Management and Handling Social Science Data

Some personal comments on survey analysis software..

Data management and data analysis must be seen as integrated processes

Stata is the most effective software, as it achieves advanced DM and DA functionality and makes good documentation easy

Others argue that more advanced analytical techniques necessitate other packages – I’m not convinced

Page 17: Session 27 : Resources for Data Management and Handling Social Science Data

(2.2) Principle: Use existing standards and previous research

Variable operationalisations

Use recognised recodes / standard classifications
• ONS harmonisation standards
• [Shaw et al. 2007]
• Cross-national standards [Hoffmeyer-Zlotnik & Wolf 2003]

Use reproducible recodes / classifications (paper trail)

Other data file manipulations
• Missing data treatments
• Matching data files (finding the right data)

Page 18: Session 27 : Resources for Data Management and Handling Social Science Data

(2.3) Principle: Do something, not nothing

We currently put much more effort into data collection and data analysis, and neglect data manipulation

Survey research – the influence of ‘what was on the archive version’

…In my experience, a common reason why people didn’t do more DM was because they were frightened to…

Page 19: Session 27 : Resources for Data Management and Handling Social Science Data

(2.4) Principle: Learn how to match files

Complex data (complex research) is distributed across different files

In surveys, use a key linking variable for...

One-to-one matching
SPSS: match files /file="file1.sav" /file="file2.sav" /by=pid.
Stata: merge pid using file2.dta

One-to-many matching (‘table distribution’)
SPSS: match files /file="file1.sav" /table="file2.sav" /by=pid.
Stata: merge pid using file2.dta

Many-to-one matching (‘aggregation’)
SPSS: aggregate outfile="file3.sav" /break=pid /meaninc=mean(income).
Stata: collapse (mean) meaninc=income, by(pid)
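For readers who do not use SPSS or Stata, the three kinds of match can be sketched in plain Python. This is an illustration under stated assumptions, not a translation of the commands above: the records, the household identifier ‘hid’ and the variables ‘trust’ and ‘region’ are invented, and the aggregation breaks on ‘hid’ rather than ‘pid’ so that each group holds more than one case.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical files (invented data): persons, a second person-level file,
# and a household-level file.
persons = [
    {"pid": 1, "hid": 10, "income": 200.0},
    {"pid": 2, "hid": 10, "income": 100.0},
    {"pid": 3, "hid": 11, "income": 300.0},
]
attitudes = [{"pid": 1, "trust": 4}, {"pid": 2, "trust": 2}, {"pid": 3, "trust": 5}]
households = [{"hid": 10, "region": "North"}, {"hid": 11, "region": "South"}]


def index_by(rows, key):
    """Index rows by key, refusing duplicate keys in the lookup file."""
    index = {}
    for row in rows:
        if row[key] in index:
            raise ValueError(f"duplicate key {row[key]!r} in lookup file")
        index[row[key]] = row
    return index


def match(master, using, key):
    """Add the fields of `using` (unique on key) to every row of `master`.

    One-to-one if master is also unique on key; otherwise each lookup row
    is distributed to several master rows ('table distribution').
    """
    lookup = index_by(using, key)
    return [{**row, **lookup.get(row[key], {})} for row in master]


def aggregate_mean(rows, key, var, new_var):
    """Many-to-one: collapse rows to one record per key holding the mean of var."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[var])
    return [{key: k, new_var: mean(values)} for k, values in sorted(groups.items())]


one_to_one = match(persons, attitudes, "pid")
one_to_many = match(persons, households, "hid")
many_to_one = aggregate_mean(persons, "hid", "income", "meaninc")
print(many_to_one)  # [{'hid': 10, 'meaninc': 150.0}, {'hid': 11, 'meaninc': 300.0}]
```

The duplicate-key check in the lookup file mirrors the main practical risk in file matching: a linking variable that is assumed to be unique but is not.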

Page 20: Session 27 : Resources for Data Management and Handling Social Science Data

Some challenges for data management..

(2.5) Agreeing about variable constructions

Unresolved debates about optimal measures and variables

Esp. in comparative research such as across time, between countries

http://www.longitudinal.stir.ac.uk/variables/

Page 21: Session 27 : Resources for Data Management and Handling Social Science Data

Some challenges for data management..

(2.6) Worrying about data security

DM activities could challenge data security
• Inspecting individual cases
• Multiple copies of related data files
• Ability to link with other datasets
• ‘Hands-on’ model of data review

New and exciting data resources
• have more individual information
• are more likely to be released with stringent conditions
• may jeopardize traditional DM approaches

Page 22: Session 27 : Resources for Data Management and Handling Social Science Data

Some challenges for data management..

(2.7) Incentivising documentation / replicability

There is little to press researchers to better document DM, but much to press them not to

• Make DM and its documentation easier?
• Reward documentation (e.g. citations)?

Page 23: Session 27 : Resources for Data Management and Handling Social Science Data

3) The relevance of e-Science

‘Data management through e-Social Science’

‘E-Science’ refers to adopting a number of particular approaches and standards from computing science, to applied research areas

These approaches include ‘the Grid’; distributed computing; data and computing standardisation; metadata; security; research infrastructures

DAMES (2008-11) – developing services / resources using e-Science approaches which will help social scientists in undertaking data management tasks

Page 24: Session 27 : Resources for Data Management and Handling Social Science Data

E-Science and Data Management

E-Science isn’t essential to good DM, but it has capacity to improve and support conduct of DM…

1) Concern with standards setting in communication and enhancement of data

2) Linking distributed/heterogeneous/dynamic data
Coordinating disparate resources; interrogating live resources

3) Contribution of metadata tools/standards for variable harmonisation and standardisation

4) Linking data subject to different security levels

5) The workflow nature of many DM tasks

Page 25: Session 27 : Resources for Data Management and Handling Social Science Data

E.g. of GEODE: Organising and distributing specialist data resources (on occupations)

Page 26: Session 27 : Resources for Data Management and Handling Social Science Data

The contribution of DAMES

8 project themes

1.1) Grid Enabled Specialist Data Environments (‘GE*DE’)
1.2) Data resources for micro-simulation on social care data
1.3) Linking e-Health and social science databases
1.4) Training and interfaces for management of complex survey data

2.1) Description, discovery & service use through metadata and data abstraction
2.2) Techniques to handle data from multiple sources
2.3) Workflow modelling for social science
2.4) Security driven data management

Page 27: Session 27 : Resources for Data Management and Handling Social Science Data

DAMES agenda

Useful social science provisions
• Specialist data topics – occupations; education qualifications; ethnicity; social care; health
• Mainstream packages and accessible resources

To exploit / engage with existing DM resources
• In social science – e.g. CESSDA
• In e-Science – e.g. OGSA-DAI; OMII

Page 28: Session 27 : Resources for Data Management and Handling Social Science Data

..End of talk 1..

1400-1430 Key issues, concerns, and the relevance of e-Science (Paul Lambert, Univ. Stirling)

1430-1500 Metadata, say what? (Jesse Blum, Univ. Stirling)

1500-1530 Software for Data Management: The Contribution of Stata (Karen Robson, Geary Inst., Univ. College Dublin)

1600-1630 Helping users see the wood for the trees: ESDS resources for managing and analysing data (Beate Lichtwardt, Univ. Essex)

1630-1700 Social Care Data: Exploring Issues (Alison Dawson & Alison Bowes, Univ. Stirling)

1700-1730 Handling data on occupations, educational qualifications, and ethnicity (Paul Lambert, Univ. Stirling)

Page 29: Session 27 : Resources for Data Management and Handling Social Science Data

Appendix

Existing resources – sources and types of support for data management in the social sciences:

Page 30: Session 27 : Resources for Data Management and Handling Social Science Data

Existing resources (i): Data providers

a) Documentation and metadata files

Page 31: Session 27 : Resources for Data Management and Handling Social Science Data

Existing resources (i): Data providers

b) Resources for variables
• CESSDA PPP on key variables http://www.nsd.uib.no/cessda/project/
• UK Question Bank http://qb.soc.surrey.ac.uk/
• ONS Harmonisation http://www.statistics.gov.uk/about/data/

c) Resources for datasets
• UK Census data portal, http://census.ac.uk/
• IPUMS international census data facilities, www.ipums.org
• European Social Survey, www.europeansocialsurvey.org

d) Data manipulations prior to data release
• Missing data imputation / documentation
• Survey design / weighting information
• Influential – most analysts use ‘the archive version’

Page 32: Session 27 : Resources for Data Management and Handling Social Science Data

Existing resources (ii) Resource projects / infrastructures

- UK ESDS www.esds.ac.uk
  - ESDS International | ESDS Government | ESDS Longitudinal | ESDS Qualidata
  - Helpdesks; online instructions; user support..

- UK ESRC NCRM / NCeSS / RDI initiatives
  - Longitudinal data – www.longitudinal.stir.ac.uk
  - Linking micro/macro - www.mimas.ac.uk/limmd/

- Other resources / projects / initiatives
  - EDACwowe - http://recwowe.vitamib.com/datacentre
  - ….

Page 33: Session 27 : Resources for Data Management and Handling Social Science Data

Existing resources (iii) Analytical and software support

Textbooks featuring data management
• [Levesque 2008]
• [Sarantakos 2007]

Software training covering DM
• Stata’s ‘data management’ manual
• SPSS user group course on syntax and data management, www.spssusers.co.uk

But generally, sustained marginalisation of DM as a topic
• Advanced methods texts use simplistic data
• Advanced software for analysis isn’t usually combined with extended DM requirements

Page 34: Session 27 : Resources for Data Management and Handling Social Science Data

Existing resources (iv) Data analysts’ contributions

Academic researchers often generate and publish their own DM resources, e.g.

Harry Ganzeboom on education and occupations, http://home.fsw.vu.nl/~ganzeboom/pisa/

Provision of whole or partial syntax programming examples

Analysts often drive wider resource provisions related to DM

CAMSIS project on occupational scales, www.camsis.stir.ac.uk

CASMIN project on education and social class

Page 35: Session 27 : Resources for Data Management and Handling Social Science Data

Existing resources (v) Literatures on harmonisation and standardisation

National Statistics Institutes’ principles and practices

E.g. ONS www.statistics.gov.uk/about/data/harmonisation/

Cross-national organisations

E.g. UNSTATS - http://unstats.un.org/unsd/class/

Academic studies

E.g. [Harkness et al 2003]; [Hoffmeyer-Zlotnik & Wolf 2003]; [Jowell et al. 2007]

Page 36: Session 27 : Resources for Data Management and Handling Social Science Data

References

Blossfeld, H. P., & Rohwer, G. (2002). Techniques of Event History Modelling: New Approaches to Causal Analysis, 2nd Edition. Mahwah, NJ: Lawrence Erlbaum Associates.

Crouchley, R., & Fligelstone, R. (2004). The Potential for High End Computing in the Social Sciences. Lancaster: Centre for Applied Statistics, Lancaster University, and http://redress.lancs.ac.uk/document-pool/hecsspotential.pdf.

Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.

Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2), 2007.

Harkness, J., van de Vijver, F. J. R., & Mohler, P. P. (Eds.). (2003). Cross-Cultural Survey Methods. New York: Wiley.

Hoffmeyer-Zlotnik, J. H. P., & Wolf, C. (Eds.). (2003). Advances in Cross-national Comparison: A European Working Book for Demographic and Socio-economic Variables. Berlin: Kluwer Academic / Plenum Publishers.

Jowell, R., Roberts, C., Fitzgerald, R., & Eva, G. (2007). Measuring Attitudes Cross-Nationally. London: Sage.

Levesque, R., & SPSS Inc. (2008). Programming and Data Management for SPSS 16.0: A Guide for SPSS and SAS users. Chicago, Il.: SPSS Inc.

Procter, M. (2001). Analysing Survey Data. In G. N. Gilbert (Ed.), Researching Social Life, Second Edition (pp. 252-268). London: Sage.

Sarantakos, S. (2007). A Tool Kit for Quantitative Data Analysis Using SPSS. London: Palgrave Macmillan.

Shaw, M., Galobardes, B., Lawlor, D. A., Lynch, J., Wheeler, B., & Davey Smith, G. (2007). The Handbook of Inequality and Socioeconomic Position: Concepts and Measures. Bristol: Policy Press.