Big Data in UK Biobank: Opportunities and Challenges Funders: Wellcome Trust and Medical Research Council, with Department of Health, Scottish & Welsh Governments, British Heart Foundation and Diabetes UK Rory Collins UK Biobank Principal Investigator BHF Professor of Medicine & Epidemiology Nuffield Department of Population Health University of Oxford, UK


Post on 21-Dec-2015



UK Biobank Prospective Cohort

• 500,000 UK men and women aged 40-69 years when recruited and assessed during 2006-2010

• Extensive baseline questions and measurements, with stored biological samples (and opportunities to add enhanced assessments in large subsets)

• Repeat assessments over time in subsets of the participants to allow for sources of variation

• General consent for follow-up through all health records and for all types of health research

• Sufficiently large numbers of people developing different conditions to assess causes reliably
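One standard use of repeat assessments is a regression dilution correction, which is a general epidemiological technique rather than anything specified on these slides. A minimal sketch, assuming a single noisy baseline measurement; the variances and the observed slope are invented for illustration:

```python
def regression_dilution_ratio(var_between: float, var_within: float) -> float:
    """Fraction of baseline variance that reflects stable between-person differences."""
    return var_between / (var_between + var_within)


def corrected_slope(observed_slope: float, ratio: float) -> float:
    """Scale an observed log-risk slope up to the slope for the 'usual' level."""
    return observed_slope / ratio


# Hypothetical variances (mmHg^2), as might be estimated from a resurveyed subset.
ratio = regression_dilution_ratio(var_between=225.0, var_within=150.0)
print(round(ratio, 2))                          # 0.6
print(round(corrected_slope(0.018, ratio), 3))  # 0.03
```

Under these assumed values, a slope estimated from baseline values alone would understate the association with the usual level by about 40%.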

Need for prospective studies to be LARGE: CHD versus SBP for 5K vs 50K vs 500K people in the Prospective Studies Collaboration (PSC)

[Figure: three panels (500,000 people; 50,000 people; 5,000 people), each plotting CHD on a doubling scale (1 to 256) against usual SBP (120 to 180 mmHg), with separate lines for age at risk 40-49, 50-59, 60-69, 70-79 and 80-89.]
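The effect of cohort size in the 5,000 vs 50,000 vs 500,000 comparison above can be sketched with the usual large-sample approximation for the standard error of a log rate ratio, sqrt(1/a + 1/b), where a and b are the event counts in the two groups compared. The 2% event proportion and the even split between groups are hypothetical values for illustration, not PSC figures:

```python
import math


def ci_width_factor(events_exposed: float, events_unexposed: float) -> float:
    """Ratio of upper to lower 95% confidence limit for a rate ratio."""
    se_log_rr = math.sqrt(1 / events_exposed + 1 / events_unexposed)
    return math.exp(2 * 1.96 * se_log_rr)


# Hypothetical: 2% of the cohort have a CHD event, split evenly between two SBP groups.
for cohort_size in (5_000, 50_000, 500_000):
    events_per_group = cohort_size * 0.02 / 2
    print(cohort_size, round(ci_width_factor(events_per_group, events_per_group), 2))
```

Each tenfold increase in size narrows the interval on the log scale by a factor of about sqrt(10), and the gain matters most once the events are split further by age and exposure category.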

Locations of UK Biobank assessment centres around the UK (with people recruited from urban and rural areas)

UK Biobank: 500,000 participants aged 40-69 recruited in 2007-10

Age          40-49     119,000
             50-59     168,000
             60-69     213,000

Gender       Male      228,000
             Female    270,000

Deprivation  More       92,000
             Average   166,000
             Less      241,000

Generalisability (not representativeness): Heterogeneity of study population allows associations with disease to be studied reliably

Production line baseline assessment visit (improved throughput; efficient staffing)

Baseline assessment: Questionnaire content

Self-completion: topics                            Median time (minutes)

Socio-demographics                                  1.7
Ethnicity                                           0.1
Work-employment                                     1.4
Physical activity                                   4.4
Smoking (non-smokers)                               0.5
Smoking (past/current smokers)                      1.5
Diet (food frequency)*                              4.5
Alcohol                                             1.1
Sleep                                               1.2
Sun exposure                                        1.3
Environmental exposures                             1.0
Early life factors                                  0.8
Family history of common diseases                   1.6
Reproductive history & screening (women)            2.4
Reproductive history & screening (men)              0.8
Sexual history                                      0.4
General health                                      2.1
Past medical history & medications                  1.6
Noise exposure                                      1.0
Psychological status                                4.5
Cognitive function tests                           10.0
Hearing speech-in-noise test                        8.0

Total time                                         52.5

Interview: topics                                  Median time (minutes)

Medical history/medication                          3.1
Occupation                                          0.4
Other                                               0.6

Total time                                          4.1

*Subset of 200,000 participants: repeated daily diet diaries conducted via the internet

Touchscreen and interview questions (plus extra enhancement questions) available at www.ukbiobank.ac.uk

Baseline assessment: Physical measurements (with enhanced measures in large subsets)

All 500,000 participants

• Blood pressure & heart rate

• Height (standing/seated)

• Waist/hip circumference

• Weight/impedance

• Spirometry

• Heel ultrasound

Subset: 175,000 participants

• Hearing test

• Vascular reactivity

Subset: 120,000 participants

• Visual acuity, refractive index & intraocular pressure

Subset: 85,000 participants

• Retinal images & optical coherence tomograms

• Fitness test & ECG limb leads

UK Biobank different types of biological sample: allowing a wide range of different assays

Sample collection tube, with fractions collected and potential assays:

• Na+ EDTA
– Fractions collected: plasma; buffy coat; red cells
– Potential assays: plasma proteome and metabonome; assays of genomic DNA; membrane lipids and heavy metals

• Lithium Heparin (PST)
– Fractions collected: plasma
– Potential assays: plasma proteome and metabonome (without haemolysis)

• Silica clot accelerator (SST)
– Fractions collected: serum
– Potential assays: serum proteome and metabonome (without haemolysis)

• Acid citrate dextrose
– Fractions collected: whole blood
– Potential assays: assays of DNA extracted from EBV immortalised cell lines; (B-cell transcriptome)

• EDTA
– Fractions collected: whole blood
– Potential assays: standard haematological parameters

• Tempus RNA stabilisation
– Fractions collected: whole blood with lysis reagent
– Potential assays: blood transcriptome; representative transcriptomes of other tissues

• Urine
– Fractions collected: urine
– Potential assays: urine proteome and metabonome; gut microbiome

• Saliva
– Fractions collected: mixed saliva sample
– Potential assays: salivary proteome and metabonome; salivary microbiome; (mucosal proteome and metabonome)

Further enhancements of the phenotyping of UK Biobank participants currently being conducted

• Web-based assessments of diet completed

Web-based dietary assessment: 24-hr recall

• Design considerations:

– Easy and quick: takes only 10-15 minutes

– Automated data collection and coding

– Repeatable (capturing seasonal variation)

– Detailed enough to estimate nutrient intake

• Over 200,000 participants completed the questionnaire at least once, and about 90,000 did so more than once

Future web-based assessments for exposures

• Cognitive function

– Repeat assessment of baseline measures

– Broaden cognitive phenotyping with new measures

– Complements enhanced cognitive function assessment that is planned for the imaging assessment visit

• Occupational history

– Information about all previous occupations (not just latest)

– Greater detail on type of work and duration

• Physical activity questionnaire (RPAQ)

– Complement data from activity monitor


• Web-based assessments of diet completed; and next to be cognition/mental health (2014)

• Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15)

• ~45% of participants agree to wear one

• Willing participants sent device by mail

• It is to be worn continuously for 7 days

• Returned by mail and data downloaded

• Device cleaned and sent to next participant

• 100K participants from mid-2013 to mid-2015 (50,000 complete data-sets already obtained)
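The mail-out cycle above implies a rough lower bound on the number of devices in circulation. A back-of-envelope sketch, assuming each device serves one participant per cycle; the 14 days allowed for postage, download and cleaning is a hypothetical figure, not one given on the slides:

```python
import math


def devices_needed(participants: int, cycle_days: int, programme_days: int) -> int:
    """Minimum devices if each one serves a new participant every cycle_days."""
    cycles_per_device = programme_days // cycle_days
    return math.ceil(participants / cycles_per_device)


# 7 days of wear plus a hypothetical 14 days for postage, download and cleaning.
print(devices_needed(participants=100_000, cycle_days=21, programme_days=730))
```

The estimate ignores lost or late devices, so a real programme would need some margin above this figure.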

UK Biobank wrist-worn accelerometer


• Biobank chip to genotype (GWAS; candidate SNPs; exome) all participants (2013-15)

Genotyping of all UK Biobank participants

• 820K bespoke UK Biobank Affymetrix genotyping chip:

– 250,000 SNPs in a whole-genome array

– 200,000 markers for known risk factor or disease associations, copy number variation, loss of function, and insertions/deletions

– 150,000 exome markers for high proportion of non-synonymous coding variants with allele frequency over 0.02%

• Estimate (“impute”) additional genotypes by combining measured genotypes with reference sequence data

• Researchers can study associations of genotype data with biochemical risk factors and detailed phenotyping from baseline assessment, along with health outcomes
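The idea of imputing untyped genotypes from reference data can be illustrated with a toy example. This is a deliberately simplified nearest-haplotype sketch; production imputation tools use probabilistic models over phased haplotypes, and the slides do not say which method UK Biobank uses. The marker data and function name are hypothetical:

```python
from typing import List, Optional

Haplotype = List[Optional[int]]  # 0/1 alleles; None marks an untyped marker


def impute_from_reference(observed: Haplotype, reference_panel: List[List[int]]) -> List[int]:
    """Fill untyped markers by copying the best-matching reference haplotype."""
    def mismatches(ref: List[int]) -> int:
        # Count disagreements at markers that were actually measured.
        return sum(1 for o, r in zip(observed, ref) if o is not None and o != r)

    best = min(reference_panel, key=mismatches)
    return [r if o is None else o for o, r in zip(observed, best)]


# Hypothetical 6-marker region: positions 1 and 4 are not on the chip.
panel = [
    [0, 0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
]
observed = [0, None, 1, 0, None, 1]
print(impute_from_reference(observed, panel))  # [0, 1, 1, 0, 0, 1]
```

Measured alleles are never overwritten; only the gaps are filled from the closest reference haplotype.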


• Standard panel of assays (e.g. lipids; clotting) on samples from all participants (2014-15)

Rationale for assaying many standard markers in baseline samples from all 500,000 participants  

• Cost-effective way of increasing the usability of the resource for researchers, by providing data for:

– Cross-sectional analyses with prevalent disease

– Identification of subsets based on assay values

• Conducting these assays in all of the participants at the same time should facilitate good quality control

• Lower cost for conducting all of these assays at one time rather than in multiple retrievals and assays

• Facilitates management of depletable samples

Consideration of a proposal to conduct assays of biomarkers of infectious disease in all participants

• Request from the international research community to facilitate studies of the associations of infectious agents with disease (in particular, different types of cancer)

• Plan would be to assay a panel of infectious agents (e.g. HPV, Hepatitis B & C, HBV, EBV, H. pylori) in the baseline sample collected from all 500,000 participants

• As with the biochemical and genetic assays that are being conducted, assays of a wide range of infectious agents would increase the efficient use of the resource

• Detailed proposal for funding is now being developed


• Information from multiple imaging modalities (e.g. brain/heart/body MRI; bone/joint DEXA)

Imaging of 100,000 UK Biobank participants

• MRI of brain, heart and abdomen

• DEXA of bones, joints and body

• Ultrasound of carotid arteries

• Shortened baseline assessment plus more detailed cognitive function tests and ECG to detect rhythm disturbances

Pilot phase: 4,000-6,000 people in 1 centre (2014-15)

Main phase: 95,000 people in 3 centres (2015-19)

Opportunities for repeat imaging in sub-sets (e.g. as part of MRC’s focus on dementia)

Body Mass Index (BMI) vs Heart Disease and Stroke (PSC: 1M people followed for 12 years; Lancet 2009)

[Figure: annual deaths per 1000 (doubling scale, 10 to 160; floated so mean = PSC rates at age 65-69) against baseline BMI (15 to 50 kg/m2), with separate lines for heart disease (18 237 deaths) and stroke (6122 deaths).]

Adjusted for age, sex, smoking & study; first 5 years of follow-up excluded

At BMI >25: 5 units higher BMI associated with ~40% higher IHD & stroke mortality

At BMI <25: positive association continues for IHD, but not for stroke
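The "~40% per 5 units" figure can be converted to other BMI differences if the association is assumed to be log-linear, which the slide states only for BMI above 25. A minimal sketch under that assumption; the function name is illustrative:

```python
def mortality_ratio(bmi_difference: float, ratio_per_5_units: float = 1.4) -> float:
    """Mortality ratio for a BMI difference, assuming a log-linear association."""
    return ratio_per_5_units ** (bmi_difference / 5)


print(round(mortality_ratio(1), 3))   # 1.07
print(round(mortality_ratio(10), 2))  # 1.96
```

So, under this assumption, a BMI of 35 versus 25 would correspond to roughly a doubling of IHD and stroke mortality.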

Similar age, gender, BMI & % body fat, but different amounts of INTERNAL FAT

5.86 litres of internal fat

1.65 litres of internal fat

Atrial fibrillation (AF): prevalence and mortality during the period between 1993 and 2007

Piccini et al. Circulation: Cardiovascular Quality and Outcomes. 2012

Prevalence: increasing; mortality: little change

Consideration of prolonged cardiac monitoring

• Cardiac arrhythmias (especially AF)

– can indicate significant underlying cardiac disease

– can directly cause significant morbidity and mortality

– important risk factors for cardio-embolic events (esp. stroke)

• Detection requires prolonged monitoring

– many are intermittent (e.g. paroxysmal AF)

– substantial under-detection with standard 12 lead ECG

– AF increases with age (<50 years: <1%; >80 years: 10%+)

• No large-scale population-based prospective studies with prolonged monitoring, so the full extent/impact of AF on health outcomes is likely to have been underestimated

Example of device for prolonged arrhythmia detection

iRhythm Zio Patch

• Has been used in 18,000 people

• Non-invasive stick-on patch

• Comfortable (median wear 12 days)

• Can be applied in clinic or at home

• Beat-to-beat ECG recording

• Validated against reference Holter

• Potentially recyclable device chip which stores data for downloading

Planning to pilot feasibility and acceptability during imaging pilot

UK Biobank: Centralised follow-up of health

• Death and cancer registries

• In-patient and out-patient hospital episodes (including psychiatric) and related procedure registries

• Primary care records of health conditions, prescriptions, diagnostic tests and other investigations

• Other health-related: disease registries; dispensing records; imaging; screening; dental records

• Direct to participants: self-reported medical conditions; treatments actually being taken; degree of functional impairment; cognitive and psychological scores

Health outcome data-linkage challenges

• Regulation, bureaucracy, and permissions (despite explicit consent from participants)

• Data transfer, matching and coding queries

• Understanding different data structures

• Mapping between coding systems

• Mapping between different countries

• Presenting outcome data to researchers

– Original outcome codes

– Post-adjudication outcomes
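Mapping between coding systems can be pictured as a lookup from (system, code) to a harmonised outcome label. The sketch below is illustrative only: the prefix table is a tiny hand-picked sample and the function name is hypothetical, whereas a real linkage pipeline would rely on curated, versioned code lists:

```python
from typing import Optional

# Illustrative prefix map only; a real study would use a curated, versioned code list.
OUTCOME_BY_CODE_PREFIX = {
    ("ICD10", "I21"): "myocardial infarction",
    ("ICD10", "I63"): "ischaemic stroke",
    ("ICD10", "E11"): "type 2 diabetes",
    ("ICD9", "410"): "myocardial infarction",
}


def map_to_outcome(system: str, code: str) -> Optional[str]:
    """Return a harmonised outcome label, or None if the code is unmapped."""
    normalised = code.replace(".", "").upper()
    for (sys_name, prefix), outcome in OUTCOME_BY_CODE_PREFIX.items():
        if sys_name == system and normalised.startswith(prefix):
            return outcome
    return None


print(map_to_outcome("ICD10", "I21.9"))  # myocardial infarction
print(map_to_outcome("ICD9", "410.1"))   # myocardial infarction
print(map_to_outcome("ICD10", "J44.1"))  # None
```

Keeping the original code alongside the mapped label matches the slide's point that researchers may want both original outcome codes and post-adjudication outcomes.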

Progress with UK-wide linkage to outcome data (both before and after baseline assessment)

Meaning of coded data from health records

• What do the coded data actually tell us?

• Characteristics of coded data

– How accurate?

– How detailed?

– How complete?

• Do we need to go beyond the coded data?

UK Biobank: Expected numbers of participants developing diseases during long-term follow-up

Condition                2012     2017     2022

Diabetes               10,000   25,000   40,000
MI/CHD death            7,000   17,000   28,000
Stroke                  2,000    5,000    9,000
COPD                    3,000    8,000   14,000
Breast cancer           2,500    6,000   10,000
Colorectal cancer       1,500    3,500    7,000
Prostate cancer         1,500    3,500    7,000
Lung cancer               800    2,000    4,000
Hip fracture              800    2,500    6,000
Rheumatoid arthritis      800    2,000    3,000
Alzheimer’s               800    3,000    9,000

General strategy for outcome adjudication

• Avoid false positive cases (but tolerate some false negatives)

• Geographical generalisability

• Cost-effectiveness

• Future-proofed

• Scalability

• Staged approach:

– Ascertain

– Confirm

– Classify
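The preference for avoiding false positives over false negatives has a simple arithmetic basis: when misclassification is unrelated to exposure, imperfect sensitivity alone leaves a risk ratio unchanged, whereas imperfect specificity pulls it towards 1. A minimal sketch with hypothetical risks and error rates:

```python
def observed_risk(true_risk: float, sensitivity: float, specificity: float) -> float:
    """Proportion labelled as cases when case detection is imperfect."""
    return sensitivity * true_risk + (1 - specificity) * (1 - true_risk)


def observed_risk_ratio(risk_exposed: float, risk_unexposed: float,
                        sensitivity: float, specificity: float) -> float:
    """Risk ratio as it would appear after non-differential misclassification."""
    return (observed_risk(risk_exposed, sensitivity, specificity)
            / observed_risk(risk_unexposed, sensitivity, specificity))


# Hypothetical true risks of 4% and 2% (true risk ratio 2.0).
print(round(observed_risk_ratio(0.04, 0.02, sensitivity=0.7, specificity=1.0), 2))   # 2.0
print(round(observed_risk_ratio(0.04, 0.02, sensitivity=1.0, specificity=0.98), 2))  # 1.49
```

Missing 30% of true cases costs statistical power but not validity here, while a 2% false-positive rate cuts the apparent excess risk roughly in half.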


Staged approach to outcome adjudication

ASCERTAINMENT of suspected cases
– Characteristics: cost-effective; feasible; scalable
– Possible data sources: death registers; cancer registers; hospital episodes; primary care records; web-based questionnaires

CONFIRMATION of “case-ness”
– Characteristics: as above, but greater cost/lower feasibility
– Possible data sources: cross-referencing e-records; disease registers

CLASSIFICATION of disease cases
– Characteristics: more involved and costly per case
– Possible data sources: review of clinical records; tumour collections/assays; specialised databases (e.g. imaging)
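The three stages can be pictured as a simple escalation rule applied to each participant. The sketch below is hypothetical throughout: the class, the field names and the "two independent sources" confirmation rule are assumptions for illustration, not the protocols being developed by the working groups:

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Participant:
    participant_id: str
    flagged_sources: Set[str] = field(default_factory=set)  # e.g. {"hospital", "death"}
    record_review_subtype: Optional[str] = None             # from clinical record review


def adjudicate(p: Participant, min_sources: int = 2) -> str:
    """Return the furthest stage reached: none, suspected, confirmed or classified."""
    if not p.flagged_sources:
        return "none"
    if len(p.flagged_sources) < min_sources:   # hypothetical confirmation rule
        return "suspected"
    if p.record_review_subtype is None:
        return "confirmed"
    return f"classified: {p.record_review_subtype}"


print(adjudicate(Participant("A", {"hospital"})))                              # suspected
print(adjudicate(Participant("B", {"hospital", "primary_care"})))              # confirmed
print(adjudicate(Participant("C", {"hospital", "death"}, "ischaemic stroke"))) # classified: ischaemic stroke
```

Each stage only runs on cases that passed the previous one, which is what keeps the costlier steps (record review, specialised databases) limited to a small subset.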

Expert Working Groups developing protocols for ascertainment, confirmation and classification

Cancer

Diabetes

Cardiac outcomes

Stroke

Mental health outcomes

Ocular outcomes

Neurodegenerative outcomes

Respiratory outcomes

Musculoskeletal outcomes

Pilots progressing well; preparing for scaling up of algorithms and then for web adjudication

Pilots commencing

Pilots being developed

UK Biobank: Principles of Access

• UK Biobank is available to all bona fide researchers for all types of health-related research that is in the public interest

• No preferential or exclusive access (and, in particular, access does not involve “collaboration” with UK Biobank)

• Researchers have to pay for access to the Resource for their proposed research on a cost-recovery basis only

• Access to the biological samples that are limited and depletable will be carefully controlled and coordinated

• Researchers are required to publish their findings and return the data so that other researchers can use them

“Showcase”: e-catalogue of data items currently in the UK Biobank Resource (www.ukbiobank.ac.uk)

Showcase supports search strategies for data items in the UK Biobank Resource

Body Composition: % Body Fat

Preliminary applications subdivided by type of researcher, location and type of research

What makes UK Biobank special?

• PROSPECTIVE: It can assess the full effects of a particular exposure (such as smoking) on all types of health outcome (such as cancer, vascular disease, lung disease, dementia)

• DETAILED: The wide range of questions, measures and samples at baseline allows good assessment of exposures, and outcome adjudication allows good disease classification

• BIG: Inclusion of large number of participants allows reliable assessment of the causes of a wide range of diseases, and of the combined impact of many different exposures

Unique combination of BREADTH and DEPTH