Big Data in UK Biobank: Opportunities and Challenges
Funders: Wellcome Trust and Medical Research Council, with Department of Health, Scottish & Welsh Governments,
British Heart Foundation and Diabetes UK
Rory Collins
UK Biobank Principal Investigator
BHF Professor of Medicine & Epidemiology
Nuffield Department of Population Health
University of Oxford, UK
UK Biobank Prospective Cohort
• 500,000 UK men and women aged 40-69 years when recruited and assessed during 2006-2010
• Extensive baseline questions and measurements, with stored biological samples (and opportunities to add enhanced assessments in large subsets)
• Repeat assessments over time in subsets of the participants to allow for sources of variation
• General consent for follow-up through all health records and for all types of health research
• Sufficiently large numbers of people developing different conditions to assess causes reliably
Need for prospective studies to be LARGE: CHD versus SBP for 5K vs 50K vs 500K people in the Prospective Studies Collaboration (PSC)
[Figure: three panels (500,000 people, 50,000 people and 5,000 people), each plotting CHD on a doubling scale (1 to 256) against usual SBP (120 to 180 mmHg), with one line per age at risk: 80-89, 70-79, 60-69, 50-59 and 40-49]
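The point about study size can be illustrated with a minimal sketch. It uses the standard approximation that the standard error of a log rate ratio comparing two groups is sqrt(1/a + 1/b), where a and b are the numbers of events in each group. The 2% event fraction and the even split of events between the two groups are hypothetical values chosen for illustration; they are not taken from the PSC or from UK Biobank.

```python
import math

EVENT_FRACTION = 0.02  # hypothetical: 2% of participants develop the disease


def se_log_rate_ratio(events_a: int, events_b: int) -> float:
    """Approximate standard error of a log rate ratio from two event counts."""
    return math.sqrt(1.0 / events_a + 1.0 / events_b)


def ci_factor(se: float, z: float = 1.96) -> float:
    """Factor by which the 95% interval extends either side of the estimate."""
    return math.exp(z * se)


def precision_by_cohort_size(sizes=(5_000, 50_000, 500_000)):
    """Return (cohort size, events, SE, CI factor) for each cohort size."""
    rows = []
    for n in sizes:
        events = round(n * EVENT_FRACTION)
        half = events // 2  # hypothetical: events split evenly between two groups
        se = se_log_rate_ratio(half, half)
        rows.append((n, events, se, ci_factor(se)))
    return rows


for n, events, se, factor in precision_by_cohort_size():
    print(f"{n:>7} people, {events:>6} events: "
          f"SE(log RR) = {se:.3f}, CI factor = {factor:.2f}")
```

Under these assumptions, each tenfold increase in the number of participants shrinks the standard error by a factor of sqrt(10), which is why the age-specific lines only become reliably distinguishable in the largest panel.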
Locations of UK Biobank assessment centres around the UK (with people recruited from urban and rural areas)
UK Biobank: 500,000 participants aged 40-69 recruited in 2007-10
Age          40-49     119,000
             50-59     168,000
             60-69     213,000
Gender       Male      228,000
             Female    270,000
Deprivation  More       92,000
             Average   166,000
             Less      241,000
Generalisability (not representativeness): Heterogeneity of study population allows associations with disease to be studied reliably
Baseline assessment: Questionnaire content
Self-completion: topics                          Median time (minutes)
Socio-demographics                               1.7
Ethnicity                                        0.1
Work-employment                                  1.4
Physical activity                                4.4
Smoking (non-smokers)                            0.5
Smoking (past/current smokers)                   1.5
Diet (food frequency)*                           4.5
Alcohol                                          1.1
Sleep                                            1.2
Sun exposure                                     1.3
Environmental exposures                          1.0
Early life factors                               0.8
Family history of common diseases                1.6
Reproductive history & screening (women)         2.4
Reproductive history & screening (men)           0.8
Sexual history                                   0.4
General health                                   2.1
Past medical history & medications               1.6
Noise exposure                                   1.0
Psychological status                             4.5
Cognitive function tests                         10.0
Hearing speech-in-noise test                     8.0
Total time                                       52.5
Interview: topics                                Median time (minutes)
Medical history/medication                       3.1
Occupation                                       0.4
Other                                            0.6
Total time                                       4.1
*Subset of 200,000 participants: repeated daily diet diaries conducted via the internet
Touchscreen and interview questions (plus extra enhancement questions) available at www.ukbiobank.ac.uk
Baseline assessment: Physical measurements (with enhanced measures in large subsets)
All 500,000 participants
• Blood pressure & heart rate
• Height (standing/seated)
• Waist/hip circumference
• Weight/impedance
• Spirometry
• Heel ultrasound
Subset: 175,000 participants
• Hearing test
• Vascular reactivity
Subset: 120,000 participants
• Visual acuity, refractive index & intraocular pressure
Subset: 85,000 participants
• Retinal images & optical coherence tomograms
• Fitness test & ECG limb leads
UK Biobank different types of biological sample: allowing a wide range of different assays
• Na+ EDTA
– Fractions collected: plasma; buffy coat; red cells
– Potential assays: plasma proteome and metabonome; assays of genomic DNA; membrane lipids and heavy metals
• Lithium Heparin (PST)
– Fractions collected: plasma
– Potential assays: plasma proteome and metabonome (without haemolysis)
• Silica clot accelerator (SST)
– Fractions collected: serum
– Potential assays: serum proteome and metabonome (without haemolysis)
• Acid citrate dextrose
– Fractions collected: whole blood
– Potential assays: assays of DNA extracted from EBV immortalised cell lines; (B-cell transcriptome)
• EDTA
– Fractions collected: whole blood
– Potential assays: standard haematological parameters
• Tempus RNA stabilisation
– Fractions collected: whole blood with lysis reagent
– Potential assays: blood transcriptome; representative transcriptomes of other tissues
• Urine
– Fractions collected: urine
– Potential assays: urine proteome and metabonome; gut microbiome
• Saliva
– Fractions collected: mixed saliva sample
– Potential assays: salivary proteome and metabonome; salivary microbiome; (mucosal proteome and metabonome)
Further enhancements of the phenotyping of UK Biobank participants currently being conducted
• Web-based assessments of diet completed
Web-based dietary assessment: 24-hr recall
• Design considerations:
– Easy and quick: takes only 10-15 minutes
– Automated data collection and coding
– Repeatable (capturing seasonal variation)
– Detailed enough to estimate nutrient intake
• Over 200,000 participants completed the questionnaire at least once, and about 90,000 did so more than once
Future web-based assessments for exposures
• Cognitive function
– Repeat assessment of baseline measures
– Broaden cognitive phenotyping with new measures
– Complements enhanced cognitive function assessment that is planned for the imaging assessment visit
• Occupational history
– Information about all previous occupations (not just latest)
– Greater detail on type of work and duration
• Physical activity questionnaire (RPAQ)
– Complement data from activity monitor
Further enhancements of the phenotyping of UK Biobank participants currently being conducted
• Web-based assessments of diet completed; and next to be cognition/mental health (2014)
• Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15)
• ~45% of participants agree to wear one
• Willing participants sent device by mail
• It is to be worn continuously for 7 days
• Returned by mail and data downloaded
• Device cleaned and sent to next participant
• 100K participants from mid-2013 to mid-2015 (50,000 complete data-sets already obtained)
UK Biobank wrist-worn accelerometer
Further enhancements of the phenotyping of UK Biobank participants currently being conducted
• Web-based assessments of diet completed; and next to be cognition/mental health (2014)
• Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15)
• Biobank chip to genotype (GWAS; candidate SNPs; exome) all participants (2013-15)
Genotyping of all UK Biobank participants
• 820K bespoke UK Biobank Affymetrix genotyping chip:
– 250,000 SNPs in a whole-genome array
– 200,000 markers for known risk factor or disease associations, copy number variation, loss of function, and insertions/deletions
– 150,000 exome markers for high proportion of non-synonymous coding variants with allele frequency over 0.02%
• Estimate (“impute”) additional genotypes by combining measured genotypes with reference sequence data
• Researchers can study associations of genotype data with biochemical risk factors and detailed phenotyping from baseline assessment, along with health outcomes
Further enhancements of the phenotyping of UK Biobank participants currently being conducted
• Web-based assessments of diet completed; and next to be cognition/mental health (2014)
• Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15)
• Biobank chip to genotype (GWAS; candidate SNPs; exome) all participants (2013-15)
• Standard panel of assays (e.g. lipids; clotting) on samples from all participants (2014-15)
Rationale for assaying many standard markers in baseline samples from all 500,000 participants
• Cost-effective way of increasing the usability of the resource for researchers, by providing data for:
– Cross-sectional analyses with prevalent disease
– Identification of subsets based on assay values
• Conducting these assays in all of the participants at the same time should facilitate good quality control
• Lower cost for conducting all of these assays at one time rather than in multiple retrievals and assays
• Facilitates management of depletable samples
Consideration of a proposal to conduct assays of biomarkers of infectious disease in all participants
• Request from the international research community to facilitate studies of the associations of infectious agents with disease (in particular, different types of cancer)
• Plan would be to assay a panel of infectious agents (e.g. HPV, Hepatitis B & C, HBV, EBV, H. pylori) in the baseline sample collected from all 500,000 participants
• As with the biochemical and genetic assays that are being conducted, assays of a wide range of infectious agents would increase the efficient use of the resource
• Detailed proposal for funding is now being developed
Further enhancements of the phenotyping of UK Biobank participants currently being conducted
• Web-based assessments of diet completed; and next to be cognition/mental health (2014)
• Wrist-worn accelerometers to be mailed to all participants who agree to wear one (2013-15)
• Biobank chip to genotype (GWAS; candidate SNPs; exome) all participants (2013-15)
• Standard panel of assays (e.g. lipids; clotting) on samples from all participants (2014-15)
• Information from multiple imaging modalities (e.g. brain/heart/body MRI; bone/joint DEXA)
Imaging of 100,000 UK Biobank participants
• MRI of brain, heart and abdomen
• DEXA of bones, joints and body
• Ultrasound of carotid arteries
• Shortened baseline assessment plus more detailed cognitive function tests and ECG to detect rhythm disturbances
Pilot phase: 4-6,000 people in 1 centre (2014-15)
Main phase: 95,000 people in 3 centres (2015-19)
Opportunities for repeat imaging in sub-sets (e.g. as part of MRC’s focus on dementia)
Body Mass Index (BMI) vs Heart Disease and Stroke (PSC: 1M people followed for 12 years; Lancet 2009)
[Figure: annual deaths per 1000 (floated so mean = PSC rates at age 65-69; doubling scale, 10 to 160) against baseline BMI (15 to 50 kg/m²), for heart disease (18 237 deaths) and stroke (6122 deaths)]
Adjusted for age, sex, smoking & study; first 5 years of follow-up excluded
At BMI >25: 5 units higher BMI associated with ~40% higher IHD & stroke mortality
At BMI <25: positive association continues for IHD, but not for stroke
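As a worked example of the "~40% per 5 units" figure, the sketch below converts it into hazard ratios for other BMI differences. It assumes the association is log-linear across the whole range above 25 kg/m²; that is an illustrative assumption, since the slide quotes only the 5-unit figure. The function name is hypothetical.

```python
HR_PER_5_UNITS = 1.40  # from the slide: ~40% higher mortality per 5 kg/m² above 25


def hazard_ratio(delta_bmi: float, hr_per_5: float = HR_PER_5_UNITS) -> float:
    """Hazard ratio for a BMI difference, assuming a log-linear association."""
    return hr_per_5 ** (delta_bmi / 5.0)


for delta in (1, 5, 10):
    print(f"{delta:>2} kg/m² higher BMI: hazard ratio ~ {hazard_ratio(delta):.2f}")
```

On this assumption, 1 unit corresponds to roughly 7% higher mortality and 10 units to roughly a doubling (1.4 × 1.4 = 1.96). The calculation should not be applied below BMI 25, where the slide notes that the association with stroke does not continue.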
Similar age, gender, BMI & % body fat, but different amounts of INTERNAL FAT
5.86 litres of internal fat
1.65 litres of internal fat
Atrial fibrillation (AF): prevalence and mortality during the period between 1993 and 2007
Piccini et al. Circulation: Cardiovascular Quality and Outcomes. 2012
Prevalence: increasing
Mortality: little change
Consideration of prolonged cardiac monitoring
• Cardiac arrhythmias (especially AF)
– can indicate significant underlying cardiac disease
– can directly cause significant morbidity and mortality
– important risk factors for cardio-embolic events (esp. stroke)
• Detection requires prolonged monitoring
– many are intermittent (e.g. paroxysmal AF)
– substantial under-detection with standard 12 lead ECG
– AF increases with age (<50 years: <1%; >80 years: 10%+)
• No large-scale population-based prospective studies with prolonged monitoring, so the full extent/impact of AF on health outcomes is likely to have been underestimated
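Why intermittent arrhythmias call for prolonged monitoring can be sketched with a deliberately simple model. It assumes that a person with paroxysmal AF has a detectable episode on any given day with a fixed probability, independently of other days. Both that independence assumption and the 5% daily probability are hypothetical and chosen only for illustration.

```python
def detection_probability(p_daily: float, days: int) -> float:
    """Chance of recording at least one episode in `days` days of monitoring.

    Assumes episodes occur independently each day with probability p_daily.
    """
    if not 0.0 <= p_daily <= 1.0:
        raise ValueError("p_daily must be between 0 and 1")
    return 1.0 - (1.0 - p_daily) ** days


P_DAILY = 0.05  # hypothetical daily probability of a detectable episode

for days in (1, 7, 14):
    print(f"{days:>2} days of monitoring: "
          f"{detection_probability(P_DAILY, days):.0%} chance of detection")
```

Under this model the chance of missing every episode falls geometrically with monitoring time, so a brief recording would miss most intermittent cases that a week or two of continuous recording could pick up.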
Example of device for prolonged arrhythmia detection
iRhythm Zio Patch
• Has been used in 18,000 people
• Non-invasive stick-on patch
• Comfortable (median wear 12 days)
• Can be applied in clinic or at home
• Beat-to-beat ECG recording
• Validated against reference Holter
• Potentially recyclable device chip which stores data for downloading
Planning to pilot feasibility and acceptability during imaging pilot
UK Biobank: Centralised follow-up of health
• Death and cancer registries
• In-patient and out-patient hospital episodes (including psychiatric) and related procedure registries
• Primary care records of health conditions, prescriptions, diagnostic tests and other investigations
• Other health-related: disease registries; dispensing records; imaging; screening; dental records
• Direct to participants: self-reported medical conditions; treatments actually being taken; degree of functional impairment; cognitive and psychological scores
Health outcome data-linkage challenges
• Regulation, bureaucracy, and permissions (despite explicit consent from participants)
• Data transfer, matching and coding queries
• Understanding different data structures
• Mapping between coding systems
• Mapping between different countries
• Presenting outcome data to researchers
– Original outcome codes
– Post-adjudication outcomes
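Mapping between coding systems, while keeping the original code alongside the derived outcome, could be sketched as below. The two prefixes shown (ICD-10 I21 and ICD-9 410, both acute myocardial infarction) are real codes, but the lookup table, the function name and the prefix-matching rule are hypothetical and are not UK Biobank's actual algorithm.

```python
# Hypothetical lookup: coding system -> {code prefix: study outcome label}.
OUTCOME_PREFIXES = {
    "ICD-10": {"I21": "myocardial infarction"},
    "ICD-9": {"410": "myocardial infarction"},
}


def map_code(system: str, code: str) -> dict:
    """Map one coded record to a study outcome, preserving the original code."""
    normalised = code.replace(".", "").strip().upper()
    outcome = None
    for prefix, label in OUTCOME_PREFIXES.get(system, {}).items():
        if normalised.startswith(prefix):
            outcome = label
            break
    # Unmapped codes and unknown coding systems are returned with outcome None.
    return {"system": system, "original_code": code, "outcome": outcome}


records = [("ICD-10", "I21.0"), ("ICD-9", "410.9"), ("ICD-10", "J44.9")]
for system, code in records:
    print(map_code(system, code))
```

Returning the original code with every mapped record reflects the two presentation options on the slide: researchers can work from the original outcome codes or from a derived outcome.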
Meaning of coded data from health records
• What do the coded data actually tell us?
• Characteristics of coded data
– How accurate?
– How detailed?
– How complete?
• Do we need to go beyond the coded data?
UK Biobank: Expected numbers of participants developing diseases during long-term follow-up
Condition 2012 2017 2022
Diabetes 10,000 25,000 40,000
MI/CHD death 7,000 17,000 28,000
Stroke 2,000 5,000 9,000
COPD 3,000 8,000 14,000
Breast cancer 2,500 6,000 10,000
Colorectal cancer 1,500 3,500 7,000
Prostate cancer 1,500 3,500 7,000
Lung cancer 800 2,000 4,000
Hip fracture 800 2,500 6,000
Rh. arthritis 800 2,000 3,000
Alzheimer’s 800 3,000 9,000
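To put the expected numbers in proportion, the sketch below expresses the 2022 column as a percentage of the 500,000 participants. This is crude arithmetic on the figures in the table: it ignores deaths, withdrawals and people with more than one condition, and the variable and function names are hypothetical.

```python
COHORT_SIZE = 500_000

# Expected numbers of cases by 2022, copied from the table above.
EXPECTED_CASES_2022 = {
    "Diabetes": 40_000,
    "MI/CHD death": 28_000,
    "Stroke": 9_000,
    "COPD": 14_000,
    "Breast cancer": 10_000,
    "Colorectal cancer": 7_000,
    "Prostate cancer": 7_000,
    "Lung cancer": 4_000,
    "Hip fracture": 6_000,
    "Rh. arthritis": 3_000,
    "Alzheimer's": 9_000,
}


def percent_of_cohort(cases: int, cohort_size: int = COHORT_SIZE) -> float:
    """Expected cases as a crude percentage of all participants."""
    return 100.0 * cases / cohort_size


for condition, cases in EXPECTED_CASES_2022.items():
    print(f"{condition:<18} {cases:>6,}  {percent_of_cohort(cases):.1f}%")
```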
General strategy for outcome adjudication
• Avoid false positive cases (but tolerate some false negatives)
• Geographical generalisability
• Cost-effectiveness
• Future-proofed
• Scalability
• Staged approach:
– Ascertain
– Confirm
– Classify
Staged approach to outcome adjudication
• ASCERTAINMENT of suspected cases
– Characteristics: cost-effective; feasible; scalable
– Possible data sources: death registers; cancer registers; hospital episodes; primary care records; web-based questionnaires
• CONFIRMATION of “case-ness”
– Characteristics: as above, but greater cost/lower feasibility
– Possible data sources: cross-referencing e-records; disease registers
• CLASSIFICATION of disease cases
– Characteristics: more involved and costly per case
– Possible data sources: review of clinical records; tumour collections/assays; specialised databases (e.g. imaging)
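The ascertain, confirm, classify sequence could be sketched as a simple status function. The rule used here for confirmation (at least two independent sources flagging the same condition) is a hypothetical stand-in for cross-referencing e-records; the working groups' actual protocols are not described in these slides, and the class and field names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class SuspectedCase:
    participant_id: str
    # Data sources that flagged the condition (e.g. "hospital episodes").
    sources: Set[str] = field(default_factory=set)
    # Disease subtype, set only once detailed records have been reviewed.
    subtype: Optional[str] = None


def adjudication_stage(case: SuspectedCase, min_sources_to_confirm: int = 2) -> str:
    """Return the furthest stage reached under the hypothetical rule above."""
    if not case.sources:
        return "not ascertained"
    if len(case.sources) < min_sources_to_confirm:
        return "ascertained"
    if case.subtype is None:
        return "confirmed"
    return "classified"


cases = [
    SuspectedCase("P1", {"hospital episodes"}),
    SuspectedCase("P2", {"hospital episodes", "primary care records"}),
    SuspectedCase("P3", {"death register", "hospital episodes"}, subtype="ischaemic"),
]
for case in cases:
    print(case.participant_id, adjudication_stage(case))
```

Requiring agreement before a case counts as confirmed trades some false negatives for fewer false positives, in line with the general strategy stated earlier.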
Expert Working Groups developing protocols for ascertainment, confirmation and classification
Cancer
Diabetes
Cardiac outcomes
Stroke
Mental health outcomes
Ocular outcomes
Neurodegenerative outcomes
Respiratory outcomes
Musculoskeletal outcomes
Pilots progressing well; preparing for scaling up of algorithms and then for web adjudication
Pilots commencing
Pilots being developed
UK Biobank: Principles of Access
• UK Biobank is available to all bona fide researchers for all types of health-related research that is in the public interest
• No preferential or exclusive access (and, in particular, access does not involve “collaboration” with UK Biobank)
• Researchers have to pay for access to the Resource for their proposed research on a cost-recovery basis only
• Access to the biological samples that are limited and depletable will be carefully controlled and coordinated
• Researchers are required to publish their findings and return the data so that other researchers can use them
What makes UK Biobank special?
• PROSPECTIVE: It can assess the full effects of a particular exposure (such as smoking) on all types of health outcome (such as cancer, vascular disease, lung disease, dementia)
• DETAILED: The wide range of questions, measures and samples at baseline allows good assessment of exposures, and outcome adjudication allows good disease classification
• BIG: Inclusion of a large number of participants allows reliable assessment of the causes of a wide range of diseases, and of the combined impact of many different exposures
Unique combination of BREADTH and DEPTH