TRANSCRIPT
Clinical Genomics at Scale: Synthesizing and Analyzing “Big Data” From Thousands of Patients
Brady Bernard, PhD, Senior Research Scientist, Institute for Systems Biology, Seattle, WA

Dr. Bernard's research interests are in cancer drug discovery and clinical genomics. He is currently part of the ISB Genome Data Analysis Center (GDAC) within The Cancer Genome Atlas (TCGA) network. In that role, he has developed novel computational methods and analyses in support of TCGA network research and publications, and has provided scientific guidance for the data exploration tools and algorithms developed by the team. Dr. Bernard has led the group's research efforts and contributions to several TCGA Analysis Working Groups, particularly in the area of heterogeneous data integration and graph analysis. In collaboration with experts in functional genomics, he has integrated TCGA and RNAi screening data to prioritize novel targets and tumor types for drug discovery and repurposing. His research in cancer genomics has resulted in several proffered presentations at TCGA symposia and AACR meetings on distinct topics, a First Prize in the YarcData Graph Analytics Challenge, and a Life Science Discovery Fund grant to further the development of ISB's cancer genomics web portals. In the area of clinical genomics, Dr. Bernard co-leads a collaboration with the Inova Translational Medicine Institute (ITMI) to provide analytic support and develop scalable infrastructure for the integration of clinical data with whole genome sequences and molecular data from thousands of patients. Related to this effort, Dr. Bernard has worked with the PRE-EMPT Global Pregnancy Collaboration (CoLab) as well as the Crohn's and Colitis Foundation of America (CCFA) to advise on the study design and infrastructure of large-scale clinical genomics programs.
Annual Quality Congress Breakout Session, Sunday, October 4, 2015 Clinical Genomics at Scale: Synthesizing and Analyzing “Big Data” From Thousands of Patients Objective: Define systems biology and relate this concept to the NICU context.
Clinical Genomics: Scalable Analysis Across Thousands of Patients
Brady Bernard, PhD
October 4, 2015
Clinical Genomics: Scalable Analysis Across Thousands of Patients

Brady Bernard, PhD, Sr. Research Scientist
Disclosure
• Brady Bernard does not have any financial arrangements or affiliations with a commercial entity.
• Brady Bernard will not be discussing the unlabeled use of a commercial product in his presentation.
Example ‘Big Science’ Projects
• The Cancer Genome Atlas (TCGA)
– Genomic and molecular characterization of 30 cancers across thousands of primary tumor samples
• Inova Translational Medicine Institute (ITMI)
– Analysis of thousands of whole genome sequences integrated with clinical data
TCGA Research Network
Clinicians
Researchers
Software engineers
Bioinformaticians
TCGA data and biospecimen flow

Inova Translational Medicine Institute (ITMI)
• ITMI aims to assemble one of the world’s largest collections of whole genome sequences in a single database to enable personalized healthcare and spur biomedical research
• Example projects:
– Families with full-term and preterm births
– Longitudinal study: first 1,000 days of life
– Congenital anomalies
High-level clinical genomics workflow
• Clinical: EMR, surveys, phenotypes, data cleansing, feature merging
• Genomic: sequencing, QC/QA, data management, annotation
• Analysis: characterization (especially clinical), genotype/phenotype associations, clinically relevant prediction, data integration, interactive exploration (web portals)
Highly Collaborative
• Study design
• Phenotype prioritization
• Patient/family selection
• Data generation protocols
• Focused subgroup meetings
• Scalable models (dataset creation, analysis result exploration, …)
• Validation (predictors, variants, …)
• Publication and visualization
Functional cores
• Clinical
• Genomic & Molecular
• Computational / Informatics / Analysis
Discussion topics
• Clinical
• Genomic & Molecular
• Computational / Informatics / Analysis
Clinical considerations
• EMR and survey formulation
• Consistency (formalism) in the data
• Organization, LIMS, and metadata
• Aggregation of common data elements
• Precise phenotyping and sample size
• Prediction frameworks
EMR and Survey formulation
• Timing of events with respect to some reference point– Easy to overlook– Important for analysis and prediction
• Does lack of response mean no, don't know, didn't want to answer
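The ambiguity of a blank survey field can be made explicit in the data model rather than resolved silently. A minimal Python sketch, where the `Response` codes and value mappings are illustrative assumptions rather than the actual ITMI scheme:

```python
from enum import Enum

class Response(Enum):
    """Explicit codes so a blank field is never silently treated as 'no'."""
    YES = "yes"
    NO = "no"
    UNKNOWN = "unknown"        # respondent did not know
    DECLINED = "declined"      # respondent chose not to answer
    NOT_ASKED = "not_asked"    # field blank or question absent from this survey version

def encode_response(raw):
    """Map a raw survey value to an explicit Response code."""
    if raw is None or str(raw).strip() == "":
        # An empty field is ambiguous: flag it rather than assuming 'no'.
        return Response.NOT_ASKED
    normalized = str(raw).strip().lower()
    mapping = {"y": Response.YES, "yes": Response.YES,
               "n": Response.NO, "no": Response.NO,
               "dk": Response.UNKNOWN, "don't know": Response.UNKNOWN,
               "refused": Response.DECLINED}
    # Unrecognized free-text values are flagged as UNKNOWN for review.
    return mapping.get(normalized, Response.UNKNOWN)
```

Keeping these codes distinct through analysis lets each model decide how missingness should be handled, instead of baking one interpretation into the data.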
Consistency (formalism) in the data
• Structured data dictionary for:
– Consistency across clinical or research sites
– More seamless automation
• Feature name matching (e.g., data dictionary and column names in eCRF data)
• Misspellings and synonyms (e.g., drugs)
• Mixed delimiters
• Data provenance and versioning
– Excel files?
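Feature-name matching against a structured data dictionary can be partially automated; a minimal sketch using Python's standard `difflib`, where the dictionary entries and cutoff are hypothetical:

```python
import difflib

# Hypothetical canonical feature names (an assumption, not the actual dictionary).
DATA_DICTIONARY = ["gestational_age_weeks", "maternal_age", "antenatal_steroids"]

def match_column(column_name, dictionary=DATA_DICTIONARY, cutoff=0.8):
    """Map an eCRF column header to a canonical dictionary name.

    Tolerates case, spacing, and minor misspellings; returns None for
    unmatched columns so they can be routed to manual review rather
    than silently dropped or mis-mapped.
    """
    key = column_name.strip().lower().replace(" ", "_")
    if key in dictionary:
        return key
    close = difflib.get_close_matches(key, dictionary, n=1, cutoff=cutoff)
    return close[0] if close else None
```

Fuzzy matches should still be logged and spot-checked: a high-similarity wrong match is worse than an unmatched column.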
Organization, LIMS, and metadata
• Excel files are tempting but will not work for a large consortium
• LIMS systems can capture meta-data in a structured and queryable form
• Metadata examples:
– source tissue
– known variations across batches (software changes, etc.)
– mapped sample IDs across data types
– date sample was taken
– date sample was processed
Aggregation of common data elements
• Different sources of evidence for premature rupture of membranes:
– Antenatan_Steroids_Indication:pprom
– Antenatan_Steroids_Indication:prom
– Delivery_Result_of_Other_Reason:pprom
– Delivery_Result_of_Other_Reason:prom
– Other_Medical_Conditions:pprom
– Other_Medication_Indication:pprom
– Prom
– Reason_for_C-Section_mc:pprom
– Reason_for_C-Section_mc:prom
– Tocolytic_Therapy_Indication:pprom
– Tocolytic_Therapy_Indication:prom
– Was_the_Delivery_a_Result_of:prom
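Scattered evidence fields like these can be collapsed into one derived common data element; a sketch in which only a subset of the fields above is listed and the truthiness test is a simplifying assumption about how values are coded:

```python
# Subset of the source field names listed above (full list omitted for brevity).
PROM_FIELDS = [
    "Antenatan_Steroids_Indication:pprom",
    "Delivery_Result_of_Other_Reason:prom",
    "Other_Medical_Conditions:pprom",
    "Reason_for_C-Section_mc:prom",
    "Was_the_Delivery_a_Result_of:prom",
]

def derive_prom_flag(record):
    """Collapse scattered evidence columns into a single derived flag.

    `record` is a dict of field name -> value; any truthy value in any
    source field counts as evidence of (preterm) premature rupture of
    membranes.
    """
    return any(record.get(field) for field in PROM_FIELDS)
```

In practice the aggregation rule itself belongs in the data dictionary, so every analysis derives the flag the same way.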
Precise phenotyping and sample size
Prediction frameworks
• Goal: predict phenotypes or outcomes given clinical, genomic, and molecular data
• Data may be non-linear; classifiers should account for this
• Many clinical data elements are highly correlated or irrelevant for prediction and should be black-listed
• Cross-validation and independent data sets should be considered in advance
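Deciding folds before any model fitting keeps cross-validation honest; a standard-library-only sketch of k-fold index generation (the function name and defaults are illustrative):

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Folds are fixed up front, before any model fitting or feature
    selection, so no test sample leaks into training decisions.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # seeded for reproducibility
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test
```

An independent validation cohort, held out entirely from this loop, remains the stronger check on clinically relevant predictors.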
Discussion topics
• Clinical
• Genomic & Molecular
• Computational / Informatics / Analysis
Genomic & Molecular Data
• Annotations
• Confounding factors
– Batch effects [will happen]
– Ancestry [is very important]
Annotations
• Current, common, and updated:
– Reference genome builds
– Gene definitions
– Software and annotation versions
• Mendelian inheritance errors
• Extremely heterozygous variants
• Missing calls
• Commonly mutated segments
• …
Batch effects: Methylation plate position
Batch effects: Example variant associated with PTB
[Figure: the variant appears in 19 of 401 (5%) full-term births (FTB) versus 45 of 198 (23%) preterm births (PTB); p = 2e-11]
Batch effects: Example variant associated with PTB

[Figure: variant calls, preterm status, and admixture plotted for ~600 samples ordered by sequencing date; the variant tracks batch (sequencing software version 2.0.2.26) rather than PTB or admixture, and the same pattern appears in ISB in-house CGI genomes]
Graphic summary
Ancestry and population stratification
Population-associated variants and class imbalance are likely to lead to false positives
Family-based genomic study design
• Transmission disequilibrium
– Identify variants that are transmitted to affected offspring more frequently than expected by chance
• Advantage
– Accounts for population stratification
• Larger pedigrees can be helpful, though phenotypic and genomic data may not be available
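The transmission disequilibrium approach reduces to comparing how often heterozygous parents transmit the candidate allele to affected offspring versus not; a sketch of the classic TDT chi-square statistic:

```python
def tdt_statistic(transmitted, untransmitted):
    """McNemar-style TDT chi-square statistic.

    `transmitted` (b) and `untransmitted` (c) count heterozygous parents
    who did or did not pass the candidate allele to an affected child.
    Under the null (no association), the statistic is approximately
    chi-square distributed with 1 degree of freedom. Because only
    within-family transmissions are compared, population stratification
    does not inflate the statistic.
    """
    b, c = transmitted, untransmitted
    if b + c == 0:
        return 0.0
    return (b - c) ** 2 / (b + c)
```

For example, 60 transmissions versus 40 non-transmissions gives a statistic of 4.0, just above the 3.84 threshold for nominal significance at one degree of freedom.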
Family genomics: phasing and candidate genes
Roach et al. (2010). Science
Mendelian Inheritance Errors (MIEs)
• Can be real de novo mutations
• Most likely explanation is sequencing error
– MIEs, while infrequent, are observed orders of magnitude more often than the expected de novo mutation rate
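A trio genotype call can be checked for Mendelian consistency directly; a simplified sketch for biallelic autosomal sites, with genotypes as allele pairs (phasing and sex-chromosome handling are ignored):

```python
def is_mendelian_error(child, mother, father):
    """Flag a child genotype inconsistent with Mendelian transmission.

    Each genotype is an unordered allele pair, e.g. ("A", "G"). The call
    is consistent if one child allele could come from the mother and the
    other from the father, in either assignment.
    """
    c1, c2 = child
    consistent = (c1 in mother and c2 in father) or \
                 (c2 in mother and c1 in father)
    return not consistent
```

A flagged site is either a sequencing error in one of the three genomes or a true de novo event, so MIE sites are natural candidates for both quality filtering and de novo follow-up.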
Accuracy and Variant call quality
[Figure: percent accuracy (100 − %MIE) vs. variant call quality score; accuracy increases from ~99.95% toward 100% as the quality score rises from 0 to 200]
• Less than 0.05% of all calls are MIE
• Less than 0.002% MIE above a quality score of 80
• With family trios, sequencing errors can be identified and spurious associations (type I errors) mitigated, enabling the use of whole genome sequences in the clinical setting
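The quality-score relationship above suggests a simple filtering step; a sketch in which the call record layout and `qual` key are assumptions:

```python
def filter_by_quality(calls, min_quality=80):
    """Keep only variant calls at or above the quality threshold.

    In the data shown above, MIE rates fall below 0.002% for calls
    with a quality score above 80.
    """
    return [call for call in calls if call["qual"] >= min_quality]

def percent_accuracy(n_calls, n_mie):
    """Accuracy as plotted in the figure: 100 - %MIE."""
    return 100.0 - 100.0 * n_mie / n_calls
```

The threshold is a trade-off: raising it trims sequencing errors but also discards some genuine variants, so the choice should be tuned against trio-derived MIE rates in each data set.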
Genomic & Molecular Data Recommendations
• Clinical analysis can inform genomic study design
– Phenotype definition and prioritization is critical
• Matched and balanced cases and controls
– minimize batch effects
– improve statistical strength
– mitigate confounding factors
• Larger segregating pedigrees reduce the number of candidate genes; potentially hybrid approaches
• Run nuclear families (and multigenerational families, if possible) together in the same batch
• Quality control
• Re-run 'controls' across batches
• Maintain highly detailed annotations on dates, reagents, sequencing runs, software versions, tissues, …
• External data sets from the data provider to mitigate batch effects
• SNP arrays for sample identifiability and case/control matching
Additional data considerations
• Staging and 'production' environments
– As data are generated, upload to a staging environment, then QC and structure for analysis/consumption
• Data freezes
– Create common data sets, annotation pipelines, and files for collaborative analysis
• SNP arrays for sample identifiability and case/control matching, especially as the number of data types and source sites increases
• De-identification, HIPAA, and universally unique identifiers (UUIDs)
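De-identification with UUIDs can be sketched as a stable one-way mapping; the in-memory map below is illustrative, and a real system would persist the crosswalk securely, separately from the de-identified data, per HIPAA:

```python
import uuid

# Hypothetical in-memory crosswalk; in production this would live in a
# secured, access-controlled store separate from the research data.
_id_map = {}

def deidentify(patient_id):
    """Assign each patient a stable UUID.

    The same source identifier always maps to the same UUID, so records
    can be linked across data types and sites, while the UUID itself
    reveals nothing about the original identifier.
    """
    if patient_id not in _id_map:
        _id_map[patient_id] = str(uuid.uuid4())
    return _id_map[patient_id]
```

Random version-4 UUIDs avoid the pitfall of hashing the source identifier, which can be reversed by dictionary attack when the identifier space is small.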
Discussion topics
• Clinical
• Genomic & Molecular
• Computational / Informatics / Analysis
Computing and collaborative projects
• Cloud computing
– Consolidate data to a centralized source
– Scalable computing
– Data backup
– Minimize IT
• Workflow management systems
• Web portals
NCI cancer cloud pilot
Scalable Genomics Technologies and Architecture
[Architecture diagram]
• 7,000 genomes; petabyte-scale data
• Quality control, re-processing, batch assessment, data normalization, organization/structure
• Archival storage (Google Nearline): 100 GB/subject reads (BAM), 2 GB/subject variants (VCF)
• Distributed databases (Google Genomics / BigQuery): billions of unique variants, terabytes of high-access data
• Parallel compute/analysis (Google Compute Engine)
• Staging file server and bioinformatic pipelines on an in-house cluster
• Annotations:
– Google standard
– Public (significant ETL and data modeling)
– ISB proprietary (aggregated over thousands of 'control' sequences)
Open source workflow management systems
• Assist with provenance, data access, analysis, complex workflows, reproducible science
Web portals
• Project summaries, reports, auditing
• Data access
• Research & dissemination– Interactive exploration & dynamic analysis
https://itmi.systemsbiology.net/ptb/
Additional computational considerations
• Security/authentication/access controls
• Wikis/project pages (e.g., Confluence) and sub-teams (e.g., analysis working groups)
• Issue and project tracking (e.g., JIRA)
• Listservs and management
• Code repositories (e.g., GitHub)
Discussion topics
• Clinical
• Genomic & Molecular
• Computational / Informatics / Analysis
Concluding thoughts
• There are many potential [and avoidable] pitfalls
• The infrastructure required to establish and support 'Big Science' consortium projects is significant and easy to underestimate
– Roles and responsibilities, maintenance, costs, support, data generation, QC, QA, …
• With TCGA and ITMI, the questions to be addressed with the data far exceed the bandwidth of direct participants
– significant value to the community in curated clinical, genomic, and molecular data
– what will consortium (and IRB?) guidelines be for access control and use by the broader community?