TRANSCRIPT
Clinical Genomics at Scale: Synthesizing and Analyzing “Big Data” From Thousands of Patients
Brady Bernard, PhD, Senior Research Scientist, Institute for Systems Biology, Seattle, WA

Dr. Bernard's research interests are in cancer drug discovery and clinical genomics. He is currently part of the ISB Genome Data Analysis Center (GDAC) within The Cancer Genome Atlas (TCGA) network. In that role, he has developed novel computational methods and analyses in support of TCGA network research and publications, and has provided scientific guidance for the data exploration tools and algorithms developed by the team. Dr. Bernard has led the group's research efforts and contributions to several TCGA Analysis Working Groups, particularly in the area of heterogeneous data integration and graph analysis. In collaboration with experts in functional genomics, he has integrated TCGA and RNAi screening data to prioritize novel targets and tumor types for drug discovery and repurposing. His research in cancer genomics has resulted in several proffered presentations at TCGA symposia and AACR meetings on distinct topics, a First Prize in the YarcData Graph Analytics Challenge, and a Life Science Discovery Fund grant to further the development of ISB's cancer genomics web portals. In the area of clinical genomics, Dr. Bernard co-leads a collaboration with the Inova Translational Medicine Institute (ITMI) to provide analytic support and develop scalable infrastructure for the integration of clinical data with whole genome sequences and molecular data from thousands of patients. Related to this effort, Dr. Bernard has worked with the PRE-EMPT Global Pregnancy Collaboration (CoLab) as well as the Crohn's and Colitis Foundation of America (CCFA) to advise on the study design and infrastructure of large-scale clinical genomics programs.
Annual Quality Congress Breakout Session, Sunday, October 4, 2015 Clinical Genomics at Scale: Synthesizing and Analyzing “Big Data” From Thousands of Patients Objective: Define systems biology and relate this concept to the NICU context.
Clinical Genomics: Scalable Analysis Across Thousands of Patients
Brady Bernard, PhD
October 4, 2015
Clinical Genomics: Scalable Analysis Across Thousands of Patients

Brady Bernard, PhD, Sr. Research Scientist
Disclosure
• Brady Bernard does not have any financial arrangements or affiliations with a commercial entity.
• Brady Bernard will not be discussing the unlabeled use of a commercial product in his presentation.
Example ‘Big Science’ Projects
• The Cancer Genome Atlas (TCGA)
– Genomic and molecular characterization of 30 cancers across thousands of primary tumor samples
• Inova Translational Medicine Institute (ITMI)
– Analysis of thousands of whole genome sequences integrated with clinical data
TCGA Research Network
Clinicians
Researchers
Software engineers
Bioinformaticians
TCGA data and biospecimen flow

Inova Translational Medicine Institute (ITMI)
• ITMI aims to assemble one of the world’s largest collections of whole genome sequences in a single database to enable personalized healthcare and spur biomedical research
• Example projects:
– Families with full-term and preterm births
– Longitudinal study: first 1,000 days of life
– Congenital anomalies
High-level clinical genomics workflow
• Clinical: EMR, surveys, phenotypes, data cleansing, feature merging
• Genomic: sequencing, QC/QA, data management, annotation
• Analysis: characterization (especially clinical), genotype/phenotype associations, clinically relevant prediction, data integration, interactive exploration (web portals)
Highly Collaborative
• Study design
• Phenotype prioritization
• Patient/family selection
• Data generation protocols
• Focused subgroup meetings
• Scalable models (dataset creation, analysis result exploration, …)
• Validation (predictors, variants, …)
• Publication and visualization
Functional cores
• Clinical
• Genomic & Molecular
• Computational / Informatics / Analysis
Discussion topics
• Clinical
• Genomic & Molecular
• Computational / Informatics / Analysis
Clinical considerations
• EMR and survey formulation
• Consistency (formalism) in the data
• Organization, LIMS, and metadata
• Aggregation of common data elements
• Precise phenotyping and sample size
• Prediction frameworks
EMR and Survey formulation
• Timing of events with respect to some reference point– Easy to overlook– Important for analysis and prediction
• Does lack of response mean no, don't know, didn't want to answer
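The ambiguity of a blank survey field can be made explicit in the data model rather than resolved silently. A minimal Python sketch, where the `Response` codes and value mappings are illustrative assumptions rather than the actual ITMI scheme:

```python
from enum import Enum

class Response(Enum):
    """Explicit codes so a blank field is never silently treated as 'no'."""
    YES = "yes"
    NO = "no"
    UNKNOWN = "unknown"        # respondent did not know
    DECLINED = "declined"      # respondent chose not to answer
    NOT_ASKED = "not_asked"    # field blank or question absent from this survey version

def encode_response(raw):
    """Map a raw survey value to an explicit Response code."""
    if raw is None or str(raw).strip() == "":
        # An empty field is ambiguous: flag it rather than assuming 'no'.
        return Response.NOT_ASKED
    normalized = str(raw).strip().lower()
    mapping = {"y": Response.YES, "yes": Response.YES,
               "n": Response.NO, "no": Response.NO,
               "dk": Response.UNKNOWN, "don't know": Response.UNKNOWN,
               "refused": Response.DECLINED}
    # Unrecognized free-text values are flagged as UNKNOWN for review.
    return mapping.get(normalized, Response.UNKNOWN)
```

Keeping these codes distinct through analysis lets each model decide how missingness should be handled, instead of baking one interpretation into the data.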
Consistency (formalism) in the data
• Structured data dictionary for:
– Consistency across clinical or research sites
– More seamless automation
• Feature name matching (e.g., data dictionary and column names in eCRF data)
• Misspellings and synonyms (e.g., drugs)
• Mixed delimiters
• Data provenance and versioning
– Excel files?
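Feature-name matching against a structured data dictionary can be partially automated; a minimal sketch using Python's standard `difflib`, where the dictionary entries and cutoff are hypothetical:

```python
import difflib

# Hypothetical canonical feature names (an assumption, not the actual dictionary).
DATA_DICTIONARY = ["gestational_age_weeks", "maternal_age", "antenatal_steroids"]

def match_column(column_name, dictionary=DATA_DICTIONARY, cutoff=0.8):
    """Map an eCRF column header to a canonical dictionary name.

    Tolerates case, spacing, and minor misspellings; returns None for
    unmatched columns so they can be routed to manual review rather
    than silently dropped or mis-mapped.
    """
    key = column_name.strip().lower().replace(" ", "_")
    if key in dictionary:
        return key
    close = difflib.get_close_matches(key, dictionary, n=1, cutoff=cutoff)
    return close[0] if close else None
```

Fuzzy matches should still be logged and spot-checked: a high-similarity wrong match is worse than an unmatched column.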
Organization, LIMS, and metadata
• Excel files are tempting but will not work for a large consortium
• LIMS systems can capture meta-data in a structured and queryable form
• Metadata examples:
– source tissue
– known variations across batches (software changes, etc.)
– mapped sample IDs across data types
– date sample was taken
– date sample was processed
Aggregation of common data elements
• Different sources of evidence for premature rupture of membranes:
– Antenatan_Steroids_Indication:pprom
– Antenatan_Steroids_Indication:prom
– Delivery_Result_of_Other_Reason:pprom
– Delivery_Result_of_Other_Reason:prom
– Other_Medical_Conditions:pprom
– Other_Medication_Indication:pprom
– Prom
– Reason_for_C-Section_mc:pprom
– Reason_for_C-Section_mc:prom
– Tocolytic_Therapy_Indication:pprom
– Tocolytic_Therapy_Indication:prom
– Was_the_Delivery_a_Result_of:prom
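Scattered evidence fields like these can be collapsed into one derived common data element; a sketch in which only a subset of the fields above is listed and the truthiness test is a simplifying assumption about how values are coded:

```python
# Subset of the source field names listed above (full list omitted for brevity).
PROM_FIELDS = [
    "Antenatan_Steroids_Indication:pprom",
    "Delivery_Result_of_Other_Reason:prom",
    "Other_Medical_Conditions:pprom",
    "Reason_for_C-Section_mc:prom",
    "Was_the_Delivery_a_Result_of:prom",
]

def derive_prom_flag(record):
    """Collapse scattered evidence columns into a single derived flag.

    `record` is a dict of field name -> value; any truthy value in any
    source field counts as evidence of (preterm) premature rupture of
    membranes.
    """
    return any(record.get(field) for field in PROM_FIELDS)
```

In practice the aggregation rule itself belongs in the data dictionary, so every analysis derives the flag the same way.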
Precise phenotyping and sample size
Prediction frameworks
• Goal: predict phenotypes or outcomes given clinical, genomic, and molecular data
• Data may be non-linear; classifiers should account for this
• Many clinical data elements are highly correlated or irrelevant for prediction and should be black-listed
• Cross-validation and independent data sets should be considered in advance
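Deciding folds before any model fitting keeps cross-validation honest; a standard-library-only sketch of k-fold index generation (the function name and defaults are illustrative):

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Folds are fixed up front, before any model fitting or feature
    selection, so no test sample leaks into training decisions.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # seeded for reproducibility
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test
```

An independent validation cohort, held out entirely from this loop, remains the stronger check on clinically relevant predictors.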
Discussion topics
• Clinical
• Genomic & Molecular
• Computational / Informatics / Analysis
Genomic & Molecular Data
• Annotations
• Confounding factors
– Batch effects [will happen]
– Ancestry [is very important]
Annotations
• Current, common, and updated:
– Reference genome builds
– Gene definitions
– Software and annotation versions
• Mendelian inheritance errors
• Extremely heterozygous variants
• Missing calls
• Commonly mutated segments
• …
Batch effects: Methylation plate position
Batch effects: Example variant associated with PTB
[Figure: the variant appears in 19 of 401 (5%) full-term births (FTB) versus 45 of 198 (23%) preterm births (PTB); p = 2e-11]
Batch effects: Example variant associated with PTB

[Figure: variant calls, preterm status, and admixture plotted for ~600 samples ordered by sequencing date; the variant tracks batch (sequencing software version 2.0.2.26) rather than PTB or admixture, and the same pattern appears in ISB in-house CGI genomes]
Graphic summary
Ancestry and population stratification
Population-associated variants and class imbalance are likely to lead to false positives
Family-based genomic study design
• Transmission disequilibrium
– Identify variants that are transmitted to affected offspring more frequently than expected by chance
• Advantage
– Accounts for population stratification
• Larger pedigrees can be helpful, though phenotypic and genomic data may not be available
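The transmission disequilibrium approach reduces to comparing how often heterozygous parents transmit the candidate allele to affected offspring versus not; a sketch of the classic TDT chi-square statistic:

```python
def tdt_statistic(transmitted, untransmitted):
    """McNemar-style TDT chi-square statistic.

    `transmitted` (b) and `untransmitted` (c) count heterozygous parents
    who did or did not pass the candidate allele to an affected child.
    Under the null (no association), the statistic is approximately
    chi-square distributed with 1 degree of freedom. Because only
    within-family transmissions are compared, population stratification
    does not inflate the statistic.
    """
    b, c = transmitted, untransmitted
    if b + c == 0:
        return 0.0
    return (b - c) ** 2 / (b + c)
```

For example, 60 transmissions versus 40 non-transmissions gives a statistic of 4.0, just above the 3.84 threshold for nominal significance at one degree of freedom.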
Family genomics: phasing and candidate genes
Roach et al. (2010). Science
Mendelian Inheritance Errors (MIEs)
• Can be real de novo mutations
• Most likely explanation is sequencing error
– MIEs, while infrequent, are observed orders of magnitude more often than the expected de novo mutation rate
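A trio genotype call can be checked for Mendelian consistency directly; a simplified sketch for biallelic autosomal sites, with genotypes as allele pairs (phasing and sex-chromosome handling are ignored):

```python
def is_mendelian_error(child, mother, father):
    """Flag a child genotype inconsistent with Mendelian transmission.

    Each genotype is an unordered allele pair, e.g. ("A", "G"). The call
    is consistent if one child allele could come from the mother and the
    other from the father, in either assignment.
    """
    c1, c2 = child
    consistent = (c1 in mother and c2 in father) or \
                 (c2 in mother and c1 in father)
    return not consistent
```

A flagged site is either a sequencing error in one of the three genomes or a true de novo event, so MIE sites are natural candidates for both quality filtering and de novo follow-up.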
Accuracy and Variant call quality
[Figure: percent accuracy (100 − %MIE) vs. variant call quality score; accuracy increases from ~99.95% toward 100% as the quality score rises from 0 to 200]
• Less than 0.05% of all calls are MIE
• Less than 0.002% MIE above a quality score of 80
• With family trios, sequencing errors can be identified and spurious associations (type I errors) mitigated, enabling the use of whole genome sequences in the clinical setting
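The quality-score relationship above suggests a simple filtering step; a sketch in which the call record layout and `qual` key are assumptions:

```python
def filter_by_quality(calls, min_quality=80):
    """Keep only variant calls at or above the quality threshold.

    In the data shown above, MIE rates fall below 0.002% for calls
    with a quality score above 80.
    """
    return [call for call in calls if call["qual"] >= min_quality]

def percent_accuracy(n_calls, n_mie):
    """Accuracy as plotted in the figure: 100 - %MIE."""
    return 100.0 - 100.0 * n_mie / n_calls
```

The threshold is a trade-off: raising it trims sequencing errors but also discards some genuine variants, so the choice should be tuned against trio-derived MIE rates in each data set.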
Genomic & Molecular Data Recommendations
• Clinical analysis can inform genomic study design
– Phenotype definition and prioritization is critical
• Matched and balanced cases and controls
– minimize batch effects
– improve statistical strength
– mitigate confounding factors
• Larger segregating pedigrees reduce the number of candidate genes; potentially hybrid approaches
• Run nuclear families (and multigenerational families, if possible) together in the same batch
• Quality control
• Re-run 'controls' across batches
• Maintain highly detailed annotations on dates, reagents, sequencing runs, software versions, tissues, …
• External data sets from the data provider to mitigate batch effects
• SNP arrays for sample identifiability and case/control matching
Additional data considerations
• Staging and 'production' environments
– As data are generated, upload to a staging environment, then QC and structure for analysis/consumption
• Data freezes
– Create common data sets, annotation pipelines, and files for collaborative analysis
• SNP arrays for sample identifiability and case/control matching, especially as the number of data types and source sites increases
• De-identification, HIPAA, and universally unique identifiers (UUIDs)
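De-identification with UUIDs can be sketched as a stable one-way mapping; the in-memory map below is illustrative, and a real system would persist the crosswalk securely, separately from the de-identified data, per HIPAA:

```python
import uuid

# Hypothetical in-memory crosswalk; in production this would live in a
# secured, access-controlled store separate from the research data.
_id_map = {}

def deidentify(patient_id):
    """Assign each patient a stable UUID.

    The same source identifier always maps to the same UUID, so records
    can be linked across data types and sites, while the UUID itself
    reveals nothing about the original identifier.
    """
    if patient_id not in _id_map:
        _id_map[patient_id] = str(uuid.uuid4())
    return _id_map[patient_id]
```

Random version-4 UUIDs avoid the pitfall of hashing the source identifier, which can be reversed by dictionary attack when the identifier space is small.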
Discussion topics
• Clinical
• Genomic & Molecular
• Computational / Informatics / Analysis
Computing and collaborative projects
• Cloud computing
– Consolidate data to a centralized source
– Scalable computing
– Data backup
– Minimize IT
• Workflow management systems
• Web portals
NCI cancer cloud pilot
Scalable Genomics Technologies and Architecture
[Architecture diagram]
• 7,000 genomes; petabyte-scale data
• Quality control, re-processing, batch assessment, data normalization, organization/structure
• Archival storage (Google Nearline): 100 GB/subject reads (BAM), 2 GB/subject variants (VCF)
• Distributed databases (Google Genomics / BigQuery): billions of unique variants, terabytes of high-access data
• Parallel compute/analysis (Google Compute Engine)
• Staging file server and bioinformatic pipelines on an in-house cluster
• Annotations:
– Google standard
– Public (significant ETL and data modeling)
– ISB proprietary (aggregated over thousands of 'control' sequences)
Open source workflow management systems
• Assist with provenance, data access, analysis, complex workflows, reproducible science
Web portals
• Project summaries, reports, auditing
• Data access
• Research & dissemination– Interactive exploration & dynamic analysis
https://itmi.systemsbiology.net/ptb/
Additional computational considerations
• Security/authentication/access controls
• Wikis/project pages (e.g., Confluence) and sub-teams (e.g., analysis working groups)
• Issue and project tracking (e.g., JIRA)
• Listservs and management
• Code repositories (e.g., GitHub)
Discussion topics
• Clinical
• Genomic & Molecular
• Computational / Informatics / Analysis
Concluding thoughts
• There are many potential [and avoidable] pitfalls
• The infrastructure required to establish and support 'Big Science' consortium projects is significant and easy to underestimate
– Roles and responsibilities, maintenance, costs, support, data generation, QC, QA, …
• With TCGA and ITMI, the questions to be addressed with the data far exceed the bandwidth of direct participants
– significant value to the community in curated clinical, genomic, and molecular data
– what will consortium (and IRB?) guidelines be for access control and use by the broader community?