The Search for Cancer's Causes and Cures
DESCRIPTION
Elastic{on} 2015 talk on the use of Elasticsearch to index genetic variation in cancer.
TRANSCRIPT
-
The Search for Cancer's Causes and Cures
Wade L. Schulz, MD, PhD
Yale University, Department of Laboratory Medicine
-
{ } CC-BY-ND 4.0
Cancer Statistics: An Improving Outlook?
{ 1 }
[Chart: cancer incidence and mortality, rate per 100,000; axis from 0 to 600]
-
{ 2 }
Precision Medicine
Tailoring medical therapy to a particular patient's characteristics
-
Presentation to Precision Care
{ 3 }
Images adapted from Servier Medical Art, CC-BY
-
When Cells Go Bad
{ 4 }
-
Genetics in 60 Seconds
{ 5 }
-
Genetics in 60 Seconds
{ 6 }
-
Searching for Mutations
{ 7 }
Gels and Capillaries
-
Next Generation Sequencing
{ 8 }
Massively Parallel
-
NGS: The Technology
{ 9 }
-
Cost of Sequencing
{ 10 }
[Chart: cost per genome compared with Moore's Law, Sep-01 through May-14; cost axis on a log scale from $1 to $100,000,000]
-
Bases to Bytes
23 chromosomes, 21,000 genes, 3,300,000,000 base pairs
3.3e9 bases × 2 bits = 825 MB/sequence
With metadata: 150 GB/sequence
3,000,000 variants/genome
{ 11 }
How big is the genome?
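The arithmetic behind the 825 MB figure can be checked with a short sketch. It assumes decimal megabytes (10^6 bytes) and a fixed 2 bits per base; the function and constant names are illustrative and do not come from the talk.

```python
# Figures taken from the slide; names are illustrative only.
BASE_PAIRS = 3_300_000_000      # base pairs per genome
BITS_PER_BASE = 2               # A, C, G and T each fit in 2 bits
WITH_METADATA_MB = 150_000      # 150 GB per sequence, in decimal MB


def packed_size_mb(base_pairs: int, bits_per_base: int = BITS_PER_BASE) -> float:
    """Size in megabytes (10**6 bytes) of a sequence packed at a fixed bit width."""
    total_bits = base_pairs * bits_per_base
    return total_bits / 8 / 1_000_000


size_mb = packed_size_mb(BASE_PAIRS)
overhead = WITH_METADATA_MB / size_mb   # how much larger the annotated data is

print(f"{size_mb:.0f} MB per packed sequence")
print(f"metadata inflates this by a factor of about {overhead:.0f}")
```

The ratio shows why storage, rather than the raw sequence itself, is the pressing problem: most of the volume is per-read quality and alignment metadata.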
-
What are the Problems?
Constantly evolving data schema
Ability to integrate diverse data silos
Rapidly increasing needs for data storage
Need for easy, flexible analysis
{ 12 }
-
Why Elasticsearch?
- Rapid on-premise and cloud installations
- Dynamic schema that supported clinical results and annotation data
- Availability of libraries for multiple languages (NEST, elasticsearch-py)
- Tool availability (Kibana, Shield)
It's great!
{ 13 }
-
Sequencing and Interpretation Pipeline
{ 14 }
Gene Sequencing
Sequence Alignment
Quality Assurance
Variant Annotation
Clinical Interpretation
Clinical Trial Eligibility
Research Management
{galileo} {kepler}
{galileo} {galileo/kepler} {galileo/kepler}
-
What's in a Variant?
{ 15 }
60G6V:01053:03044 16 chr1 161383 0 16M * 0 0 TTTGCCAGAAAGCAAG
)///7;;6*669:1:5 ZP:B:f,0.00279573,0.0054005,2.19516e-07
ZM:B:s,244,0,242,0,0,242,2,270,494,300,0,248,36,0,0,0,272,0,204,272,398,248,246,268,270,0,0,0,302,0,0,0,550,
38,44,194,14,32,204,2,666,212,222,494,2,2,238,630,92,220,4,102,438,2,60,384,2,76,2,2,294,394,34 ZF:i:28
RG:Z:60G6V. PG:Z:tmap MD:Z:16 NM:i:0 AS:i:16 XA:Z:map4-1 XS:i:16
60G6V:00605:00113 0 chr1 415215 2 8M5I31M3S * 0 0
CCAGCCTGGGTGCGTGACAGAGCAAGACTCCGTCTAAAAAGAAAGGT
B
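The records above appear to be SAM-format alignment lines. As a rough sketch of how the eleven mandatory columns of such a line could be read, the following assumes whitespace-separated fields (real SAM files are tab-delimited; the slide text lost the tabs) and uses only part of the first record shown. The function and field names are illustrative, not taken from the talk's pipeline.

```python
# The eleven mandatory SAM columns, in order.
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}
REVERSE_STRAND_FLAG = 0x10  # bit 16 of FLAG: read aligned to the reverse strand


def parse_sam_line(line: str) -> dict:
    """Split one alignment line into named mandatory fields plus optional tags."""
    parts = line.split()
    if len(parts) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 fields in a SAM alignment line")
    record = {}
    for name, value in zip(SAM_FIELDS, parts):
        record[name] = int(value) if name in INT_FIELDS else value
    record["reverse_strand"] = bool(record["flag"] & REVERSE_STRAND_FLAG)
    record["tags"] = parts[len(SAM_FIELDS):]  # optional TAG:TYPE:VALUE fields
    return record


# First record from the slide, with only a few of its optional tags kept.
line = ("60G6V:01053:03044 16 chr1 161383 0 16M * 0 0 "
        "TTTGCCAGAAAGCAAG )///7;;6*669:1:5 ZF:i:28 MD:Z:16 XS:i:16")
record = parse_sam_line(line)
print(record["rname"], record["pos"], record["cigar"], record["reverse_strand"])
```

A record like this describes a single read; many overlapping reads are summarized into one variant document before anything is indexed.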
-
What's in a Variant?
{ 16 }
{
  "chromosome": "chr7",
  "position": 148506396,
  "type": "snv",
  "refAllele": "A",
  "altAllele": "C",
  "totalReads": 1998,
  "forwardReads": 1038,
  "forwardRefReads": 524,
  "forwardAltReads": 514,
  "reverseReads": 960,
  "reverseRefReads": 500,
  "reverseAltReads": 460,
  "refReads": 1024,
  "altReads": 974,
  "vaf": 48.749,
  "variantRegion": "intronic",
  "variantEffect": "",
  "snvEffect": "A>C",
  "gene": "EZH2"
}
- Variant location in genome
- Nucleotide change
- Sequencing statistics
- Variant prevalence in specimen
- Variant coding/protein effects
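The sequencing statistics in the document above are internally consistent, and the "vaf" value (variant prevalence in the specimen) follows from the read counts. The sketch below recomputes it, assuming the value is simply alternate reads as a percentage of total reads rounded to three decimals, which matches the numbers shown. The helper names are hypothetical.

```python
# Read counts copied from the variant document on the slide.
variant = {
    "totalReads": 1998,
    "forwardReads": 1038, "forwardRefReads": 524, "forwardAltReads": 514,
    "reverseReads": 960, "reverseRefReads": 500, "reverseAltReads": 460,
    "refReads": 1024, "altReads": 974,
}


def variant_allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Percentage of reads carrying the alternate allele, to three decimals."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return round(100.0 * alt_reads / total_reads, 3)


def strand_counts_consistent(v: dict) -> bool:
    """Check that the per-strand counts add up to the reported totals."""
    return (
        v["forwardRefReads"] + v["forwardAltReads"] == v["forwardReads"]
        and v["reverseRefReads"] + v["reverseAltReads"] == v["reverseReads"]
        and v["forwardRefReads"] + v["reverseRefReads"] == v["refReads"]
        and v["forwardAltReads"] + v["reverseAltReads"] == v["altReads"]
        and v["forwardReads"] + v["reverseReads"] == v["totalReads"]
    )


vaf = variant_allele_frequency(variant["altReads"], variant["totalReads"])
print(vaf, strand_counts_consistent(variant))
```

A consistency check of this kind is one possible quality-assurance step before a document is indexed; whether the talk's pipeline does this is not stated.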
-
{Elastic} Searching for Meaning
{ 17 }
Azure Elasticsearch
Local SQL and Elasticsearch
OMIM
COSMIC
dbSNP
ClinVar
Public Databases
Sequencers
Variant Analysis
Effect Prediction
Public Variant Data
Private Variant Data
-
{Elastic} Searching for Meaning
{ 18 }
OMIM
COSMIC
dbSNP
ClinVar
Public Databases
Sequencers
Variant Analysis
Effect Prediction
Public Variant Data
Private Variant Data
MVC Application (NEST)
-
Kibana Drilldown
{ 19 }
Rapid population stats
Physicians/researchers can quickly analyze data
Integration with health record
Demographics
Laboratory testing
Comorbidities
Treatment information
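Population statistics of this kind usually come down to an aggregation request. The following is a minimal sketch of a search body that counts variants per gene above a prevalence threshold. It is not the query used in the talk: the field names ("gene", "vaf") are borrowed from the variant document shown earlier, "gene" is assumed to be mapped as an exact-match field, and the syntax follows recent Elasticsearch releases, which may differ from the 2015 version.

```python
import json


def variants_per_gene_body(min_vaf: float, top_n: int = 10) -> dict:
    """Build a search body that counts variants per gene above a VAF cutoff."""
    return {
        "size": 0,  # return only aggregation buckets, no individual hits
        "query": {
            "bool": {
                "filter": [{"range": {"vaf": {"gte": min_vaf}}}]
            }
        },
        "aggs": {
            "by_gene": {"terms": {"field": "gene", "size": top_n}}
        },
    }


body = variants_per_gene_body(min_vaf=5.0, top_n=5)
print(json.dumps(body, indent=2))
```

The resulting dictionary could be sent through a client such as elasticsearch-py or pasted into a Kibana console; Kibana visualizations build comparable aggregations behind the scenes.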
-
Kibana Drilldown
{ 20 }
-
Service Integration
{ 21 }
Predictive Algorithms
Quality Assurance
[Chart: axis from -3 to 3]
Variant Database
Clinical Interpretation System
Web Service Interfaces
Custom Validation Scripts
Third-Party Data Analysis Software
-
Data Sharing
{ 22 }
Variant Database
Clinical Interpretation System
Web Service Interfaces
-
Conclusions
Clinical implications
- Genetic sequencing and clinical consultation complete within one week of biopsy
- Integrated multiple analysis pipelines for clinical interpretation and research applications
- Frequently identify patients eligible for clinical trials
System statistics
- Two Elasticsearch clusters
- Over 60 million variant annotations
- Nearly 10 million documents related to cancer-associated mutations
- Kibana and custom web applications using NEST for data visualization
{ 24 }
-
Thank you!
Wade L. Schulz, MD, PhD
http://www.wadeschulz.com
Many images adapted from Servier Medical Art, CC-BY
Henry Rinder MD, Richard Torres MD, Christopher Tormey MD, Brian Smith MD, John Howe PhD,
Karl Hager PhD, Rodion Rathbone MD, Nathaniel Price, Alexa Siddon MD
-
This work is licensed under the Creative Commons
Attribution-NoDerivatives 4.0 International License.
To view a copy of this license, visit:
http://creativecommons.org/licenses/by-nd/4.0/
or send a letter to:
Creative Commons
PO Box 1866
Mountain View, CA 94042
USA
{ 25 }