HLS402: Getting into Your Genes: The Definitive Guide to Using Amazon EMR, Amazon ElastiCache, and Amazon S3 to Deliver High-Performance Scientific Applications
DESCRIPTION
The key to fighting cancer through better therapeutics is a deep understanding of the basic biology of this disease at a cellular and molecular level. Comprehensive analysis of cancer mutations in specific tumors or cancer cell lines using Life Technologies sequencing and real-time PCR systems generates gigabytes to terabytes of data every day. Our customers bring together this data in studies that seek to discover the genetic fingerprint of cancer. The data typically translates to millions of records in databases that require complex algorithmic processing, cross-application analysis, and interactive visualizations with real-time response (2-3 seconds) to enable users to consume large volumes of complex scientific information. We have chosen the AWS platform to bring this new era of data analysis power to our customers by using technologies such as Amazon S3, ElastiCache, and DynamoDB for storage and fast access, and Amazon EMR for parallelizing complex computations. Our talk tells the story with rich details about challenges and roadblocks in building data-intensive, highly interactive applications in the cloud. We also highlight enhanced customer workflows and highly optimized applications with orders-of-magnitude improvements in performance and scalability.
TRANSCRIPT
Puneet Suri, Thermo Fisher Scientific
Shakila Pothini, Thermo Fisher Scientific
Sami Zuhuruddin, Amazon Web Services
November 12, 2014 | Las Vegas, NV
HLS402
Getting into Your Genes: The Definitive Guide to Using Amazon EMR, Amazon ElastiCache, and Amazon S3 to Deliver High-Performance Scientific Applications
About me
Puneet Suri
Senior Director, Software Engineering
Life Sciences Group, Thermo Fisher Scientific
follow at: @psuri connect at: [email protected]
Envisioned and developed the life sciences cloud platform for Thermo Fisher Scientific
This is why we are here…
Having an impact…
Freeing the innocent
A person was set free after 35 years in prison because of a DNA test
Surviving Cancer
A person survived pancreatic cancer thanks to a genetic approach that allowed an oncologist to focus on a specific cancer cell
Ebola
H1N1: Pandemic declared in April 2009
Need to enable this at larger scale & impact more lives
Customer needs…
store & manage large scientific data sets
A few years back
Our offerings
desktop applications
challenges with upgrade cycles, versions, etc.
limited storage and compute capacity to analyze complex & large data sets
no sharing & collaboration
no backup, archive & security
A better way… is to provide
STORAGE
COMPUTE
SCALABILITY
MEMORY
Our vision
A deep dive into our story
A day with the scientist
Get Insights
a project
Insights…
• what is causing cancer
• what drugs will work
• is therapy working
Customer pain points
• existing solutions cannot address the complexities
• excel is used painfully to manually analyze data
• multiple tools used to get the final insight
• it takes days to analyze the data
• some of the analysis workflows are not possible
Dimensions of complexity…
• millions of records per project
• thousands of users & projects
• real-time analysis of large datasets
• 2-3 seconds response time
storage | compute | performance | scalability
Our journey enabling complex customer workflows
Our iterative journey & challenges
0 start with reference architecture
1 identify scalable storage solution
2 identify scalable storage solution for large data items
3 identify solutions for real time response & queries
4 identify solutions for real time analysis of data
Reference web architecture
[Architecture diagram: User/Client → Internet → Route 53 (DNS routing) → CloudFront (CDN) → Load Balancers → auto-scaled Web Servers → Load Balancers → auto-scaled App Servers → Amazon RDS Master with synchronous replication to Amazon RDS Standby]
Why relational DB was not considered
• based on projected data and user growth over the years (hundreds of TBs), the required real-time query performance would be very hard to achieve
• needed managed scalability without sharding/re-sharding overhead and disruptions
• needed a loose schema to seamlessly enable new and cross-domain workflows
Our iterative journey & challenges
0 start with reference architecture
1 identify scalable storage solution
2 identify scalable storage solution for large data items
3 identify solutions for real time response & queries
4 identify solutions for real time analysis of data
NoSQL was the way to go
• managed scalability
• near zero administration overhead
• query performance not impacted by table size – can add billions of rows
• simple and flexible schema – new domains can be supported
• extremely fast read/write performance
Architecture with DynamoDB
[Architecture diagram: User/Client → Internet → Route 53 (DNS routing) → CloudFront (CDN) → Load Balancers → auto-scaled Web Servers → Load Balancers → auto-scaled App Servers → Amazon DynamoDB (Auto Scaling)]
What worked well with DynamoDB
• managed service with flexible schema
• managed scalability
• extremely fast READ/WRITE access, on the order of milliseconds
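The flexible-schema point above can be sketched as a single-table key design. The key names (`pk`/`sk`) and entity layout here are illustrative assumptions, not the talk's actual schema:

```python
# Hypothetical single-table layout: partition key = project id, sort key =
# "<entity>#<id>". Heterogeneous entities (runs, genes, signals) can share one
# table because DynamoDB does not enforce a fixed column set per item.
def make_item(project_id, entity, entity_id, **attrs):
    item = {"pk": project_id, "sk": f"{entity}#{entity_id}"}
    item.update(attrs)  # arbitrary per-entity attributes (flexible schema)
    return item

run = make_item("proj-1", "InstrumentRun", "run-42", instrument="qPCR", wells=384)
gene = make_item("proj-1", "Gene", "BRCA1", chromosome="17")
# With boto3 this would be table.put_item(Item=run); a Query on pk="proj-1"
# with sk beginning with "Gene#" would then return only the gene records.
```

This is how a new domain (a new entity type) can be supported without any schema migration: new items simply carry different attributes.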
Iteration 1
[Diagram: a project's data – Instrument Runs (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions); item sizes range from MBs to GBs]

              Storage   Query
Performance      ✔        ✔
Cost             ✔        ✔

Get Insights
What were the gaps
our item attribute (e.g. Instrument Run) size range > 400KB (item attribute size limitation: 64KB at the time, since raised to 400KB)
hot hash key & batch size limitations
• adding thousands of related records (e.g. Raw Signals) with a common hash key (e.g. Instrument Run) can be slow (10s of seconds)
• a large project can have ~1 million records (e.g. Raw Signals) that need to be read & written
for a large project, high read/write capacity (1000s of units) was needed (increased cost due to high READ/WRITE capacity needs)
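The batch-size limitation above is concrete: BatchWriteItem accepts at most 25 items per call, so a million Raw Signal records means tens of thousands of round trips. A minimal sketch of the chunking this forces, plus write-sharding as one common mitigation for the hot hash key (an assumed technique here, not necessarily what the team shipped):

```python
import random

BATCH_LIMIT = 25  # DynamoDB BatchWriteItem accepts at most 25 items per request

def batches(items, size=BATCH_LIMIT):
    """Split a record list into BatchWriteItem-sized chunks."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def sharded_key(run_id, n_shards=10):
    """Spread writes for one Instrument Run across n hash keys to avoid a hot key."""
    return f"{run_id}#{random.randrange(n_shards)}"

signals = list(range(1_000_000))  # stand-in for 1M Raw Signal records
n_round_trips = len(batches(signals))  # 40,000 calls for one large project
```

Each of those 40,000 calls also consumes provisioned write capacity, which is exactly the cost pressure the slide describes.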
What we needed
A solution that
• can store a huge number of related objects
• is cost effective for reading/writing large data sets
• has no limitations on batch size or item size
• allows querying into the large number of records
Our iterative journey & challenges
0 start with reference architecture
1 identify scalable storage solution : DynamoDB
2 identify scalable storage solution for large data items
3 identify solutions for real time response & queries
4 identify solutions for real time analysis of data
Architecture with DynamoDB & S3
[Architecture diagram: User/Client → Internet → Route 53 (DNS routing) → CloudFront (CDN) → Load Balancers → auto-scaled Web Servers → Load Balancers → auto-scaled App Servers → DynamoDB + Amazon S3]
Iteration 2
[Diagram: a project's data – Instrument Runs (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions); item sizes range from MBs to GBs]

              Storage   Query
Performance      ✔        ✔
Cost             ✔        ✔

Get Insights
Architecture with DynamoDB & S3
• DynamoDB was used to store small unrelated objects (KBs) that will grow to a large number (e.g. Data Files)
• Amazon S3 was used to store related larger objects (e.g. Raw Signals & Analysis Results (GBs)), each stored as a single Amazon S3 object serialized using Google protobuf
• Amazon S3 was cost effective for storing huge objects
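The serialize-to-one-object idea can be sketched as follows. The talk used Google protobuf; JSON + gzip stands in here only to keep the sketch dependency-free:

```python
import gzip
import json

def pack(records):
    """Serialize a list of related records (e.g. Raw Signals) into one blob,
    suitable for a single S3 PutObject instead of millions of DynamoDB items."""
    return gzip.compress(json.dumps(records).encode("utf-8"))

def unpack(blob):
    """Inverse of pack: recover the original record list from the S3 blob."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))

signals = [{"well": i, "ct": 21.5 + i * 0.01} for i in range(1000)]
blob = pack(signals)
# With boto3: s3.put_object(Bucket=..., Key="proj-1/raw-signals", Body=blob)
```

One object per related set turns millions of per-record writes into a single S3 PUT, which is where the cost savings over provisioned DynamoDB capacity come from.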
Real time queries for complex visualizations
What we needed
• complex visualizations require gigabytes of data to be queried in 2-3 secs and presented to the user
• visualizations are very interactive, requiring constant updates of data and quick reads & writes
• support concurrent access without any degradation in query performance
Our iterative journey & challenges
0 start with reference architecture
1 identify scalable storage solution : DynamoDB
2 identify scalable storage solution for large data items : DynamoDB + Amazon S3
3 identify solutions for fast real time response & queries
4 identify solutions for real time analysis of data
Distributed in-memory storage was the way to go
read/writes have to be quick to enable fast response times; reading & writing from Amazon S3 was not ideal.
• ElastiCache was used as IN-MEMORY storage on top of DynamoDB & Amazon S3
• all related serialized objects in Amazon S3 accessed by customers are maintained in ElastiCache as individual records
• indexes created in DynamoDB based on the query patterns so that data can be easily retrieved from ElastiCache
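The layering described above is the classic cache-aside (read-through) pattern. A minimal stand-in, where a plain dict plays the ElastiCache client and `fetch_from_s3` is a hypothetical callback for the slow S3 path:

```python
class CacheAside:
    """Read-through cache: try the in-memory store first, fall back to the
    backing store, then populate the cache so the next query is a memory hit."""

    def __init__(self, cache, fetch_from_s3):
        self.cache = cache                # stand-in for an ElastiCache client
        self.fetch_from_s3 = fetch_from_s3
        self.misses = 0

    def get(self, key):
        value = self.cache.get(key)
        if value is None:                 # miss: slow path via S3 + deserialize
            self.misses += 1
            value = self.fetch_from_s3(key)
            self.cache[key] = value       # warm the cache for subsequent reads
        return value

store = CacheAside({}, lambda key: f"deserialized:{key}")
store.get("proj-1/raw-signals")   # first access goes to the backing store
store.get("proj-1/raw-signals")   # second access is served from memory
```

The DynamoDB indexes mentioned above would supply the cache keys: a query pattern resolves to a set of keys, and each key is looked up through this path.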
Architecture with DynamoDB, Amazon S3 & ElastiCache
[Architecture diagram: User/Client → Internet → Route 53 (DNS routing) → CloudFront (CDN) → Load Balancers → auto-scaled Web Servers → Load Balancers → auto-scaled App Servers → ElastiCache + DynamoDB + Amazon S3]
Iteration 3
[Diagram: a project's data – Instrument Runs (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions); item sizes range from MBs to GBs; indexes in DynamoDB]

              Storage   Query
Performance      ✔        ✔
Cost             ✔        ✔

Get Insights
Need for real time data analysis
• analyze huge projects containing thousands of patient samples in
minutes instead of days
• a scalable solution is required to support analysis requests from
thousands of users
• existing desktop algorithms used for this analysis were not optimized for extracting parallelism in the data
Analysis solutions on the desktop
[Chart: analysis time in minutes vs. # of records (90,000 to 900,000) – desktop run time climbs from ~20 to 320+ minutes, and the desktop crashes on the largest data sets]
Get Insights
Excel nightmare
Our iterative journey & challenges
0 start with reference architecture
1 identify scalable storage solution : DynamoDB
2 identify scalable storage solution for large data items : DynamoDB + Amazon S3
3 identify solutions for fast real time response & queries : DynamoDB + Amazon S3 + ElastiCache
4 identify solutions for real time analysis of data
Amazon EMR was the way to go
1. EMR was used to perform real-time analysis of huge data sets – results in minutes instead of days
2. all small jobs are analyzed in-memory while big ones are sent to Amazon EMR
3. existing algorithms were overhauled to derive massive parallelism using the Hadoop map-reduce framework
4. as the large datasets were already in Amazon S3, Amazon S3 was used for input and output instead of HDFS – only intermediate map-reduce data lives in HDFS
5. the Amazon EMR cluster is created on-demand and shut down when done
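Points 2-3 can be sketched as a local stand-in for the Hadoop job: a mapper keys each raw signal by gene so the shuffle groups them, and a reducer aggregates per gene. The averaging aggregate and the job-size cutoff are illustrative assumptions, not the talk's actual algorithm:

```python
from collections import defaultdict

BIG_JOB = 100_000  # assumed cutoff: bigger jobs go to EMR, smaller run in-memory

def route(n_records):
    """Point 2: decide where an analysis request runs."""
    return "emr" if n_records > BIG_JOB else "in-memory"

def mapper(record):
    # key by gene so all signals for one gene reach the same reducer
    yield record["gene"], record["signal"]

def reducer(gene, signals):
    return gene, sum(signals) / len(signals)

def run_local(records):
    """Local simulation of the map -> shuffle -> reduce phases."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    return dict(reducer(k, v) for k, v in shuffled.items())

results = run_local([
    {"gene": "BRCA1", "signal": 2.0},
    {"gene": "BRCA1", "signal": 4.0},
    {"gene": "TP53", "signal": 1.0},
])
```

Because each gene's records reduce independently, the same mapper/reducer pair scales out across an EMR cluster, reading input from and writing output to S3 as in point 4.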
Architecture with EMR for real time analysis
[Architecture diagram: User/Client → Internet → Route 53 (DNS routing) → CloudFront (CDN) → Load Balancers → auto-scaled Web Servers → Load Balancers → auto-scaled App Servers → ElastiCache + DynamoDB + Amazon S3 + EMR]
Iteration 4
[Diagram: a project's data – Instrument Runs (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions); item sizes range from MBs to GBs]

              Storage   Query   Analysis
Performance      ✔        ✔        ✔
Cost             ✔        ✔        ✔

Get Insights
Performance for a project
[Chart: analysis time in minutes vs. # of records (90,000 to 900,000) – the cloud takes roughly 2 to 30 minutes, >10x faster than the desktop, which crashes on the largest data sets]
Journey
0 start with reference architecture ✓
1 identify scalable storage solution : DynamoDB ✓
2 identify scalable storage solution for large data items : DynamoDB + Amazon S3 ✓
3 identify solutions for fast real time response & queries : DynamoDB + Amazon S3 + ElastiCache ✓
4 identify solutions for real time analysis of data : Amazon EMR ✓
Learnings
About me : Shakila Pothini
Senior Manager, Cloud Apps
Life Sciences Group,
Thermo Fisher Scientific
Hiking is my ONLY stress buster
Entertain to Educate. Cofounder of a performing arts group (swaram.org)
Mostly left-brained with an occasional sense of creativity
How to get into your genes?
1. sequence the entire human transcriptome (30,000 genes)
2. identify significant genes (100+ genes)
3. validate & reconfirm (20+ genes)
4. do it on more samples & different populations
5. find the way the genes interplay in the pathway
6. understand cancer diversity, types of therapy, drug-able genes
Demo
Demo summary
difference in expression of genes between a non-cancerous sample and a cancerous sample
Customer feedback
“My initial SymphoniSuite evaluation experience was good, GUI/controls are intuitive and data upload/analysis was fast and user friendly” – UPENN
“I enjoy processing hundreds of open array plates with ease.” “I appreciate the rapid access of the large number of amplification curves” – Sanofi
“I wanted to let you know that Symphoni has been working well for me. I have done analysis using as high as 500 files.” – ASU
“This I see value in... utilizing these features. I appreciate the speed of data processing and visuals.” – LUMC
Yearly checkup today
[Slide: routine checkup numbers – 165/105, 120, 50/90, 104]
Is this really going to detect
early stages of cancer?
A few years from now : every person
Sequence: ATGCATGCTATCAATTGCCC
melanoma | health risks | drug response
lifecloud – powered by AWS
Yearly check-ups a few years from now
[Slide: sequence fragments (ATGCATGC, ATTGCCC, TATCA, GCATG) assembled by lifecloud into the sequence ATGCATGCTATCAATTGCCC]
Yearly check-ups a few years from now (cont’d)
[Slide: sequence ATGCATGCTATCAATTGCCC analyzed by lifecloud – cancer? any clinical trial? health risks, drug response, prescribe the right drug]
Puneet Suri
Senior Director, Software Engineering
[email protected] T: 650.266.5857 @psuri
Shakila Pothini
Senior Manager, Cloud Applications
Salil Kumar
Cloud Architect
T: 650.740.1646 @salilkum
[Slide: AWS big data pipeline – Collect/Ingest (Kinesis) → Store (S3, DynamoDB, RDS, Glacier) → Process/Analyze (EMR, EC2, Redshift, Data Pipeline) → Visualize/Report]
[Slide: from Data to Answers – Experiments 1-3, each with Data Access and Compute Time, all checked ✔]
[Slide: EMR cluster – EC2 instances, data, temperature]
Please give us your feedback on this session.
Complete session evaluations and earn re:Invent swag.
http://bit.ly/awsevals
Puneet Suri
Senior Director, Software Engineering
T: 650.266.5857 @psuri
Shakila Pothini
Senior Manager, Cloud Applications
T: 650.554.2190
Salil Kumar
Cloud Architect
T: 650.740.1646 @salilkum