HLS402: Getting into Your Genes: The Definitive Guide to Using Amazon EMR, Amazon ElastiCache, and Amazon S3 to Deliver High-Performance Scientific Applications
DESCRIPTION
The key to fighting cancer through better therapeutics is a deep understanding of the basic biology of this disease at a cellular and molecular level. Comprehensive analysis of cancer mutations in specific tumors or cancer cell lines using Life Technologies sequencing and real-time PCR systems generates gigabytes to terabytes of data every day. Our customers bring together this data in studies that seek to discover the genetic fingerprint of cancer. The data typically translates to millions of records in databases that require complex algorithmic processing, cross-application analysis, and interactive visualizations with real-time response (2-3 seconds) to enable users to consume large volumes of complex scientific information. We have chosen the AWS platform to bring this new era of data analysis power to our customers by using technologies such as Amazon S3, ElastiCache, and DynamoDB for storage and fast access, and Amazon EMR for parallelizing complex computations. Our talk tells the story with rich details about challenges and roadblocks in building data-intensive, highly interactive applications in the cloud. We also highlight enhanced customer workflows and highly optimized applications with orders-of-magnitude improvements in performance and scalability.
TRANSCRIPT
Puneet Suri, Thermo Fisher Scientific
Shakila Pothini, Thermo Fisher Scientific
Sami Zuhuruddin, Amazon Web Services
November 12, 2014 | Las Vegas, NV
HLS402
Getting into Your Genes: The Definitive Guide to Using Amazon EMR, Amazon ElastiCache, and Amazon S3 to Deliver High-Performance Scientific Applications
About me
Puneet Suri
Senior Director, Software Engineering
Life Sciences Group, Thermo Fisher Scientific
follow at: @psuri connect at: [email protected]
Envisioned and developed the life sciences cloud platform for Thermo Fisher Scientific
This is why we are here…
Having an impact…
Freeing the innocent
A person was set free after 35 years in prison because of a DNA test
Surviving Cancer
A person survived pancreatic cancer thanks to a genetic approach that allowed an oncologist to focus on a specific cancer cell
Ebola
H1N1: Pandemic declared in April 2009
Need to enable this at larger scale & impact more lives
Customer needs…
store & manage large scientific data sets
A few years back
Our offerings
desktop applications
challenges with upgrade cycles, versions, etc.
limited storage and compute capacity to analyze complex & large data sets
no sharing & collaboration
no backup, archive & security
A better way… is to provide
STORAGE
COMPUTE
SCALABILITY
MEMORY
Our vision
A deep dive into our story
A day with the scientist
Get Insights
a project
Insights…
• what is causing cancer
• what drugs will work
• is therapy working
Customer pain points
• existing solutions cannot address the complexities
• excel is used painfully to manually analyze data
• multiple tools used to get the final insight
• it takes days to analyze the data
• some of the analysis workflows are not possible
Dimensions of complexity…
• millions of records per project
• thousands of users & projects
• real-time analysis of large datasets
• 2-3 seconds response time
storage | compute | performance | scalability
Our journey enabling complex customer workflows
Our iterative journey & challenges
0 start with reference architecture
1 identify scalable storage solution
2 identify scalable storage solution for large data items
3 identify solutions for real time response & queries
4 identify solutions for real time analysis of data
Reference web architecture
[Architecture diagram: User/Client → Internet → Route 53 (DNS routing) → CloudFront (CDN) → Load Balancers → auto-scaled Web Servers → Load Balancers → auto-scaled App Servers → Amazon RDS Master with synchronous replication to Amazon RDS Standby]
Why relational DB was not considered
• based on projected data and user growth over the years (hundreds of TBs), the required real-time query performance would be very hard to achieve
• needed managed scalability without sharding/re-sharding overhead and disruptions
• needed a loose schema to seamlessly enable new and cross-domain workflows
Our iterative journey & challenges
0 start with reference architecture
1 identify scalable storage solution
2 identify scalable storage solution for large data items
3 identify solutions for real time response & queries
4 identify solutions for real time analysis of data
NoSQL was the way to go
• managed scalability
• near zero administration overhead
• query performance not impacted by table size – can add billions of rows
• simple and flexible schema – new domains can be supported
• extremely fast read/write performance
Architecture with DynamoDB
[Architecture diagram: User/Client → Internet → Route 53 (DNS routing) → CloudFront (CDN) → Load Balancers → auto-scaled Web Servers → Load Balancers → auto-scaled App Servers → Amazon DynamoDB (Auto Scaling)]
What worked well with DynamoDB
• managed service with flexible schema
• managed scalability
• extremely fast READ/WRITE access, on the order of milliseconds
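The flexible-schema point above can be sketched as a single-table key design. The key names (`pk`/`sk`) and entity layout here are illustrative assumptions, not the talk's actual schema:

```python
# Hypothetical single-table layout: partition key = project id, sort key =
# "<entity>#<id>". Heterogeneous entities (runs, genes, signals) can share one
# table because DynamoDB does not enforce a fixed column set per item.
def make_item(project_id, entity, entity_id, **attrs):
    item = {"pk": project_id, "sk": f"{entity}#{entity_id}"}
    item.update(attrs)  # arbitrary per-entity attributes (flexible schema)
    return item

run = make_item("proj-1", "InstrumentRun", "run-42", instrument="qPCR", wells=384)
gene = make_item("proj-1", "Gene", "BRCA1", chromosome="17")
# With boto3 this would be table.put_item(Item=run); a Query on pk="proj-1"
# with sk beginning with "Gene#" would then return only the gene records.
```

This is how a new domain (a new entity type) can be supported without any schema migration: new items simply carry different attributes.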
Iteration 1
[Diagram: a project's data – Instrument Runs (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions); item sizes range from MBs to GBs]

              Storage   Query
Performance      ✔        ✔
Cost             ✔        ✔

Get Insights
What were the gaps
our item attribute (e.g. Instrument Run) size range > 400KB (item attribute size limitation: 64KB at the time, since raised to 400KB)
hot hash key & batch size limitations
• adding thousands of related records (e.g. Raw Signals) with a common hash key (e.g. Instrument Run) can be slow (10s of seconds)
• a large project can have ~1 million records (e.g. Raw Signals) that need to be read & written
for a large project, high read/write capacity (1000s of units) was needed (increased cost due to high READ/WRITE capacity needs)
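The batch-size limitation above is concrete: BatchWriteItem accepts at most 25 items per call, so a million Raw Signal records means tens of thousands of round trips. A minimal sketch of the chunking this forces, plus write-sharding as one common mitigation for the hot hash key (an assumed technique here, not necessarily what the team shipped):

```python
import random

BATCH_LIMIT = 25  # DynamoDB BatchWriteItem accepts at most 25 items per request

def batches(items, size=BATCH_LIMIT):
    """Split a record list into BatchWriteItem-sized chunks."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def sharded_key(run_id, n_shards=10):
    """Spread writes for one Instrument Run across n hash keys to avoid a hot key."""
    return f"{run_id}#{random.randrange(n_shards)}"

signals = list(range(1_000_000))  # stand-in for 1M Raw Signal records
n_round_trips = len(batches(signals))  # 40,000 calls for one large project
```

Each of those 40,000 calls also consumes provisioned write capacity, which is exactly the cost pressure the slide describes.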
What we needed
A solution that
• can store a huge number of related objects
• is cost effective for reading/writing large data sets
• has no limitations on batch size or item size
• allows querying into the large number of records
Our iterative journey & challenges
0 start with reference architecture
1 identify scalable storage solution : DynamoDB
2 identify scalable storage solution for large data items
3 identify solutions for real time response & queries
4 identify solutions for real time analysis of data
Architecture with DynamoDB & S3
[Architecture diagram: User/Client → Internet → Route 53 (DNS routing) → CloudFront (CDN) → Load Balancers → auto-scaled Web Servers → Load Balancers → auto-scaled App Servers → DynamoDB + Amazon S3]
Iteration 2
[Diagram: a project's data – Instrument Runs (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions); item sizes range from MBs to GBs]

              Storage   Query
Performance      ✔        ✔
Cost             ✔        ✔

Get Insights
Architecture with DynamoDB & S3
• DynamoDB was used to store small unrelated objects (KBs) that will grow to a large number (e.g. Data Files)
• Amazon S3 was used to store related larger objects (e.g. Raw Signals & Analysis Results (GBs)), each stored as a single Amazon S3 object serialized using Google protobuf
• Amazon S3 was cost effective for storing huge objects
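The serialize-to-one-object idea can be sketched as follows. The talk used Google protobuf; JSON + gzip stands in here only to keep the sketch dependency-free:

```python
import gzip
import json

def pack(records):
    """Serialize a list of related records (e.g. Raw Signals) into one blob,
    suitable for a single S3 PutObject instead of millions of DynamoDB items."""
    return gzip.compress(json.dumps(records).encode("utf-8"))

def unpack(blob):
    """Inverse of pack: recover the original record list from the S3 blob."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))

signals = [{"well": i, "ct": 21.5 + i * 0.01} for i in range(1000)]
blob = pack(signals)
# With boto3: s3.put_object(Bucket=..., Key="proj-1/raw-signals", Body=blob)
```

One object per related set turns millions of per-record writes into a single S3 PUT, which is where the cost savings over provisioned DynamoDB capacity come from.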
Real time queries for complex visualizations
What we needed
• complex visualizations require gigabytes of data to be queried in 2-3 secs and presented to the user
• visualizations are very interactive, requiring constant updates of data and quick reads & writes
• support concurrent access without any degradation in query performance
Our iterative journey & challenges
0 start with reference architecture
1 identify scalable storage solution : DynamoDB
2 identify scalable storage solution for large data items : DynamoDB + Amazon S3
3 identify solutions for fast real time response & queries
4 identify solutions for real time analysis of data
Distributed in-memory storage was the way to go
read/writes have to be quick to enable fast response times; reading & writing from Amazon S3 was not ideal.
• ElastiCache was used as IN-MEMORY storage on top of DynamoDB & Amazon S3
• all related serialized objects in Amazon S3 accessed by customers are maintained in ElastiCache as individual records
• indexes created in DynamoDB based on the query patterns so that data can be easily retrieved from ElastiCache
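The layering described above is the classic cache-aside (read-through) pattern. A minimal stand-in, where a plain dict plays the ElastiCache client and `fetch_from_s3` is a hypothetical callback for the slow S3 path:

```python
class CacheAside:
    """Read-through cache: try the in-memory store first, fall back to the
    backing store, then populate the cache so the next query is a memory hit."""

    def __init__(self, cache, fetch_from_s3):
        self.cache = cache                # stand-in for an ElastiCache client
        self.fetch_from_s3 = fetch_from_s3
        self.misses = 0

    def get(self, key):
        value = self.cache.get(key)
        if value is None:                 # miss: slow path via S3 + deserialize
            self.misses += 1
            value = self.fetch_from_s3(key)
            self.cache[key] = value       # warm the cache for subsequent reads
        return value

store = CacheAside({}, lambda key: f"deserialized:{key}")
store.get("proj-1/raw-signals")   # first access goes to the backing store
store.get("proj-1/raw-signals")   # second access is served from memory
```

The DynamoDB indexes mentioned above would supply the cache keys: a query pattern resolves to a set of keys, and each key is looked up through this path.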
Architecture with DynamoDB, Amazon S3 & ElastiCache
[Architecture diagram: User/Client → Internet → Route 53 (DNS routing) → CloudFront (CDN) → Load Balancers → auto-scaled Web Servers → Load Balancers → auto-scaled App Servers → ElastiCache + DynamoDB + Amazon S3]
Iteration 3
[Diagram: a project's data – Instrument Runs (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions); item sizes range from MBs to GBs; indexes in DynamoDB]

              Storage   Query
Performance      ✔        ✔
Cost             ✔        ✔

Get Insights
Need for real time data analysis
• analyze huge projects containing thousands of patient samples in
minutes instead of days
• a scalable solution is required to support analysis requests from
thousands of users
• existing desktop algorithms used for this analysis were not optimized for extracting parallelism in the data
Analysis solutions on the desktop
[Chart: analysis time in minutes vs. # of records (90,000 to 900,000) – desktop run time climbs from ~20 to 320+ minutes, and the desktop crashes on the largest data sets]
Get Insights
Excel nightmare
Our iterative journey & challenges
0 start with reference architecture
1 identify scalable storage solution : DynamoDB
2 identify scalable storage solution for large data items : DynamoDB + Amazon S3
3 identify solutions for fast real time response & queries : DynamoDB + Amazon S3 + ElastiCache
4 identify solutions for real time analysis of data
Amazon EMR was the way to go
1. EMR was used to perform real-time analysis of huge data sets – results in minutes instead of days
2. all small jobs are analyzed in-memory while big ones are sent to Amazon EMR
3. existing algorithms were overhauled to derive massive parallelism using the Hadoop map-reduce framework
4. as the large datasets were already in Amazon S3, Amazon S3 was used for input and output instead of HDFS – only intermediate map-reduce data lives in HDFS
5. the Amazon EMR cluster is created on-demand and shut down when done
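Points 2-3 can be sketched as a local stand-in for the Hadoop job: a mapper keys each raw signal by gene so the shuffle groups them, and a reducer aggregates per gene. The averaging aggregate and the job-size cutoff are illustrative assumptions, not the talk's actual algorithm:

```python
from collections import defaultdict

BIG_JOB = 100_000  # assumed cutoff: bigger jobs go to EMR, smaller run in-memory

def route(n_records):
    """Point 2: decide where an analysis request runs."""
    return "emr" if n_records > BIG_JOB else "in-memory"

def mapper(record):
    # key by gene so all signals for one gene reach the same reducer
    yield record["gene"], record["signal"]

def reducer(gene, signals):
    return gene, sum(signals) / len(signals)

def run_local(records):
    """Local simulation of the map -> shuffle -> reduce phases."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    return dict(reducer(k, v) for k, v in shuffled.items())

results = run_local([
    {"gene": "BRCA1", "signal": 2.0},
    {"gene": "BRCA1", "signal": 4.0},
    {"gene": "TP53", "signal": 1.0},
])
```

Because each gene's records reduce independently, the same mapper/reducer pair scales out across an EMR cluster, reading input from and writing output to S3 as in point 4.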
Architecture with EMR for real time analysis
[Architecture diagram: User/Client → Internet → Route 53 (DNS routing) → CloudFront (CDN) → Load Balancers → auto-scaled Web Servers → Load Balancers → auto-scaled App Servers → ElastiCache + DynamoDB + Amazon S3 + EMR]
Iteration 4
[Diagram: a project's data – Instrument Runs (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions); item sizes range from MBs to GBs]

              Storage   Query   Analysis
Performance      ✔        ✔        ✔
Cost             ✔        ✔        ✔

Get Insights
Performance for a project
[Chart: analysis time in minutes vs. # of records (90,000 to 900,000) – the cloud takes roughly 2 to 30 minutes, >10x faster than the desktop, which crashes on the largest data sets]
Journey
0 start with reference architecture ✓
1 identify scalable storage solution : DynamoDB ✓
2 identify scalable storage solution for large data items : DynamoDB + Amazon S3 ✓
3 identify solutions for fast real time response & queries : DynamoDB + Amazon S3 + ElastiCache ✓
4 identify solutions for real time analysis of data : Amazon EMR ✓
Learnings
About me : Shakila Pothini
Senior Manager, Cloud Apps
Life Sciences Group,
Thermo Fisher Scientific
Hiking is my ONLY stress buster
Entertain to Educate. Cofounder of a performing arts group (swaram.org)
Mostly left-brained with an occasional sense of creativity
How to get into your genes?
1. sequence the entire human transcriptome (30,000 genes)
2. identify significant genes (100+ genes)
3. validate & reconfirm (20+ genes)
4. do it on more samples & different populations
5. find the way the genes interplay in the pathway
6. understand cancer diversity, types of therapy, drug-able genes
Demo
Demo summary
difference in expression of genes between a non-cancerous sample and a cancerous sample
Customer feedback
“My initial SymphoniSuite evaluation experience was good, GUI/controls are intuitive and data upload/analysis was fast and user friendly” – UPENN
“I enjoy processing hundreds of open array plates with ease.” “I appreciate the rapid access of the large number of amplification curves” – Sanofi
“I wanted to let you know that Symphoni has been working well for me. I have done analysis using as high as 500 files.” – ASU
“This I see value in... utilizing these features. I appreciate the speed of data processing and visuals.” – LUMC
Yearly checkup today
[Slide: routine checkup numbers – 165/105, 120, 50/90, 104]
Is this really going to detect
early stages of cancer?
A few years from now : every person
Sequence: ATGCATGCTATCAATTGCCC
melanoma | health risks | drug response
lifecloud – powered by AWS
Yearly check-ups a few years from now
[Slide: sequence fragments (ATGCATGC, ATTGCCC, TATCA, GCATG) assembled by lifecloud into the sequence ATGCATGCTATCAATTGCCC]
Yearly check-ups a few years from now (cont’d)
[Slide: sequence ATGCATGCTATCAATTGCCC analyzed by lifecloud – cancer? any clinical trial? health risks, drug response, prescribe the right drug]
Puneet Suri
Senior Director, Software Engineering
[email protected] T: 650.266.5857 @psuri
Shakila Pothini
Senior Manager, Cloud Applications
Salil Kumar
Cloud Architect
T: 650.740.1646 @salilkum
[Slide: AWS big data pipeline – Collect/Ingest (Kinesis) → Store (S3, DynamoDB, RDS, Glacier) → Process/Analyze (EMR, EC2, Redshift, Data Pipeline) → Visualize/Report]
[Slide: from Data to Answers – Experiments 1-3, each with Data Access and Compute Time, all checked ✔]
[Slide: EMR cluster – EC2 instances, data, temperature]
Please give us your feedback on this session.
Complete session evaluations and earn re:Invent swag.
http://bit.ly/awsevals
Puneet Suri
Senior Director, Software Engineering
T: 650.266.5857 @psuri
Shakila Pothini
Senior Manager, Cloud Applications
T: 650.554.2190
Salil Kumar
Cloud Architect
T: 650.740.1646 @salilkum