(HLS402) Getting into Your Genes: The Definitive Guide to Using Amazon EMR, Amazon ElastiCache, and Amazon S3 to Deliver High-Performance Scientific Applications

Puneet Suri, Thermo Fisher Scientific

Shakila Pothini, Thermo Fisher Scientific

Sami Zuhuruddin, Amazon Web Services

November 12, 2014 | Las Vegas, NV

DESCRIPTION

The key to fighting cancer through better therapeutics is a deep understanding of the basic biology of this disease at a cellular and molecular level. Comprehensive analysis of cancer mutations in specific tumors or cancer cell lines by using Life Technologies sequencing and real-time PCR systems generates gigabytes to terabytes of data every day. Our customers bring together this data in studies that seek to discover the genetic fingerprint of cancer. The data typically translates to millions of records in databases that require complex algorithmic processing, cross-application analysis, and interactive visualizations with real-time response (2-3 seconds) to enable users to consume large volumes of complex scientific information. We have chosen the AWS platform to bring this new era of data analysis power to our customers by using technologies such as Amazon S3, ElastiCache, and DynamoDB for storage and fast access and Amazon EMR for parallelizing complex computations. Our talk tells the story with rich details about challenges and roadblocks in building data-intense, highly interactive applications in the cloud. We also highlight enhanced customer workflows and highly optimized applications with orders of magnitude improvement in performance and scalability.

TRANSCRIPT

Page 1: (HLS402) Getting into Your Genes: The Definitive Guide to Using Amazon EMR, Amazon ElastiCache, and Amazon S3 to Deliver High-Performance, Scientific Applications | AWS re:Invent 2014

Puneet Suri, Thermo Fisher Scientific

Shakila Pothini, Thermo Fisher Scientific

Sami Zuhuruddin, Amazon Web Services

November 12, 2014 | Las Vegas, NV

HLS402

Getting into Your Genes: The Definitive Guide to Using Amazon EMR, Amazon ElastiCache, and Amazon S3 to Deliver High-Performance Scientific Applications

Page 2:

About me

Puneet Suri

Senior Director, Software Engineering

Life Sciences Group, Thermo Fisher Scientific

follow at: @psuri connect at: [email protected]

Envisioned and developed the life sciences cloud platform for Thermo Fisher Scientific

Page 3:

This is why we are here…

Page 4:

Having an impact…

Freeing the innocent: a person was set free after 35 years in prison because of a DNA test

Surviving cancer: a person survived pancreatic cancer thanks to a genetic approach that allowed an oncologist to focus on a specific cancer cell

Ebola

Page 5:

H1N1: Pandemic declared in April 2009

Page 6:

Need to enable this at larger scale & impact more lives

Page 7:

Customer needs…

store & manage large scientific data sets

Page 8:

A few years back

Page 9:

Our offerings

desktop applications

challenges with upgrade cycle, versions etc.

limited storage and compute capacity to analyze complex & large data sets

no sharing & collaboration

no backup, archive & security

Page 10:

A better way… is to provide

STORAGE

COMPUTE

SCALABILITY

MEMORY

Page 11:

Our vision

Page 12:

A deep dive into our story

Page 13:

A day with the scientist

Get Insights

a project


Page 14:

Insights…

• what is causing cancer

• what drugs will work

• is therapy working

Page 15:

Customer pain points

• existing solutions cannot address the complexities

• Excel is used, painfully, to analyze data manually

• multiple tools are needed to reach the final insight

• it takes days to analyze the data

• some analysis workflows are not possible

Page 16:

Dimensions of complexity…

• millions of records

• thousands of users, projects

• real time analysis of large datasets

• 2-3 seconds response time

project: storage, compute, performance, scalability

Page 17:

Our journey enabling complex customer workflows

Page 18:

Our iterative journey & challenges

0 start with reference architecture

1 identify scalable storage solution

2 identify scalable storage solution for large data items

3 identify solutions for real time response & queries

4 identify solutions for real time analysis of data

Page 19:

Reference web architecture

[Diagram: reference web architecture drawn as two groups labeled A and B. Components: user client, Internet, DNS routing (Route 53), CDN (CloudFront), load balancers, auto-scaled web servers, auto-scaled app servers, Amazon RDS master, and Amazon RDS standby kept in sync by synchronous replication.]

Page 20:

Why relational DB was not considered

• based on projected data and user growth over the years (hundreds of TBs), the required real-time query performance would be very hard to achieve

• needed managed scalability without sharding/re-sharding overhead and disruptions

• needed a loose schema to seamlessly enable new and cross-domain workflows
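
To make the re-sharding concern concrete, here is a minimal sketch. It assumes a naive hash-modulo placement scheme, which is an illustration only; the talk does not say which sharding scheme was evaluated. The key names are invented. The sketch counts how many keys land on a different shard when a single shard is added.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard with a stable hash (sha256 keeps runs repeatable)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count)
    )
    return moved / len(keys)


# Hypothetical record keys, invented for this example.
keys = [f"sample-{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_count=4, new_count=5)
print(f"{fraction:.0%} of keys move when going from 4 to 5 shards")
```

With modulo placement and a uniform hash, about 80% of keys are expected to move in this case, which is the kind of data shuffling and disruption a managed store avoids pushing onto the application team.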

Page 21:

Our iterative journey & challenges

0 start with reference architecture

1 identify scalable storage solution

2 identify scalable storage solution for large data items

3 identify solutions for real time response & queries

4 identify solutions for real time analysis of data

Page 22:

NoSQL was the way to go

• managed scalability

• near zero administration overhead

• query performance not impacted by table size; can add billions of rows

• simple and flexible schema – new domains can be supported

• extremely fast read/write performance

Page 23:

Architecture with DynamoDB

[Diagram: web architecture drawn as two groups labeled A and B. Components: user client, Internet, DNS routing (Route 53), CDN (CloudFront), load balancers, auto-scaled web servers, auto-scaled app servers, and Amazon DynamoDB.]

Page 24:

What worked well with DynamoDB

Managed Service with flexible schema

Managed Scalability

Extremely fast READ/WRITE access, on the order of milliseconds

Page 25:

Iteration 1

[Diagram: data in a project: Instrument Run (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions); size labels: GBs, GBs, MBs, MBs; workflow label: Get Insights.]

             Storage  Query
Performance  ✔        ✔
Cost         ✔        ✔

Page 26:

What were the gaps

• our item attribute (e.g. Instrument Run) size range > 400KB (item size limit of 400KB, raised from 64KB)

• hot hash key & batch size limitations

  • adding thousands of related records (e.g. Raw Signals) with a common hash key (e.g. Instrument Run) can be slow (tens of seconds)

  • a large project can have ~1 million records (e.g. Raw Signals) that need to be read & written

• for a large project, high read/write capacity (1000s) was needed (increased cost due to high READ/WRITE capacity needs)
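
The arithmetic behind these gaps can be sketched as follows. This is a back-of-the-envelope estimate, not the team's measurement. It relies on documented DynamoDB rules (at most 25 items per BatchWriteItem call, one write capacity unit per 1 KB written per second, a 400KB item limit). The 1 KB record size and the 1000 provisioned write units are assumed values chosen to match the "~1 million records" and "1000s" figures on the slide.

```python
import math

BATCH_WRITE_MAX_ITEMS = 25   # documented per-call limit of DynamoDB BatchWriteItem
ITEM_SIZE_LIMIT_KB = 400     # documented DynamoDB item size limit


def chunk(records, size=BATCH_WRITE_MAX_ITEMS):
    """Yield successive batches no larger than the BatchWriteItem limit."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def write_units(item_kb: float) -> int:
    """Write capacity units for one write: 1 unit per 1 KB, rounded up."""
    return max(1, math.ceil(item_kb))


def estimate_load(record_count: int, item_kb: float, provisioned_wcu: int):
    """Return (batch_calls, seconds) to write all records at full provisioned throughput."""
    if item_kb > ITEM_SIZE_LIMIT_KB:
        raise ValueError("item exceeds the DynamoDB item size limit")
    batch_calls = math.ceil(record_count / BATCH_WRITE_MAX_ITEMS)
    seconds = record_count * write_units(item_kb) / provisioned_wcu
    return batch_calls, seconds


# Assumed inputs: 1 million records of about 1 KB each, 1000 write units.
calls, seconds = estimate_load(record_count=1_000_000, item_kb=1.0, provisioned_wcu=1000)
print(calls, "batch calls,", round(seconds / 60, 1), "minutes at 1000 write units")
```

Under these assumptions a single large project needs 40,000 batch calls and roughly 17 minutes of sustained writes. The estimate also assumes writes spread evenly across partitions; records sharing one hash key concentrate on a single partition, so real throughput would likely be lower.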

Page 27:

What we needed

A solution that

• can store a huge number of related objects

• is cost effective for reading/writing large data sets

• has no limitations on batch size or item size

• can query into a large number of records

Page 28:

Our iterative journey & challenges

0 start with reference architecture

1 identify scalable storage solution : DynamoDB

2 identify scalable storage solution for large data items

3 identify solutions for real time response & queries

4 identify solutions for real time analysis of data

Page 29:

Architecture with DynamoDB & S3

[Diagram: web architecture drawn as two groups labeled A and B. Components: user client, Internet, DNS routing (Route 53), CDN (CloudFront), load balancers, auto-scaled web servers, auto-scaled app servers, DynamoDB, and Amazon S3.]

Page 30:

Iteration 2

[Diagram: data in a project: Instrument Run (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions); size labels: MBs, MBs, GBs, GBs; workflow label: Get Insights.]

             Storage  Query
Performance  ✔        ✔
Cost         ✔        ✔

Page 31:

Architecture with DynamoDB & S3

• DynamoDB was used to store small unrelated objects (KB)

  • these will grow to a large number (e.g. Data Files)

• Amazon S3 was used to store related larger objects (e.g. Raw Signals & Analysis Results (GB))

  • stored as a single Amazon S3 object serialized using Google Protocol Buffers (protobuf)

• Amazon S3 was cost effective for storing huge objects
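
The split described above can be sketched as a two-tier store. This is a minimal illustration under stated assumptions, not the team's implementation. Plain dicts stand in for DynamoDB and Amazon S3 so the code runs without AWS access. JSON stands in for protobuf serialization. The class name, key layout, and the idea of keeping a small pointer item next to each blob are all hypothetical.

```python
import json

DYNAMODB_ITEM_LIMIT_BYTES = 400 * 1024  # documented DynamoDB item size limit


class TieredStore:
    """Routes small records to an item store and related sets to a blob store."""

    def __init__(self):
        self.items = {}   # stand-in for a DynamoDB table
        self.blobs = {}   # stand-in for an Amazon S3 bucket

    def put_small(self, key: str, record: dict) -> None:
        """Store one small, unrelated record as an item."""
        size = len(json.dumps(record).encode("utf-8"))
        if size > DYNAMODB_ITEM_LIMIT_BYTES:
            raise ValueError("record too large for the item store")
        self.items[key] = record

    def put_related(self, parent_key: str, records: list) -> str:
        """Serialize all related records into one blob and keep a pointer item."""
        blob_key = f"projects/{parent_key}/raw_signals.bin"  # hypothetical layout
        self.blobs[blob_key] = json.dumps(records).encode("utf-8")
        self.items[parent_key] = {"blob_key": blob_key, "count": len(records)}
        return blob_key

    def get_related(self, parent_key: str) -> list:
        """Follow the pointer item and deserialize the whole blob."""
        pointer = self.items[parent_key]
        return json.loads(self.blobs[pointer["blob_key"]].decode("utf-8"))


store = TieredStore()
store.put_small("file-1", {"name": "plate-01"})  # invented sample data
signals = [{"well": i, "cycle": 1, "value": 0.5 * i} for i in range(1000)]
store.put_related("run-42", signals)
restored = store.get_related("run-42")
```

Writing the related records as one object sidesteps both the per-item size limit and the per-batch limit, at the cost of always reading and writing the set as a whole.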

Page 32:

Real time queries for complex visualizations

Page 33:

What we needed

• complex visualizations require gigabytes of data to be queried in 2-3 secs and presented to the user

• visualizations are highly interactive and require constant data updates, so reads & writes must be quick

• support concurrent access without any degradation in query performance

Page 34:

Our iterative journey & challenges

0 start with reference architecture

1 identify scalable storage solution : DynamoDB

2 identify scalable storage solution for large data items : DynamoDB + Amazon S3

3 identify solutions for fast real time response & queries

4 identify solutions for real time analysis of data

Page 35:

Distributed in-memory storage was the way to go

• reads/writes have to be quick to enable fast response times; reading & writing from Amazon S3 was not ideal

• ElastiCache was used as in-memory storage on top of DynamoDB & Amazon S3

• all related serialized objects in Amazon S3 accessed by customers are maintained in ElastiCache as individual records

• indexes are created in DynamoDB based on the query pattern so that data can be easily retrieved from ElastiCache
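
One way to picture this layering is the cache-aside sketch below. It is an illustration under assumptions, not the production design. Dicts stand in for Amazon S3, ElastiCache, and the DynamoDB index table. JSON stands in for the serialized format. The class name, the key format, the choice of "gene" as the indexed attribute, and the sample values are all hypothetical.

```python
import json
from collections import defaultdict


class CachedProject:
    """Cache-aside sketch over a blob store, a record cache, and an index."""

    def __init__(self, blob_store: dict):
        self.blob_store = blob_store      # stand-in for Amazon S3
        self.cache = {}                   # stand-in for ElastiCache
        self.index = defaultdict(list)    # stand-in for a DynamoDB index table
        self.loaded = set()

    def _warm(self, blob_key: str) -> None:
        """Explode one serialized blob into individual cache records and index them."""
        if blob_key in self.loaded:
            return
        for n, record in enumerate(json.loads(self.blob_store[blob_key])):
            cache_key = f"{blob_key}#{n}"
            self.cache[cache_key] = record
            # Index by the attribute the query pattern filters on (gene, here).
            self.index[(blob_key, record["gene"])].append(cache_key)
        self.loaded.add(blob_key)

    def by_gene(self, blob_key: str, gene: str) -> list:
        """Answer a query from the index and the cache, loading the blob once."""
        self._warm(blob_key)
        return [self.cache[k] for k in self.index.get((blob_key, gene), [])]


# Invented sample data for the example.
blobs = {
    "run-42": json.dumps([
        {"gene": "TP53", "value": 1.2},
        {"gene": "KRAS", "value": 0.4},
        {"gene": "TP53", "value": 0.9},
    ])
}
project = CachedProject(blobs)
hits = project.by_gene("run-42", "TP53")
```

The blob is read and deserialized only on first access. Later queries touch just the index and the individual cached records, which is what keeps interactive reads away from Amazon S3.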

Page 36:

Architecture with DynamoDB, Amazon S3 & ElastiCache

[Diagram: web architecture drawn as two groups labeled A and B. Components: user client, Internet, DNS routing (Route 53), CDN (CloudFront), load balancers, auto-scaled web servers, auto-scaled app servers, DynamoDB, Amazon S3, and ElastiCache.]

Page 37:

Iteration 3

[Diagram: data in a project: Instrument Run (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions); size labels: MBs, MBs, GBs, GBs; additional label: indexes; workflow label: Get Insights.]

             Storage  Query
Performance  ✔        ✔
Cost         ✔        ✔

Page 38:

Need for real time data analysis

• analyze huge projects containing thousands of patient samples in minutes instead of days

• a scalable solution is required to support analysis requests from thousands of users

• existing desktop algorithms used for this analysis are not optimized for extracting parallelism in the data

Page 39:

Analysis solutions in desktop

[Chart: desktop analysis time in minutes (y-axis, 0 to 350) against # of records (x-axis: 90000, 180000, 270000, 360000, 450000, 675000, 900000); desktop data labels: 8, 20, 40, 80, 120, 200, 320; annotation: desktop crashes.]

Page 40:

Excel nightmare

Page 41:

Our iterative journey & challenges

0 start with reference architecture

1 identify scalable storage solution : DynamoDB

2 identify scalable storage solution for large data items : DynamoDB + Amazon S3

3 identify solutions for fast real time response & queries : DynamoDB + Amazon S3 + ElastiCache

4 identify solutions for real time analysis of data

Page 42:

Amazon EMR was the way to go

1. Amazon EMR was used to perform real time analysis of huge data sets – results in minutes instead of days

2. all small jobs are analyzed in-memory while big ones are sent to Amazon EMR

3. existing algorithms were overhauled to derive massive parallelism using the Hadoop MapReduce framework

4. as the large datasets were already in Amazon S3, Amazon S3 was used for input and output instead of HDFS – only intermediate map-reduce data is kept in HDFS

5. the Amazon EMR cluster is created on demand and shut down when done
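
Points 2 and 3 can be sketched in a few lines. This is a single-process illustration of the dispatch decision and the map, shuffle, and reduce phases. It is not Hadoop code and not the team's algorithm. The cut-over threshold, the function names, the per-gene averaging task, and the sample records are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain

EMR_THRESHOLD_RECORDS = 100_000  # hypothetical cut-over point, not from the talk


def choose_backend(record_count: int) -> str:
    """Small jobs run in memory; big ones go to a cluster."""
    return "emr" if record_count >= EMR_THRESHOLD_RECORDS else "in-memory"


def map_record(record):
    """Map step: emit (gene, value) for one raw signal record."""
    yield record["gene"], record["value"]


def reduce_gene(gene, values):
    """Reduce step: average all values seen for one gene."""
    values = list(values)
    return gene, sum(values) / len(values)


def run_map_reduce(records):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    shuffled = defaultdict(list)
    for gene, value in chain.from_iterable(map_record(r) for r in records):
        shuffled[gene].append(value)
    return dict(reduce_gene(g, vs) for g, vs in shuffled.items())


# Invented sample records for the example.
records = [
    {"gene": "TP53", "value": 2.0},
    {"gene": "TP53", "value": 4.0},
    {"gene": "KRAS", "value": 1.0},
]
result = run_map_reduce(records)
backend = choose_backend(len(records))
```

Because each map call looks at one record and each reduce call looks at one key, the same two functions can in principle be spread over many nodes, which is the parallelism the overhauled algorithms were after.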

Page 43:

Architecture with EMR for real time analysis

[Diagram: web architecture drawn as two groups labeled A and B. Components: user client, Internet, DNS routing (Route 53), CDN (CloudFront), load balancers, auto-scaled web servers, auto-scaled app servers, DynamoDB, Amazon S3, ElastiCache, and EMR.]

Page 44:

Iteration 4

[Diagram: data in a project: Instrument Run (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions); size labels: MBs, MBs, GBs, GBs; workflow label: Get Insights.]

             Storage  Query  Analysis
Performance  ✔        ✔      ✔
Cost         ✔        ✔      ✔

Page 45:

Performance for a project

[Chart: analysis time in minutes (y-axis, 0 to 350) against # of records (x-axis: 90000, 180000, 270000, 360000, 450000, 675000, 900000) for cloud and desktop; cloud data labels: 2, 4, 7, 11, 13, 20, 30; annotations: >10x, desktop crashes.]

Page 46:

Journey

0 start with reference architecture

1 identify scalable storage solution : DynamoDB

2 identify scalable storage solution for large data items : DynamoDB + Amazon S3

3 identify solutions for fast real time response & queries : DynamoDB + Amazon S3 + ElastiCache

4 identify solutions for real time analysis of data : Amazon EMR

Page 47:

Learnings

Page 48:

About me : Shakila Pothini

Senior Manager, Cloud Apps

Life Sciences Group, Thermo Fisher Scientific

• Hiking is my ONLY stress buster

• Entertain to Educate. Cofounder of performing arts group (swaram.org)

• Mostly left brained with occasional sense of creativity

Page 49:

How to get into your gene?

• sequence the entire human transcriptome (30,000 genes)

• identify significant genes (100+ genes)

• validate & reconfirm (20+ genes)

• do it on more samples & different populations

• find the way the genes interplay in the pathway

• understand cancer diversity, types of therapy, drug-able genes

Page 50:

Demo

Page 51:

Demo summary

non-cancerous sample, cancerous sample

difference in expression of genes

Page 52:

Customer feedback

“My initial SymphoniSuite evaluation experience was good, GUI/controls are intuitive and data upload/analysis was fast and user friendly”

UPENN

“I enjoy processing hundreds of open array plates with ease.”, “I appreciate the rapid access of the large number of amplification curves”

Sanofi

“I wanted to let you know that Symphoni has been working well for me. I have done analysis using as high as 500 files.”

ASU

“This I see value in... utilizing these features. I appreciate the speed of data processing and visuals.”

LUMC

Page 53:

Yearly checkup today

165 / 105

120 50 / 90

104

Page 54:

Is this really going to detect

early stages of cancer?

Page 55:

A few years from now : every person

ATGCATGCTATCAATTGCCC

Sequence

melanoma

health risks

drug response

powered by AWS

lifecloud

Page 56:

Yearly check-ups a few years from now

ATGCATGC ATTGCCC

ATGCATGC ATTGCCCTATCA

GCATG

lifecloud

ATGCATGCTATCAATTGCCC

Sequence

Page 57:

Yearly check-ups a few years from now (cont’d)

cancer

any clinical trial?

health risks

drug response

ATGCATGCTATCAATTGCCC

Sequence

lifecloud

prescribe the right drug

Page 58:

Puneet Suri

Senior Director, Software engineering

[email protected] T: 650.266.5857 @psuri

Shakila Pothini

Senior Manager, Cloud Applications

[email protected]

Salil Kumar

Cloud Architect

T: 650.740.1646 @salilkum

Page 59:
Page 60:
Page 61:

[Diagram: pipeline from Data to Answers. Collect / Ingest: Kinesis. Store: Glacier, S3, DynamoDB, RDS. Process / Analyze: EMR, EC2, Redshift, Data Pipeline. Visualize / Report.]

Page 62:

Experiment 1: Data Access, Compute Time ✔

Experiment 2: Data Access, Compute Time ✔

Experiment 3: Data Access, Compute Time ✔

Page 63:

EMR Cluster

EC2 Instance

Data Temperature

Page 64:

Please give us your feedback on this session.

Complete session evaluations and earn re:Invent swag.

http://bit.ly/awsevals

Puneet Suri

Senior Director, Software engineering

[email protected]

T: 650.266.5857 @psuri

Shakila Pothini

Senior Manager, Cloud Applications

[email protected]

T: 650.554.2190

Salil Kumar

Cloud Architect

T: 650.740.1646 @salilkum