Masterworks talk on big data and the implications of petascale science

Post on 13-Nov-2014


TRANSCRIPT

Big Data and Biology: The implications of petascale science
Deepak Singh

life science industry

Credit: Bosco Ho

By ~Prescott under a CC-BY-NC license

data

Image: Wikipedia

biology

big data

Source: http://www.nature.com/news/specials/bigdata/index.html

Image: Matt Wood

Human genome

Image: Matt Wood

not just sequencing

more data

Image: Matt Wood

all hell breaks loose

~100 TB/Week

~100 TB/Week

>2 PB/Year

years

weeks

days

days

days

minutes?

gigabytes

terabytes

petabytes

exabytes?

really fast

Image: http://www.broadinstitute.org/~apleite/photos.html

single lab

Image: Chris Dagdigian

implications of scale

data management

data processing

data sharing

fundamental concepts

1. architecting for scale

“Everything fails, all the time”-- Werner Vogels

“Things will crash. Deal with it”-- Jeff Dean

“Remember everything fails”-- Randy Shoup

fun with numbers

datacenter availability

Source: Uptime Institute

Tier I: 28.8 hours annual downtime (99.67% availability)
Tier II: 22.0 hrs annual downtime (99.75% availability)
Tier III: 1.6 hrs annual downtime (99.98% availability)
Tier IV: 0.8 hrs annual downtime (99.99% availability)

Source: Uptime Institute
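A quick sanity check on these figures: annual downtime follows directly from the availability percentage. Note that the slide's rounded Tier III/IV percentages reproduce the quoted hours only approximately (the Uptime Institute states them at finer precision).

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def annual_downtime_hours(availability_pct):
    """Convert an availability percentage into expected annual downtime."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for tier, pct in [("I", 99.67), ("II", 99.75), ("III", 99.98), ("IV", 99.99)]:
    print(f"Tier {tier}: {annual_downtime_hours(pct):.1f} h/yr")
```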

cooling systems go down

power units fail

2-4% of servers will die annually

Source: Jeff Dean, LADIS 2009

1-5% of disk drives will die every year

Source: Jeff Dean, LADIS 2009

2.3% AFR in population of 13,250
3.3% AFR in population of 22,400

4.2% AFR in population of 246,000

Source: James Hamilton

software breaks

human errors

human errors: ~20% of admin issues have unintended consequences

Source: James Hamilton

achieving scalability and availability

partitioning

redundancy

recovery oriented computing

Source: http://perspectives.mvdirona.com/, http://roc.cs.berkeley.edu/

assume sw/hw failure

design apps to be resilient

automation
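A minimal sketch of "design apps to be resilient": wrap every call that can fail in an automated retry with exponential backoff. The helper below is a hypothetical illustration, not something from the talk.

```python
import random
import time

def with_retries(operation, max_attempts=5, base_delay=0.1):
    """Run an operation that may fail transiently, retrying with
    exponential backoff and jitter -- the application, not the
    infrastructure, takes responsibility for surviving failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter avoids retry stampedes.
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Example: an operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient failure")
    return "ok"

print(with_retries(flaky))  # "ok" after two retries
```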

Your Custom Applications and Services, built on:

Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling
Storage: Amazon Simple Storage Service (S3), AWS Import/Export
Content Delivery: Amazon CloudFront
Messaging: Amazon Simple Queue Service (SQS)
Payments: Amazon Flexible Payments Service (FPS)
On-Demand Workforce: Amazon Mechanical Turk
Parallel Processing: Amazon Elastic MapReduce
Monitoring: Amazon CloudWatch
Management: AWS Management Console
Tools: AWS Toolkit for Eclipse
Isolated Networks: Amazon Virtual Private Cloud
Database: Amazon RDS and SimpleDB

Amazon S3

durable

available


Amazon EC2

highly scalable

3,000 CPUs for one firm's risk management application

[Chart: number of EC2 instances over one week in April 2008, growing from about 300 hosts on weekends to about 3,000 at peak]

highly available systems

dynamic

fault tolerant

US East Region

Availability Zone A

Availability Zone B

Availability Zone C

Availability Zone D

2. one size does not fit all

2. one size does not fit all^data

many data types

structured data

using the right data store

(a) feature first

RDBMS

Oracle, SQL Server, DB2, MySQL, Postgres

use a bigger computer

remove joins

scaling limits

(b) scale first

scale is highest priority

single RDBMS incapable

solution 1: data sharding

10’s

100’s
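Data sharding as described above can be sketched in a few lines: hash each key to one of tens or hundreds of independent databases. The hash function and shard count here are illustrative choices, not prescribed by the talk.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a record key to one of num_shards stores using a stable hash,
    so each shard holds roughly 1/num_shards of the data."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Tens to hundreds of shards, each standing in for an independent RDBMS.
NUM_SHARDS = 16
shards = [dict() for _ in range(NUM_SHARDS)]

def put(key, value):
    shards[shard_for(key, NUM_SHARDS)][key] = value

def get(key):
    return shards[shard_for(key, NUM_SHARDS)].get(key)

put("sample-42", {"organism": "H. sapiens"})
print(get("sample-42"))
```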

solution 2: scalable key-value store

scale is design point

MongoDB, Project Voldemort, Cassandra, HBase, BigTable, Amazon SimpleDB, Dynamo

(c) simple structured storage

simple, fast

low ops cost

BerkeleyDB, Tokyo Cabinet, Amazon SimpleDB

(d) purpose optimized stores

data warehousing, stream processing

Aster Data, Vertica, Netezza, Greenplum, VoltDB, StreamBase

what about files?

cluster file systems

Lustre, GlusterFS

distributed file systems

HDFS, GFS

distributed object store

Amazon S3, Dynomite


3. processing big data

disk reads/writes: slow & expensive

data processing: fast & cheap

distribute the data for parallel reads

data processing for the cloud

distributed file system (HDFS)

map/reduce

Via Cloudera under a Creative Commons License

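The map/reduce model the slides reference can be sketched in a few lines of plain Python. This is an illustrative single-machine toy, not Hadoop itself.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    """Apply the mapper to every input record, emitting (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def reduce_phase(pairs, reducer):
    """Group pairs by key and apply the reducer to each group."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, reducer(key, [v for _, v in group])

# Word count: the canonical map/reduce example.
def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    return sum(counts)

lines = ["big data", "big biology", "data data"]
print(dict(reduce_phase(map_phase(lines, mapper), reducer)))
# {'big': 2, 'biology': 1, 'data': 3}
```

In real Hadoop the sort-and-group step happens across machines (the "shuffle"), which is what makes the model scale to petabytes.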

fault tolerance

massive scalability

petabyte scale

hosted hadoop service

hadoop made easy and simple

Elastic MapReduce workflow: the input dataset is uploaded to an input S3 bucket; the application is deployed via the web console or command line tools; Elastic MapReduce runs Hadoop across a fleet of Amazon EC2 instances; output results land in an output S3 bucket; the user is notified and retrieves the results from S3.

back to the science

basic informatics workflow

Via Argonne National Labs under a CC-BY-SA license


killer app

getting the data

Register projects

Register samples

Sample prep

Sequencing

Analysis

These slides cover work presented by Matt Wood at various conferences

Image: Matt Wood

constant change

flexible data capture

virtual fields

no schema

specify at run time (bootstrapping)

Sample: Name, Organism, Concentration

Source: Matt Wood


key value pairs

change happens

V1 Sample: Name, Organism, Concentration

V2 Sample: Name, Organism, Concentration, Origin, Quality metric

Source: Matt Wood

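The schema-less, key-value capture idea can be sketched with plain dictionaries. Field names follow the slide; the values are invented for illustration.

```python
# Schema-less capture: each sample is a bag of key-value pairs, so new
# fields can be added without migrating a fixed schema.
v1_sample = {
    "Name": "S001",
    "Organism": "E. coli",
    "Concentration": "120 ng/ul",
}

# V2 adds fields; existing V1 records are untouched and remain valid.
v2_sample = dict(v1_sample, Origin="lab A", **{"Quality metric": 0.97})

def fields(sample):
    """List a record's fields -- they are data, not a fixed schema."""
    return sorted(sample)

print(fields(v1_sample))
print(fields(v2_sample))
```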

high throughput

lots of pipelines

scaling projects/pipelines?

lots of apps

loosely coupled

automation

scale operationally

be agile

now what?

Via Argonne National Labs under a CC-BY-SA license

many data types

changing data types

Shaq Image: Keith Allison under a CC-BY-SA license


?

lots and lots and lots and lots and lots and lots of data andlots and lots of lots of data

By bitterlysweet under a CC-BY-NC-ND license

Source: http://bit.ly/anderson-bigdata

Chris Anderson doesn’t understand science

“more is different”

few data points

elaborate models

the unreasonable effectiveness of data

Source: “The Unreasonable Effectiveness of Data”, Alon Halevy, Peter Norvig, and Fernando Pereira

simple modelslots of data

information platform

information platforms at scale

one organization

4 TB added daily (compressed)

135 TB scanned daily (compressed)

15 PB data total capacity

???

Facebook data from Ashish Thusoo’s HadoopWorld 2009 talk

not always that big

can we learn any lessons?

Source: “Information Platforms and the Rise of the Data Scientist”, Jeff Hammerbacher in Beautiful Data

analytics platform

Data warehouse

A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.

ETL

extract

transform

load

1 TB

MySQL --> Oracle

more data

more data types

changing data types

limit data warehouse

too limited

how do you scale and adapt?

100’s of TBs

1000’s of jobs

back to the science

back in the day

small data sets

flat files

a hierarchy of folders (folder1/ … folderN/), each holding flat files (file1, file2, … fileN)

shared file system

RDBMS

Image: Wikimedia Commons

Image: Chris Dagdigian

need to process

need to analyze

100’s of TBs

1000’s of jobs

Facebook data from Ashish Thusoo’s HadoopWorld 2009 talk

ETL

data mining & analytics

Via Argonne National Labs under a CC-BY-SA license

analysts are not programmers

not savvy with map/reduce

apache hive

http://hadoop.apache.org/hive/

manage & query data on top of Hadoop

work by @peteskomoroch

cascading

http://www.cascading.org/

apache pig

http://hadoop.apache.org/pig/


hadoop and bioinformatics

High Throughput Sequence Analysis
Mike Schatz, University of Maryland

Short Read Mapping

Seed & Extend: good alignments must have a significant exact alignment

Minimal exact alignment length = l/(k+1)

Expensive to scale

Need parallelization framework
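The pigeonhole bound behind seed-and-extend can be sketched directly: split a read into k+1 non-overlapping seeds, and if the read aligns with at most k mismatches, at least one seed must match exactly. This is a toy illustration; real aligners such as CloudBurst are far more involved.

```python
def minimal_seed_length(read_len, max_mismatches):
    """Pigeonhole bound: a read of length l aligning with at most k
    mismatches must contain an exact match of length >= l // (k + 1)."""
    return read_len // (max_mismatches + 1)

def seeds(read, k):
    """Split a read into k + 1 non-overlapping seeds; at least one is
    guaranteed to align exactly if the read has <= k mismatches."""
    n = minimal_seed_length(len(read), k)
    return [read[i:i + n] for i in range(0, len(read) - n + 1, n)][: k + 1]

read = "ACGTACGTACGTACGTACGTACGTACGTACGT"  # a 32 bp read
print(minimal_seed_length(len(read), 3))   # 8
print(seeds(read, 3))                      # four 8-mers
```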

CloudBurst

Catalog k-mers Collect seeds End-to-end alignment

http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369

Bowtie: Ultrafast short read aligner

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.

SOAPSnp: Consensus alignment and SNP calling

Ruiqiang Li, Yingrui Li, Xiaodong Fang, et al. (2009) "SNP detection for massively parallel whole-genome resequencing" Genome Res

Crossbow: Rapid whole genome SNP analysis

Ben Langmead

http://bowtie-bio.sourceforge.net/crossbow/index.shtml

Preprocessed reads

Preprocessed reads

Map: Bowtie

Preprocessed reads

Map: Bowtie

Sort: Bin and partition

Preprocessed reads

Map: Bowtie

Sort: Bin and partition

Reduce: SoapSNP

Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster

Comparing Genomes

Estimating relative evolutionary rates from sequence comparisons: identification of probable orthologs

Gene tree (genes A, B, C, D, E) mapped onto the species tree of S. cerevisiae and C. elegans.

Admissible comparisons: A or B vs. D; C vs. E

Inadmissible comparisons: A or B vs. E; C vs. D

1. Orthologs found using the reciprocal smallest distance algorithm

2. Build an alignment between the two orthologs:

>Sequence C
MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-…

>Sequence E
MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL…

3. Estimate the distance given a substitution matrix
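As a minimal stand-in for step 3, the sketch below computes a simple p-distance (fraction of differing aligned positions) rather than a full maximum-likelihood distance under a substitution matrix; the ten-residue fragments are taken from the start of the two sequences above.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned positions that differ -- a crude stand-in for
    a maximum-likelihood distance under a substitution matrix.
    Gapped positions are skipped."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

a = "MSGRTILAST"  # first 10 residues of Sequence C
b = "MSGRTILASK"  # first 10 residues of Sequence E
print(p_distance(a, b))  # 0.1 (one mismatch in ten positions)
```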

All-vs-all comparison between Genome I (genes a, b, c) and Genome J: align each gene pair and calculate distances (e.g. D = 0.2, 0.3, 0.1, 1.2, 0.1, 0.9).

Orthologs: Ib vs. Jc, D = 0.1 (the reciprocal smallest distance)

RSD algorithm summary

Prof. Dennis WallHarvard Medical School

Roundup is a database of orthologs and their evolutionary distances.To get started, click browse. Alternatively, you can read our documentation here.

Good luck, researchers!

massive computational demand

1,000 genomes = 5,994,000 processes = 23,976,000 hours = 2,737 years
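The arithmetic behind these totals checks out under plausible assumptions: the 12 parameter settings per genome pair and roughly 4 CPU-hours per process are inferred to match the quoted figures, not stated on the slide.

```python
genomes = 1000
pairs = genomes * (genomes - 1) // 2   # 499,500 genome pairs, all-vs-all
processes = pairs * 12                 # assume 12 RSD parameter settings per pair
hours = processes * 4                  # assume ~4 CPU-hours per process
years = hours / (24 * 365)             # serial CPU-years

print(processes)     # 5,994,000
print(hours)         # 23,976,000
print(round(years))  # 2,737
```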

compared 50+ genomes

trends in data sharing

data motion is hard

cloud services are a viable dataspace

share data

share applications

share results

Data Platform

App Platform

Data Platform

App Platform

Scalable Data Platform

Services

APIs

Getters Filters Savers

WORK

to conclude

big data

change thinking

data management

data processing

data sharing

think distributed

new software architectures

new computing paradigms

cloud services

the cloud works

deesingh@amazon.com
Twitter: @mndoci
Presentation ideas from @mza, James Hamilton, and @lessig

Thank  you!
