hw09 hadoop for bioinformatics

Hadoop World, NYC

Hadoop for Bioinformatics

Deepak Singh

Amazon Web Services

Via Reavel under a CC-BY-NC-ND license

http://flickr.com/photos/reavel/2404891348/sizes/m/

By ~Prescott under a CC-BY-NC license

http://flickr.com/photos/ppym1/

data sets

many data sets

PFAM

GENBANK

ENSEMBL

PDB

Many Others

manageable

Image: Matt Wood

Human genome

Image: Matt Wood

~100 TB/Week

>2 PB/Year

Image: Matt Wood

gigabytes

terabytes

petabytes

really fast

typical informatics workflow

Via Christolakis under a CC-BY-NC-ND license

http://www.flickr.com/photos/43052603@N00/

Via Argonne National Labs under a CC-BY-SA license

http://www.flickr.com/photos/argonne/


killer app



Via aslakr under a CC-BY license

http://www.flickr.com/photos/aslakr/

Image: Chris Dagdigian

rethink algorithms

rethink computing

rethink data management

rethink data sharing

operational mindset

scalability

we are data geeks not data center geeks

two key trends

develop applications

distribute applications

use applications

some work

some work filters

High Throughput Sequence Analysis

Mike Schatz, University of Maryland

• Read Mapping

• Mapping & SNP Discovery

• De novo Genome Assembly

Short Read Mapping

Asian Individual Genome: 3.3 billion 35 bp reads, 104 GB (Wang et al., 2008)

African Individual Genome: 4.0 billion 35 bp reads, 144 GB (Bentley et al., 2008)

Alignment > 10000 CPU hrs

Seed & Extend

Good alignments must have significant exact alignment

Minimal exact alignment length = l/(k+1)
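The bound above follows from the pigeonhole principle: if a read of length l aligns with at most k differences and is cut into k+1 pieces, at least one piece must match exactly. A minimal Python sketch, assuming integer division for the seed length; the function names are hypothetical and are not taken from any aligner's API:

```python
def min_seed_length(read_length, max_differences):
    """Length of the exact seed guaranteed by the pigeonhole principle."""
    if read_length <= 0 or max_differences < 0:
        raise ValueError("read_length must be positive, max_differences non-negative")
    return read_length // (max_differences + 1)


def split_into_seeds(read, max_differences):
    """Cut a read into k+1 non-overlapping pieces; the last absorbs any remainder."""
    k = max_differences
    seed_len = min_seed_length(len(read), k)
    pieces = [read[i * seed_len:(i + 1) * seed_len] for i in range(k)]
    pieces.append(read[k * seed_len:])
    return pieces


if __name__ == "__main__":
    # A 35 bp read with up to 3 differences: seed length is 35 // 4.
    print(min_seed_length(35, 3))
    print(split_into_seeds("ACGTACGTAC", 1))
```

With k differences spread over k+1 pieces, at least one piece is untouched, so an index lookup on that piece is enough to find every candidate location.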



Expensive to scale



Need parallelization framework

CloudBurst

Catalog k-mers → Collect seeds → End-to-end alignment

http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
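The three phases named on the slide can be illustrated with a toy, in-memory stand-in for a MapReduce job. This is only a sketch under simplifying assumptions: it counts mismatches only (no insertions or deletions), runs in a single process, and every function name is hypothetical rather than part of CloudBurst.

```python
from collections import defaultdict


def map_kmers(reference, reads, seed_len):
    """Map: emit (k-mer, record) pairs for the reference and for each read."""
    for pos in range(len(reference) - seed_len + 1):
        yield reference[pos:pos + seed_len], ("ref", pos)
    for read_id, read in reads.items():
        # Non-overlapping seeds are enough to satisfy the pigeonhole bound.
        for offset in range(0, len(read) - seed_len + 1, seed_len):
            yield read[offset:offset + seed_len], ("read", read_id, offset)


def shuffle(pairs):
    """Shuffle: group records that share the same k-mer."""
    groups = defaultdict(list)
    for kmer, record in pairs:
        groups[kmer].append(record)
    return groups


def reduce_alignments(groups, reference, reads, max_mismatches):
    """Reduce: extend each shared seed into an end-to-end alignment."""
    hits = set()
    for records in groups.values():
        ref_positions = [r[1] for r in records if r[0] == "ref"]
        read_seeds = [(r[1], r[2]) for r in records if r[0] == "read"]
        for ref_pos in ref_positions:
            for read_id, offset in read_seeds:
                start = ref_pos - offset
                read = reads[read_id]
                if start < 0 or start + len(read) > len(reference):
                    continue
                window = reference[start:start + len(read)]
                mismatches = sum(a != b for a, b in zip(read, window))
                if mismatches <= max_mismatches:
                    hits.add((read_id, start, mismatches))
    return sorted(hits)


if __name__ == "__main__":
    reference = "TTGACCGTAAGCTTGCA"
    reads = {"r1": "CCGTAAGC", "r2": "AAGCTAGC"}
    seed_len = 8 // (1 + 1)  # l / (k + 1) with l = 8, k = 1
    groups = shuffle(map_kmers(reference, reads, seed_len))
    print(reduce_alignments(groups, reference, reads, max_mismatches=1))
```

Because every k-mer group is processed independently in the reduce step, the work partitions naturally across machines.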

CloudBurst efficiently reports every k-difference alignment of every read

many applications only need the best alignment

Bowtie: Ultrafast short read aligner

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.

SOAPsnp: Consensus alignment and SNP calling


Crossbow: Rapid whole genome SNP analysis


Ben Langmead

Preprocessed reads

Map: Bowtie

Sort: Bin and partition

Reduce: SOAPsnp
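The map / sort / reduce shape of this pipeline can be sketched with toy stand-ins. The code below is not Crossbow: `toy_align` is a brute-force placeholder for the aligner, `toy_call_snps` is a majority-vote placeholder for the SNP caller, and `BIN_SIZE` and the per-base binning are assumptions made for illustration.

```python
from collections import Counter, defaultdict

BIN_SIZE = 8  # hypothetical partition width, in bases


def toy_align(read, reference, max_mismatches=1):
    """Stand-in for the Map step: start of the best ungapped hit, or None."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return None if best is None else best[0]


def map_reads(reads, reference):
    """Map: emit one (bin, position, base) record per aligned base."""
    for read in reads:
        start = toy_align(read, reference)
        if start is None:
            continue
        for i, base in enumerate(read):
            pos = start + i
            yield pos // BIN_SIZE, pos, base


def sort_and_partition(records):
    """Sort: bin records by region so each reducer sees one sorted partition."""
    partitions = defaultdict(list)
    for bin_id, pos, base in records:
        partitions[bin_id].append((pos, base))
    return {bin_id: sorted(items) for bin_id, items in sorted(partitions.items())}


def toy_call_snps(partition, reference, min_depth=2):
    """Stand-in for the Reduce step: majority-vote consensus per position."""
    pileup = defaultdict(Counter)
    for pos, base in partition:
        pileup[pos][base] += 1
    snps = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if count >= min_depth and base != reference[pos]:
            snps.append((pos, reference[pos], base))
    return snps


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTAACCGT"
    # All three reads carry T instead of A at 0-based position 7.
    reads = ["GTTGCTAG", "TGCTAGGC", "GCTAGGCT"]
    partitions = sort_and_partition(map_reads(reads, reference))
    for bin_id, partition in partitions.items():
        print(bin_id, toy_call_snps(partition, reference))
```

Binning by genome region is what lets the SNP-calling step run in parallel: each reducer needs only the bases that fall inside its own partition.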

Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster

Comparing Genomes

Estimating relative evolutionary rates from sequence comparisons:

Identification of probable orthologs

[Figure: species tree and gene tree for S. cerevisiae and C. elegans, with genes A, B, C, D, E]

Admissible comparisons: A or B vs. D; C vs. E

Inadmissible comparisons: A or B vs. E; C vs. D

Estimating relative evolutionary rates from sequence comparisons:

1. Orthologs found using the reciprocal smallest distance algorithm

2. Build alignment between two orthologs

>Sequence C
MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-…

>Sequence E
MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL…

3. Estimate distance given a substitution matrix

[Table: substitution rate matrix over amino acids (Phe, Ala, Pro, Leu, Thr, ...), with off-diagonal entries of the form µπ]
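To make the distance step concrete, here is a deliberately simpler stand-in for a likelihood estimate under a substitution matrix: the observed fraction of differing sites (p-distance) and its Poisson correction, d = -ln(1 - p). This is a swapped-in textbook approximation, not the method used in the talk, and the function names are hypothetical. The demo uses only the first ten residues of the two sequences shown above.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of aligned, ungapped columns at which the residues differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in columns) / len(columns)


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance d = -ln(1 - p), in substitutions per site."""
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        raise ValueError("sequences are saturated; distance is undefined")
    return -math.log(1.0 - p)


if __name__ == "__main__":
    seq_c = "MSGRTILAST"  # first ten residues of Sequence C
    seq_e = "MSGRTILASK"  # first ten residues of Sequence E
    print(p_distance(seq_c, seq_e))
    print(poisson_distance(seq_c, seq_e))
```

A substitution matrix refines this by weighting each amino acid replacement by how likely it is, instead of treating all differences equally.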

[Figure: RSD between Genome I and Genome J. Sequences are aligned and distances calculated in both directions (D = 0.2, 0.3, 0.1, then D = 1.2, 0.1, 0.9); the reciprocal smallest distance identifies the orthologs Ib - Jc, D = 0.1]

RSD algorithm summary
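The reciprocal check at the heart of RSD can be sketched in a few lines of Python. This is a simplified illustration that assumes the candidate hits and pairwise distances are already available; the function names are hypothetical, and the assignment of the slide's D values to particular genes in the demo table is an assumption.

```python
def smallest_distance(query, candidates, distance):
    """Return the candidate with the smallest distance to the query."""
    return min(candidates, key=lambda gene: distance(query, gene))


def reciprocal_smallest_distance(query, genome_i, genome_j, distance):
    """Return (query, ortholog, D) if the best hit is reciprocal, else None."""
    forward = smallest_distance(query, genome_j, distance)
    backward = smallest_distance(forward, genome_i, distance)
    if backward == query:
        return query, forward, distance(query, forward)
    return None


if __name__ == "__main__":
    # Illustrative distances only; gene labels are assumed.
    table = {
        ("Ib", "Ja"): 0.2, ("Ib", "Jb"): 0.3, ("Ib", "Jc"): 0.1,
        ("Ia", "Jc"): 1.2, ("Ic", "Jc"): 0.9,
    }

    def lookup(a, b):
        # Distances are symmetric; accept either key order.
        return table[(a, b)] if (a, b) in table else table[(b, a)]

    print(reciprocal_smallest_distance(
        "Ib", ["Ia", "Ib", "Ic"], ["Ja", "Jb", "Jc"], lookup))
```

Requiring the smallest distance in both directions is what filters out one-way best hits that are not true orthologs.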

Prof. Dennis Wall

Harvard Medical School

Roundup is a database of orthologs and their evolutionary distances. To get started, click browse. Alternatively, you can read our documentation here.

Good luck, researchers!

http://roundup.hms.harvard.edu/roundup/index.php?action=input_browse

http://roundup.hms.harvard.edu/site/documentation

massive computational demand

1000 genomes = 5,994,000 processes = 23,976,000 hours

2737 years
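The figures above can be checked with a few lines of arithmetic. The per-pair and per-process factors below are not stated on the slide; they are inferred from its totals (5,994,000 / 499,500 = 12 and 23,976,000 / 5,994,000 = 4) and should be read as assumptions.

```python
genomes = 1000
pairs = genomes * (genomes - 1) // 2   # unordered genome pairs: 499,500
processes_per_pair = 12                # assumption: inferred from the slide's totals
hours_per_process = 4                  # assumption: inferred from the slide's totals

processes = pairs * processes_per_pair
cpu_hours = processes * hours_per_process
cpu_years = cpu_hours / (24 * 365)

if __name__ == "__main__":
    print(f"{processes:,} processes")
    print(f"{cpu_hours:,} hours")
    print(f"{cpu_years:.0f} years on a single CPU")
```

The pair count grows quadratically with the number of genomes, which is why the workload needs to scale out across many machines.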

periodic task

must scale up

not scalability gurus

hadoop streaming
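Hadoop Streaming lets any program that reads records from standard input and writes tab-separated key/value lines to standard output act as a mapper or reducer. A minimal mapper sketch in Python, assuming each input line names one genome pair; `compute_orthologs` is a hypothetical placeholder for the real per-pair computation, and the input layout is an assumption:

```python
def compute_orthologs(genome_a, genome_b):
    """Hypothetical placeholder for the per-pair ortholog computation."""
    return f"orthologs({genome_a},{genome_b})"


def map_lines(lines):
    """Turn 'genomeA<TAB>genomeB' input lines into 'key<TAB>value' output lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank records
        genome_a, genome_b = line.split("\t")
        key = f"{genome_a}:{genome_b}"
        yield f"{key}\t{compute_orthologs(genome_a, genome_b)}"


if __name__ == "__main__":
    # Sample records stand in for the lines a streaming job would supply.
    sample = ["S_cerevisiae\tC_elegans\n", "S_cerevisiae\tH_sapiens\n"]
    for record in map_lines(sample):
        print(record)
```

In an actual streaming job the script would iterate over `sys.stdin` instead of the sample list, so existing command-line tools can be reused without writing Java.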

compared 50+ genomes

what’s next?

de novo assembly

machine learning and statistics

protein structure prediction

docking

trajectory analysis

key driving factors?

the ecosystem

Cascading

domain specific libraries and tools

http://aws.amazon.com/publicdatasets/



http://aws.amazon.com/education/

[email protected]; Twitter: @mndoci

Presentation ideas from @mza, @simon and @lessig

Thank you!
