HW09: Hadoop for Bioinformatics

Hadoop World, NYC Hadoop for Bioinformatics Deepak Singh Amazon Web Services

Upload: cloudera-inc

Post on 14-Jul-2015

TRANSCRIPT

Page 1: Hw09   Hadoop For Bioinfomatics

Hadoop World, NYC

Hadoop for Bioinformatics

Deepak Singh

Amazon Web Services

Page 2: Hw09   Hadoop For Bioinfomatics
Page 4: Hw09   Hadoop For Bioinfomatics

By ~Prescott under a CC-BY-NC license

Page 5: Hw09   Hadoop For Bioinfomatics

data sets

Page 6: Hw09   Hadoop For Bioinfomatics

many data sets

Page 7: Hw09   Hadoop For Bioinfomatics

PFAM

GENBANK ENSEMBL

PDB

Many Others

Page 8: Hw09   Hadoop For Bioinfomatics

manageable

Page 9: Hw09   Hadoop For Bioinfomatics

Image: Matt Wood

Page 10: Hw09   Hadoop For Bioinfomatics

Human genome

Image: Matt Wood

Page 11: Hw09   Hadoop For Bioinfomatics
Page 12: Hw09   Hadoop For Bioinfomatics
Page 13: Hw09   Hadoop For Bioinfomatics

Image: Matt Wood

Page 14: Hw09   Hadoop For Bioinfomatics

~100 TB/Week

Image: Matt Wood

Page 15: Hw09   Hadoop For Bioinfomatics

~100 TB/Week

>2 PB/Year

Image: Matt Wood

Page 16: Hw09   Hadoop For Bioinfomatics
Page 17: Hw09   Hadoop For Bioinfomatics

years

Page 18: Hw09   Hadoop For Bioinfomatics

days

Page 19: Hw09   Hadoop For Bioinfomatics

hours

Page 20: Hw09   Hadoop For Bioinfomatics

gigabytes

Page 21: Hw09   Hadoop For Bioinfomatics

terabytes

Page 22: Hw09   Hadoop For Bioinfomatics

petabytes

Page 23: Hw09   Hadoop For Bioinfomatics

really fast

Page 24: Hw09   Hadoop For Bioinfomatics
Page 25: Hw09   Hadoop For Bioinfomatics

typical informatics workflow

Page 26: Hw09   Hadoop For Bioinfomatics
Page 27: Hw09   Hadoop For Bioinfomatics
Page 28: Hw09   Hadoop For Bioinfomatics
Page 29: Hw09   Hadoop For Bioinfomatics
Page 31: Hw09   Hadoop For Bioinfomatics

Via Argonne National Labs under a CC-BY-SA license

Page 32: Hw09   Hadoop For Bioinfomatics

Via Argonne National Labs under a CC-BY-SA license

killer app

Page 34: Hw09   Hadoop For Bioinfomatics
Page 35: Hw09   Hadoop For Bioinfomatics
Page 36: Hw09   Hadoop For Bioinfomatics

Image: Chris Dagdigian

Page 37: Hw09   Hadoop For Bioinfomatics
Page 38: Hw09   Hadoop For Bioinfomatics

rethink algorithms

Page 39: Hw09   Hadoop For Bioinfomatics

rethink computing

Page 40: Hw09   Hadoop For Bioinfomatics

rethink data management

Page 41: Hw09   Hadoop For Bioinfomatics

rethink data sharing

Page 42: Hw09   Hadoop For Bioinfomatics

operational mindset

Page 43: Hw09   Hadoop For Bioinfomatics

scalability

Page 44: Hw09   Hadoop For Bioinfomatics

we are data geeks not data center geeks

Page 45: Hw09   Hadoop For Bioinfomatics

two key trends

Page 46: Hw09   Hadoop For Bioinfomatics
Page 47: Hw09   Hadoop For Bioinfomatics
Page 48: Hw09   Hadoop For Bioinfomatics
Page 49: Hw09   Hadoop For Bioinfomatics

develop applications

Page 50: Hw09   Hadoop For Bioinfomatics

distribute applications

Page 51: Hw09   Hadoop For Bioinfomatics

use applications

Page 52: Hw09   Hadoop For Bioinfomatics

some work

Page 53: Hw09   Hadoop For Bioinfomatics

some workfilters

^

Page 54: Hw09   Hadoop For Bioinfomatics

High Throughput Sequence Analysis

Mike Schatz, University of Maryland

Page 55: Hw09   Hadoop For Bioinfomatics

• Read Mapping

• Mapping & SNP Discovery

• De novo Genome Assembly

Page 56: Hw09   Hadoop For Bioinfomatics

Short Read Mapping

Page 57: Hw09   Hadoop For Bioinfomatics

Asian Individual Genome: 3.3 billion 35 bp reads, 104 GB (Wang et al., 2008)

African Individual Genome: 4.0 billion 35 bp reads, 144 GB (Bentley et al., 2008)

Page 58: Hw09   Hadoop For Bioinfomatics

Alignment: > 10,000 CPU hours

Page 59: Hw09   Hadoop For Bioinfomatics

Seed & Extend

Good alignments must have significant exact alignment

Minimal exact alignment length = l/(k+1)

Page 62: Hw09   Hadoop For Bioinfomatics

Seed & Extend

Good alignments must have significant exact alignment

Minimal exact alignment length = l/(k+1)

Expensive to scale

Need parallelization framework
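
The bound l/(k+1) follows from the pigeonhole principle: if a read of length l aligns with at most k differences, cutting it into k+1 non-overlapping pieces leaves at least one piece untouched. A minimal sketch of that idea, where the function names `seed_length` and `seeds` are hypothetical and chosen only for illustration:

```python
from typing import List, Tuple


def seed_length(read_len: int, k: int) -> int:
    """Minimal exact-match length guaranteed for a read with at most k differences."""
    if read_len <= 0 or k < 0:
        raise ValueError("read_len must be positive and k non-negative")
    return read_len // (k + 1)


def seeds(read: str, k: int) -> List[Tuple[int, str]]:
    """Cut a read into k+1 non-overlapping seeds as (offset, seed) pairs.

    With at most k differences, at least one of these seeds matches exactly.
    """
    s = seed_length(len(read), k)
    if s == 0:
        raise ValueError("read is too short for k differences")
    return [(i * s, read[i * s:(i + 1) * s]) for i in range(k + 1)]


# A 35 bp read with up to 3 differences must contain an exact match of 8 bp.
print(seed_length(35, 3))
print(seeds("ACGTACGTAC", 1))
```

Longer seeds give fewer spurious hits, so allowing more differences (larger k) shortens the seeds and raises the cost of the extend step.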

Page 63: Hw09   Hadoop For Bioinfomatics

CloudBurst

Catalog k-mers

Collect seeds

End-to-end alignment
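
The three stages above can be sketched as an in-memory map, shuffle and reduce. This is a toy illustration, not CloudBurst's implementation: it counts mismatches only (no insertions or deletions), assumes all reads have the same length, and every function name here is hypothetical.

```python
from collections import defaultdict


def map_kmers(name, seq, s, is_ref):
    """Catalog k-mers: the reference emits every s-mer, reads emit non-overlapping seeds."""
    step = 1 if is_ref else s
    for i in range(0, len(seq) - s + 1, step):
        yield seq[i:i + s], (is_ref, name, i)


def shuffle(pairs):
    """Collect seeds: group reference and read records that share a k-mer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def align(ref, reads, k):
    """End-to-end alignment: extend each shared seed and keep hits with <= k mismatches."""
    read_len = len(next(iter(reads.values())))  # assumes equal-length reads
    s = read_len // (k + 1)
    pairs = list(map_kmers("ref", ref, s, True))
    for name, seq in reads.items():
        pairs.extend(map_kmers(name, seq, s, False))
    hits = set()
    for records in shuffle(pairs).values():
        ref_positions = [pos for is_ref, _, pos in records if is_ref]
        read_seeds = [(n, off) for is_ref, n, off in records if not is_ref]
        for name, offset in read_seeds:
            for ref_pos in ref_positions:
                start = ref_pos - offset
                if start < 0 or start + read_len > len(ref):
                    continue
                d = hamming(ref[start:start + read_len], reads[name])
                if d <= k:
                    hits.add((name, start, d))
    return sorted(hits)


print(align("ACGTTGCAAGGCTA", {"r1": "TGCAAGGC", "r2": "TGCATGGC"}, k=1))
```

In a real Hadoop job the grouping in `shuffle` is done by the framework between the map and reduce phases, which is what lets the seed catalog be spread across many machines.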

Page 64: Hw09   Hadoop For Bioinfomatics

http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369

Page 65: Hw09   Hadoop For Bioinfomatics
Page 66: Hw09   Hadoop For Bioinfomatics

CloudBurst efficiently reports every k-difference alignment of every read

Page 67: Hw09   Hadoop For Bioinfomatics

many applications only need the best alignment

Page 68: Hw09   Hadoop For Bioinfomatics

Bowtie: Ultrafast short read aligner

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009, 10 (3): R25.

Page 69: Hw09   Hadoop For Bioinfomatics

SOAPsnp: Consensus alignment and SNP calling

Page 70: Hw09   Hadoop For Bioinfomatics

Crossbow: Rapid whole genome SNP analysis

Ben Langmead

Page 71: Hw09   Hadoop For Bioinfomatics
Page 75: Hw09   Hadoop For Bioinfomatics

Preprocessed reads

Map: Bowtie

Sort: Bin and partition

Reduce: SOAPsnp
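
The map, sort and reduce steps on this slide can be sketched with toy stand-ins. Neither Bowtie nor SOAPsnp is called here: the aligner is a brute-force ungapped scan and the SNP caller is a majority vote, and the reference, bin size and function names are all hypothetical.

```python
from collections import Counter, defaultdict

REF = "ACGTACGTTAGCCGATAGCT"  # hypothetical toy reference
BIN_SIZE = 10                 # hypothetical partition width


def mismatches(read, start, ref=REF):
    return sum(a != b for a, b in zip(read, ref[start:start + len(read)]))


def map_align(read, ref=REF, max_mismatches=1):
    """Map (stand-in for Bowtie): best ungapped placement, emitted per base."""
    starts = range(len(ref) - len(read) + 1)
    best = min(starts, key=lambda s: mismatches(read, s, ref))
    if mismatches(read, best, ref) > max_mismatches:
        return
    for offset, base in enumerate(read):
        pos = best + offset
        yield pos // BIN_SIZE, (pos, base)


def sort_partition(pairs):
    """Sort: bin aligned bases by genomic region and order them by position."""
    bins = defaultdict(list)
    for key, value in pairs:
        bins[key].append(value)
    return {key: sorted(values) for key, values in sorted(bins.items())}


def reduce_call_snps(pileup, ref=REF, min_depth=2):
    """Reduce (stand-in for SOAPsnp): majority-vote consensus per position."""
    by_pos = defaultdict(Counter)
    for pos, base in pileup:
        by_pos[pos][base] += 1
    for pos in sorted(by_pos):
        base, count = by_pos[pos].most_common(1)[0]
        if count >= min_depth and base != ref[pos]:
            yield pos, ref[pos], base


def call_snps(reads):
    pairs = [kv for read in reads for kv in map_align(read)]
    return [snp for pileup in sort_partition(pairs).values()
            for snp in reduce_call_snps(pileup)]


print(call_snps(["TAGCTGAT", "GCTGATAG", "GTACGTTA"]))
```

Binning by genomic region is the design point worth noting: it guarantees that all bases covering a position reach the same reducer, so each region can be called independently.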

Page 76: Hw09   Hadoop For Bioinfomatics

Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster

Page 77: Hw09   Hadoop For Bioinfomatics

Comparing Genomes

Page 78: Hw09   Hadoop For Bioinfomatics

Estimating relative evolutionary rates from sequence comparisons: Identification of probable orthologs

[Figure: species tree and gene tree; genes A, B, C, D, E; species S. cerevisiae and C. elegans]

Admissible comparisons: A or B vs. D; C vs. E

Inadmissible comparisons: A or B vs. E; C vs. D

Page 79: Hw09   Hadoop For Bioinfomatics

Estimating relative evolutionary rates from sequence comparisons:

[Figure: species tree and gene tree; genes A, B, C, D, E; species S. cerevisiae and C. elegans]

1. Orthologs found using the reciprocal smallest distance algorithm

2. Build alignment between two orthologs

>Sequence C
MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-…

>Sequence E
MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL…

3. Estimate distance given a substitution matrix

[Table residue: substitution matrix over amino acids (Phe, Ala, Pro, Leu, Thr) with entries of the form µπ]
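
To make step 3 concrete, here is a deliberately simpler stand-in for the distance estimate. It does not use a substitution matrix as the slide does; it computes the fraction of differing aligned sites (p-distance) and applies the Poisson correction d = -ln(1 - p). The sequences below are short hypothetical examples, not the truncated ones shown on the slide.

```python
import math


def p_distance(a: str, b: str, gap: str = "-") -> float:
    """Fraction of aligned, ungapped columns where the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in columns) / len(columns)


def poisson_distance(a: str, b: str) -> float:
    """Poisson-corrected distance: expected substitutions per site."""
    p = p_distance(a, b)
    if p >= 1.0:
        raise ValueError("sequences are saturated; distance is undefined")
    return -math.log(1.0 - p)


# Hypothetical aligned fragments: 10 ungapped columns, 1 difference.
seq_c = "MSGRTILAS-K"
seq_e = "MSGRTVLASTK"
print(p_distance(seq_c, seq_e))
print(poisson_distance(seq_c, seq_e))
```

A matrix-based estimate weights each substitution by how likely it is, so it would generally give different values than this uniform correction.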

Page 80: Hw09   Hadoop For Bioinfomatics

RSD algorithm summary

[Figure: genes a, b, c in Genome I and Genome J. Align sequences & calculate distances: D = 0.2, D = 0.3, D = 0.1. Align sequences & calculate distances in the reciprocal direction: D = 1.2, D = 0.1, D = 0.9. Orthologs: Ib - Jc, D = 0.1]
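
The reciprocal step in the RSD summary can be sketched in a few lines. The assignment of the slide's distance values to specific gene pairs below is an assumption (the figure layout did not survive extraction), and `rsd_ortholog` is a hypothetical name; the actual RSD tool also selects its candidates from BLAST hits, which this toy omits.

```python
# Assumed pairing of the slide's distances: Ib against each gene in J,
# then the closest J gene (Jc) back against each gene in I.
DIST = {
    ("Ib", "Ja"): 0.2,
    ("Ib", "Jb"): 0.3,
    ("Ib", "Jc"): 0.1,
    ("Ia", "Jc"): 1.2,
    ("Ic", "Jc"): 0.9,
}


def rsd_ortholog(query, genome_i, genome_j, dist):
    """Return (query, partner, distance) if the smallest distance is reciprocal, else None."""
    forward = [j for j in genome_j if (query, j) in dist]
    if not forward:
        return None
    partner = min(forward, key=lambda j: dist[(query, j)])
    backward = [i for i in genome_i if (i, partner) in dist]
    closest_back = min(backward, key=lambda i: dist[(i, partner)])
    if closest_back != query:
        return None
    return query, partner, dist[(query, partner)]


genes_i = ["Ia", "Ib", "Ic"]
genes_j = ["Ja", "Jb", "Jc"]
print(rsd_ortholog("Ib", genes_i, genes_j, DIST))
print(rsd_ortholog("Ia", genes_i, genes_j, DIST))
```

The reciprocal check is what rejects Ia here: its closest match is Jc, but Jc's closest match is Ib.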

Page 81: Hw09   Hadoop For Bioinfomatics

Prof. Dennis Wall

Harvard Medical School

Page 82: Hw09   Hadoop For Bioinfomatics

Roundup is a database of orthologs and their evolutionary distances. To get started, click browse. Alternatively, you can read our documentation here.

Good luck, researchers!

Page 83: Hw09   Hadoop For Bioinfomatics

massive computational demand

Page 84: Hw09   Hadoop For Bioinfomatics

1000 genomes = 5,994,000 processes = 23,976,000 hours

Page 85: Hw09   Hadoop For Bioinfomatics

2737 years
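
The figures on the last two slides can be checked with a little arithmetic. The per-pair and per-process factors below are inferred from the slide's totals, not stated in the talk: the numbers are consistent with an all-against-all comparison of 1,000 genomes, 12 processes per genome pair, and about 4 hours per process.

```python
from math import comb

genomes = 1000
total_processes = 5_994_000   # from the slide
total_hours = 23_976_000      # from the slide

pairs = comb(genomes, 2)                        # unordered genome pairs
processes_per_pair = total_processes // pairs   # inferred, not stated
hours_per_process = total_hours // total_processes  # inferred, not stated
years = total_hours / (24 * 365)                # serial runtime on one CPU

print(pairs, processes_per_pair, hours_per_process, round(years))
```

Because the pair count grows quadratically with the number of genomes, each newly sequenced genome adds more work than the one before it.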

Page 86: Hw09   Hadoop For Bioinfomatics

periodic task

Page 87: Hw09   Hadoop For Bioinfomatics

must scale up

Page 88: Hw09   Hadoop For Bioinfomatics

not scalability gurus

Page 89: Hw09   Hadoop For Bioinfomatics

hadoop streaming
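
Hadoop Streaming lets a mapper be any program that reads lines on standard input and writes tab-separated key/value lines on standard output, which suits teams that already have working scripts. A minimal sketch of such a mapper follows; the input format (one genome pair per line) and the `compare_genomes` placeholder are assumptions for illustration, not the pipeline described in the talk.

```python
import sys


def compare_genomes(genome_a, genome_b):
    """Placeholder for the real per-pair job (hypothetical); returns a dummy value."""
    return "pending"


def mapper(lines, compare=compare_genomes):
    """Turn 'genomeA<TAB>genomeB' lines into (key, value) records."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        a, b = fields
        yield f"{a},{b}", compare(a, b)


def run(stdin, stdout, compare=compare_genomes):
    """Streaming contract: one 'key<TAB>value' line per record."""
    for key, value in mapper(stdin, compare):
        stdout.write(f"{key}\t{value}\n")


# In a script used as the streaming mapper, the entry point would be:
#     run(sys.stdin, sys.stdout)
```

Keeping the logic in `mapper` and the I/O in `run` makes the script testable on a laptop with plain text files before it is submitted to a cluster.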

Page 90: Hw09   Hadoop For Bioinfomatics
Page 91: Hw09   Hadoop For Bioinfomatics

compared 50+ genomes

Page 92: Hw09   Hadoop For Bioinfomatics

what’s next?

Page 93: Hw09   Hadoop For Bioinfomatics

de novo assembly

Page 94: Hw09   Hadoop For Bioinfomatics

machine learning and statistics

Page 95: Hw09   Hadoop For Bioinfomatics

protein structure prediction

Page 96: Hw09   Hadoop For Bioinfomatics

docking

Page 97: Hw09   Hadoop For Bioinfomatics

trajectory analysis

Page 98: Hw09   Hadoop For Bioinfomatics

key driving factors?

Page 99: Hw09   Hadoop For Bioinfomatics

the ecosystem

Page 100: Hw09   Hadoop For Bioinfomatics

Pig

Page 101: Hw09   Hadoop For Bioinfomatics

Cascading

Page 102: Hw09   Hadoop For Bioinfomatics

Hive

Page 103: Hw09   Hadoop For Bioinfomatics

RHIPE

Page 104: Hw09   Hadoop For Bioinfomatics

domain specific libraries and tools

Page 106: Hw09   Hadoop For Bioinfomatics
Page 107: Hw09   Hadoop For Bioinfomatics

http://aws.amazon.com/education/

Page 108: Hw09   Hadoop For Bioinfomatics
Page 109: Hw09   Hadoop For Bioinfomatics

[email protected]; Twitter: @mndoci

Presentation ideas from @mza, @simon and @lessig

Thank you!