hw09 hadoop for bioinformatics
Hadoop World, NYC
Hadoop for Bioinformatics
Deepak Singh
Amazon Web Services
Via Reavel under a CC-BY-NC-ND license
By ~Prescott under a CC-BY-NC license
data sets
many data sets
PFAM
GENBANK
ENSEMBL
PDB
Many Others
manageable
Image: Matt Wood
Human genome
~100 TB/Week
>2 PB/Year
years
days
hours
gigabytes
terabytes
petabytes
really fast
typical informatics workflow
Via Christolakis under a CC-BY-NC-ND license
Via Argonne National Labs under a CC-BY-SA license
killer app
Via asklar under a CC-BY license
Image: Chris Dagdigian
rethink algorithms
rethink computing
rethink data management
rethink data sharing
operational mindset
scalability
we are data geeks not data center geeks
two key trends
develop applications
distribute applications
use applications
some work
some work
^ filters
High Throughput Sequence Analysis
Mike Schatz, University of Maryland
• Read Mapping
• Mapping & SNP Discovery
• De novo Genome Assembly
Short Read Mapping
Asian Individual Genome: 3.3 billion 35 bp reads, 104 GB (Wang et al., 2008)
African Individual Genome: 4.0 billion 35 bp reads, 144 GB (Bentley et al., 2008)
Alignment: > 10,000 CPU hours
Seed & Extend
Good alignments must have significant exact alignment
Minimal exact alignment length = l/(k+1)
Expensive to scale
Need parallelization framework
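The bound l/(k+1) follows from a pigeonhole argument: if a read of length l is cut into k+1 non-overlapping pieces, k differences can touch at most k of them, so at least one piece must match exactly. A minimal sketch of that arithmetic, assuming integer division for the piece length; the function names are illustrative and not taken from any of the tools on these slides:

```python
def min_exact_seed_length(read_length: int, max_differences: int) -> int:
    """Length of the exact match guaranteed by the pigeonhole bound l/(k+1)."""
    if read_length <= 0 or max_differences < 0:
        raise ValueError("read_length must be positive and max_differences non-negative")
    return read_length // (max_differences + 1)


def split_into_seeds(read: str, max_differences: int) -> list[tuple[int, str]]:
    """Cut a read into k+1 non-overlapping seeds, returned as (offset, seed)."""
    seed_len = min_exact_seed_length(len(read), max_differences)
    if seed_len == 0:
        raise ValueError("read is too short for this many differences")
    return [
        (offset, read[offset:offset + seed_len])
        for offset in range(0, seed_len * (max_differences + 1), seed_len)
    ]


# A 35 bp read with up to 3 differences yields four 8 bp seeds.
seeds = split_into_seeds("ACGTACGTTAGCCATGACGTACGTTAGCCATGACG", 3)
```

Shorter seeds (larger k) match in many more places by chance, which is one way to see why the approach becomes expensive to scale.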
CloudBurst
• Catalog k-mers
• Collect seeds
• End-to-end alignment
http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
CloudBurst efficiently reports every k-difference alignment of every read
many applications only need the best alignment
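The three stages named above (catalog k-mers, collect seeds, end-to-end alignment) line up with a map, shuffle and reduce. The following is only a toy, single-process sketch of that shape and not CloudBurst's code: it assumes mismatches only (no indels), emits every k-mer of each read rather than a chosen seed set, and all function names are hypothetical.

```python
from collections import defaultdict


def map_kmers(source, name, sequence, k):
    """Map: catalog every k-mer as (k-mer, (source, name, offset))."""
    for offset in range(len(sequence) - k + 1):
        yield sequence[offset:offset + k], (source, name, offset)


def shuffle(pairs):
    """Shuffle: collect all records that share the same k-mer (the seeds)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def count_mismatches(read, reference, ref_start):
    """Ungapped end-to-end comparison; None if the read falls off the reference."""
    if ref_start < 0 or ref_start + len(read) > len(reference):
        return None
    window = reference[ref_start:ref_start + len(read)]
    return sum(a != b for a, b in zip(read, window))


def align_reads(reference, reads, k, max_mismatches):
    """Reduce: extend each shared seed into a full alignment and keep the good ones."""
    pairs = list(map_kmers("ref", "ref", reference, k))
    for name, sequence in reads.items():
        pairs.extend(map_kmers("read", name, sequence, k))

    hits = set()
    for values in shuffle(pairs).values():
        ref_offsets = [off for src, _, off in values if src == "ref"]
        read_seeds = [(name, off) for src, name, off in values if src == "read"]
        for name, read_off in read_seeds:
            for ref_off in ref_offsets:
                start = ref_off - read_off
                mismatches = count_mismatches(reads[name], reference, start)
                if mismatches is not None and mismatches <= max_mismatches:
                    hits.add((name, start, mismatches))
    return sorted(hits)


alignments = align_reads("ACGTACGTTAGC", {"r1": "CGTTAG"}, k=3, max_mismatches=1)
```

Every placement within the mismatch budget is reported, which mirrors the "every alignment of every read" behaviour described on the slide; a best-alignment-only tool would keep just the minimum per read.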
Bowtie: Ultrafast short read aligner
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
SOAPsnp: Consensus alignment and SNP calling
Crossbow: Rapid whole genome SNP analysis
Ben Langmead
Preprocessed reads
Map: Bowtie
Sort: Bin and partition
Reduce: SOAPsnp
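The map, sort and reduce stages above can be sketched in a few lines to show how the data flows. This is a toy under loud assumptions: `toy_align` is a brute-force stand-in for the aligner, `reduce_step` is a naive majority-vote stand-in for the SNP caller, and neither reflects how Bowtie, SOAPsnp or Crossbow actually work internally.

```python
from collections import Counter, defaultdict


def toy_align(read, reference, max_mismatches=1):
    """Stand-in aligner: first best ungapped placement within the mismatch budget."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return None if best is None else best[0]


def map_step(reads, reference, bin_size):
    """Map: align each read and emit ((bin, position), base) per aligned base."""
    for read in reads:
        start = toy_align(read, reference)
        if start is None:
            continue  # unaligned reads are dropped
        for i, base in enumerate(read):
            position = start + i
            yield (position // bin_size, position), base


def sort_step(pairs):
    """Sort: bin records by genomic partition, ordered by position."""
    bins = defaultdict(list)
    for (bin_id, position), base in sorted(pairs):
        bins[bin_id].append((position, base))
    return bins


def reduce_step(records, reference, min_depth=2):
    """Reduce: stand-in SNP caller using the majority base at each position."""
    pileup = defaultdict(Counter)
    for position, base in records:
        pileup[position][base] += 1
    calls = []
    for position in sorted(pileup):
        depth = sum(pileup[position].values())
        base, _ = pileup[position].most_common(1)[0]
        if depth >= min_depth and base != reference[position]:
            calls.append((position, reference[position], base))
    return calls


def call_snps(reads, reference, bin_size=8):
    bins = sort_step(map_step(reads, reference, bin_size))
    calls = []
    for bin_id in sorted(bins):
        calls.extend(reduce_step(bins[bin_id], reference))
    return calls


snps = call_snps(["GTTGGC", "TTGGCC", "TGGCCA"], "ACGTACGTTAGCCATG")
```

Binning by position is what lets each reducer work on one genomic region independently of the others.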
Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster.
Comparing Genomes
Estimating relative evolutionary rates from sequence comparisons:
Identification of probable orthologs
Figure: species tree and gene tree (genes A, B, C, D, E; species S. cerevisiae, C. elegans)
Admissible comparisons: A or B vs. D; C vs. E
Inadmissible comparisons: A or B vs. E; C vs. D
Estimating relative evolutionary rates from sequence comparisons:
1. Orthologs found using the reciprocal smallest distance algorithm
2. Build alignment between two orthologs
>Sequence C
MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-…
>Sequence E
MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL…
3. Estimate distance given a substitution matrix
Figure: substitution matrix over Phe, Ala, Pro, Leu, Thr with entries µπ
Figure: sequences from Genome I compared pairwise with sequences from Genome J
Align sequences & calculate distances: D = 0.2, 0.3, 0.1, 1.2, 0.1, 0.9
Orthologs: Ib - Jc, D = 0.1
RSD algorithm summary
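The reciprocal step can be sketched as follows: a pair is kept only when each gene is the other's closest match. This is a minimal sketch that assumes the pairwise distances have already been computed; the real pipeline aligns candidate hits and estimates each distance first. The assignment of distances to gene pairs below is hypothetical, chosen only so the result matches the Ib - Jc, D = 0.1 pair on the slide.

```python
def closest(query, candidates, lookup):
    """Return the candidate with the smallest known distance to query, or None."""
    scored = [
        (lookup(query, candidate), candidate)
        for candidate in candidates
        if lookup(query, candidate) is not None
    ]
    return min(scored)[1] if scored else None


def reciprocal_smallest_distance(distances, genes_i, genes_j):
    """Keep (gene_i, gene_j, distance) when each is the other's nearest gene."""
    forward = lambda gi, gj: distances.get((gi, gj))
    reverse = lambda gj, gi: distances.get((gi, gj))

    orthologs = []
    for gene_i in genes_i:
        gene_j = closest(gene_i, genes_j, forward)
        if gene_j is None:
            continue
        # Reciprocal check: search back from genome J into genome I.
        if closest(gene_j, genes_i, reverse) == gene_i:
            orthologs.append((gene_i, gene_j, distances[(gene_i, gene_j)]))
    return orthologs


# Hypothetical distances keyed by (gene in genome I, gene in genome J).
toy_distances = {
    ("Ib", "Ja"): 0.2,
    ("Ib", "Jb"): 0.3,
    ("Ib", "Jc"): 0.1,
    ("Ia", "Jc"): 1.2,
    ("Ic", "Jc"): 0.9,
}
pairs = reciprocal_smallest_distance(
    toy_distances, ["Ia", "Ib", "Ic"], ["Ja", "Jb", "Jc"]
)
```

Here Ia and Ic both point at Jc, but Jc points back at Ib, so only the Ib - Jc pair survives the reciprocal check.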
Prof. Dennis Wall
Harvard Medical School
Roundup is a database of orthologs and their evolutionary distances. To get started, click browse. Alternatively, you can read our documentation here.
Good luck, researchers!
massive computational demand
1000 genomes = 5,994,000 processes = 23,976,000 hours
2737 years
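The figures above are internally consistent, which a quick check makes visible. The per-pair and per-process values below are not stated on the slide; they are simply the ratios implied by its numbers, and the year length of 365 days is an assumption.

```python
from math import comb

genomes = 1000
processes = 5_994_000      # from the slide
cpu_hours = 23_976_000     # from the slide

genome_pairs = comb(genomes, 2)              # all-against-all comparisons
processes_per_pair = processes / genome_pairs  # implied ratio, not stated on the slide
hours_per_process = cpu_hours / processes      # implied ratio, not stated on the slide
years = cpu_hours / (24 * 365)                 # serial time on a single CPU

print(genome_pairs, processes_per_pair, hours_per_process, round(years))
```

Because the number of pairs grows quadratically with the number of genomes, each added genome makes the periodic recomputation more expensive than the last.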
periodic task
must scale up
not scalability gurus
hadoop streaming
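Hadoop Streaming runs any executable as a mapper by feeding it input lines on standard input and collecting tab-separated key/value lines from standard output. A minimal sketch of what such a mapper could look like for genome-pair jobs, assuming one whitespace-separated pair of genome names per input line; the input format, `compute_pair` and its return value are hypothetical placeholders, not the actual Roundup code.

```python
import sys


def compute_pair(genome_a, genome_b):
    """Hypothetical placeholder for the per-pair ortholog computation."""
    return "pending"


def mapper(lines):
    """Turn 'genomeA genomeB' lines into 'key<TAB>value' records."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        genome_a, genome_b = sorted(fields)  # canonical order for the key
        yield f"{genome_a}|{genome_b}\t{compute_pair(genome_a, genome_b)}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point to call when the script is used as a streaming mapper."""
    for record in mapper(stdin):
        stdout.write(record + "\n")


records = list(mapper(["yeast worm\n", "not a valid line\n"]))
```

When deployed as a streaming mapper the script would end with a call to `main()`; it is left out here so the functions can be imported and tested without reading standard input.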
compared 50+ genomes
what’s next?
de novo assembly
machine learning and statistics
protein structure prediction
docking
trajectory analysis
key driving factors?
the ecosystem
Pig
Cascading
Hive
RHIPE
domain specific libraries and tools
http://aws.amazon.com/publicdatasets/
http://aws.amazon.com/education/
Twitter: @mndoci
Presentation ideas from @mza, @simon and @lessig
Thank you!