Parallel de novo Assembly of Complex (Meta)Genomes via HipMer
Aydın Buluç, Computational Research Division, LBNL
May 23, 2016. Invited Talk at HiCOMB 2016
Joint work (alphabetical) with Chaitanya Aluru, Jarrod Chapman, Rob Egan, Evangelos Georganas, Steven Hofmeyr, Lenny Oliker, Dan Rokhsar, Kathy Yelick
1. Parallel De Bruijn Graph Construction and Traversal for De Novo Genome Assembly, SC'14
2. A whole-genome shotgun approach for assembling and anchoring the hexaploid bread wheat genome. Genome Biology, 16(26), 2015.
3. meraligner: A fully parallel sequence aligner, IPDPS'15
4. HipMer: An Extreme-Scale De Novo Genome Assembler, SC'15
Outline and Acknowledgments
Work is funded by
De novo Genome Assembly
1. Three copies of the same novel.
For all men tragically great are made so through a certain morbidness… all mortal greatness is but disease.
2. Some text from the novel. All pages will be randomly cut into strips of characters. Random typos (errors) throughout each novel.
For a
all men tragically g
ally great
great are made so
3. A few strips of characters from one page.
4. All of the strips of characters from the 3 novels.
5. Every strip must be assembled as shown here to create a single copy of the novel.
For all men tragically great are made so
For a ally great
great are made so all men tragically g
1. Three copies of the same DNA.
2. Some part of the DNA sequence. It will be read into strips. There are random errors throughout the sequence.
ACCGTAGCAAAACCGGGTAGTCATACTACTACGTACTCATCT
3. The sequence is read into smaller pieces (reads). Cannot read the whole DNA sequence in one go.
ACCGTAGCAA
AAACCGGGTA TAGTCATACT
AAACCGGGTA
ACTACGTACT
4. All reads
5. Reconstruct the original DNA sequence from the read set.
ACCGTAGCAAAACCGGGTAGTCATACTACTACGTACTCATCT
ACCGTAGCAA
AAACCGGGTA
GTAGTCATACT
CTACTACGTAC
CGTACTCATCT
De novo Genome Assembly is hard
• There is no genome reference!
  – In principle we want to reconstruct an unknown genome sequence.
• Reads are significantly shorter than the whole genome.
  – Reads consist of 20 to 30K bases
  – Genomes vary in length and complexity, up to 30G bases
• Reads include errors.
• Genomes have repetitive regions.
  – Repetitive regions increase genome complexity.
Genomes vary in size
• Switchgrass: 1.4 Giga-base pairs (Gbp)
• Maize: 2.4 Gbp
• Miscanthus: 2.5 Gbp
• Human genome: 3 Gbp
• Barley genome: 7 Gbp
• Wheat genome: 17 Gbp
• Pine genome: 20 Gbp
• Salamander: 20-30 Gbp
Genome Assembly a la Meraculous
Pipeline: reads → (1) k-mers → (2) contigs → (3) scaffolds
Input: Reads that may contain errors
1. Chop reads into k-mers, process k-mers to exclude errors. I/O, bandwidth, and memory intensive.
2. Construct & traverse the de Bruijn graph of k-mers, generate contigs. Latency bound (irregular accesses).
3. Leverage read information to link contigs and generate scaffolds. Compute and I/O intensive.
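The first pipeline step, chopping reads into k-mers and discarding likely errors, can be sketched in a few lines. This is a serial toy, not HipMer code: the function names, the choice k = 4, and the "seen at least twice" cutoff are assumptions made for illustration.

```python
from collections import Counter

def kmers(read, k):
    """Yield every length-k substring of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

# Three reads taken from the DNA example earlier in the talk.
reads = ["ACCGTAGCAA", "AAACCGGGTA", "AAACCGGGTA"]
counts = count_kmers(reads, k=4)

# Assumed cutoff: keep only k-mers seen at least twice.
solid = {km for km, c in counts.items() if c >= 2}
```

With real data the table of counts is far too large for one node, which is why the following slides distribute it.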
Meraculous parallelization with UPC
• UPC is a PGAS (Partitioned Global Address Space) parallel language with one-sided communication.
• Core graph algorithms implemented in UPC (graph ⇔ hash table).
• We need the notion of a huge global distributed hash table.
• Irregular access pattern in the distributed hash table
  – One-sided communication is handy!
• Portable implementation: can run on any machine without change!
• Result of this work: Complete assembly of human genome in 8.4 minutes using 15K cores
• Original code required 2 days and a large memory machine.
Parallel k-mer analysis: pass 1
On each processor P0, P1, …, Pn:
1. Reads: parse to k-mers
2. HyperLogLog: k-mer cardinality estimation
3. Parallel reduction: G = global cardinality estimate
4. Create local Bloom filters, each of size G/n
Also identify high-frequency k-mers at this step.
Parallel k-mer analysis: pass 2
On each processor P0, P1, …, Pn:
1. Reads: parse to k-mers
2. Hash k-mers & find owners
3. All-to-all communication of k-mers
4. Received k-mers go through the Bloom filter into the local set
Store a k-mer in the local set only if it was seen before.
The local set does the actual counting of the k-mers.
Why use a Bloom filter?
A Bloom filter is a probabilistic data structure used for membership queries.
• Given a Bloom filter, we can ask: "Have we seen this k-mer before?"
• No false negatives.
• May have false positives (in practice 5% false positive rate)
k-mers with frequency = 1 are useless (either an error or cannot be distinguished from an error), and can safely be eliminated.
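The Bloom filter idea can be sketched as follows. This is a single-process toy under stated assumptions: the `BloomFilter` class, its SHA-256 based hashing, and the sizes are illustrative choices, not HipMer's implementation. A k-mer enters the candidate set only when the filter reports it was seen before; an exact recount then removes any singleton that slipped in through a false positive.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; hash positions are derived from SHA-256 (an assumption)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        """Insert item; return True if every one of its bits was already set."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen

def count_non_singletons(kmer_list, num_bits=1 << 16, num_hashes=3):
    bloom = BloomFilter(num_bits, num_hashes)
    # First sweep: a k-mer becomes a candidate only on a repeat sighting.
    candidates = {km for km in kmer_list if bloom.add(km)}
    # Second sweep: exact counts for candidates only.
    counts = dict.fromkeys(candidates, 0)
    for km in kmer_list:
        if km in counts:
            counts[km] += 1
    # Drop false positives, i.e. candidates that were really singletons.
    return {km: c for km, c in counts.items() if c >= 2}

stream = ["ACG", "ACG", "TTT", "GGA", "ACG", "GGA"]
counts = count_non_singletons(stream)
```

The memory saving comes from singletons (mostly errors) occupying a few bits in the filter instead of a full hash table entry.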
Parallel k-mer analysis: pass 3
On each processor P0, P1, …, Pn:
1. Reads: parse to k-mers
2. Find extensions of k-mers, hash them & find their owners
3. All-to-all communication of k-mers and extensions
4. Received k-mers & extensions go into the local set
Keep track of the number of occurrences of each extension for each k-mer.
ACCCACTCTTAGCFAACCTTGCGCATXA
AGGCAATGGTAGFFAAAATTGCCCATXX
TTCCAGTTTTGCCAAACTTGGCTTTTCA
Use a threshold & find the high quality extensions of k-mers.
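A small sketch of extension counting and thresholding follows. The encoding is an assumption based on the letters visible on the slide: a single base when exactly one extension passes the threshold, "F" for a fork (several pass), and "X" when none passes. Function names and the threshold values are hypothetical.

```python
from collections import Counter, defaultdict

def count_extensions(reads, k):
    """For each k-mer, count the bases seen immediately before and after it."""
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            if i > 0:
                left[km][read[i - 1]] += 1
            if i + k < len(read):
                right[km][read[i + k]] += 1
    return left, right

def classify(ext_counts, threshold):
    """Return the unique high-quality base, 'F' for a fork, or 'X' for none."""
    good = [base for base, c in ext_counts.items() if c >= threshold]
    if not good:
        return "X"
    if len(good) == 1:
        return good[0]
    return "F"

reads = ["ACGTA", "ACGTA", "ACGTC"]
left, right = count_extensions(reads, k=3)
```

Here `classify(right["CGT"], 2)` gives "A" because the extension C was seen only once, while lowering the threshold to 1 turns the same k-mer into a fork.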
High-frequency k-mers
Long-tailed distribution for genomes with repetitive content:
- The maximum count for any k-mer in the wheat dataset is 451 million
- Our original scheme (SC'14) was "owner counts", after an all-to-all
- Counting an item w/ 451 million occurrences alone is load imbalanced
Solution: Quickly identify high-frequency k-mers using minimal communication during the "cardinality estimation" step and treat them specially by using local counters.
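The slides do not say how the high-frequency k-mers are found. One standard streaming technique for this kind of problem is the Misra-Gries summary, sketched below purely as an assumption about what such a step could look like. With `num_counters` counters, any item occurring more than n / (num_counters + 1) times in a stream of length n is guaranteed to survive in the summary.

```python
def misra_gries(stream, num_counters):
    """Return a small summary that contains every sufficiently frequent item."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < num_counters:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["AAA", "CCC", "AAA", "GGG", "AAA", "TTT", "AAA", "ACG", "AAA"]
summary = misra_gries(stream, num_counters=2)
```

The surviving counts are lower bounds, not exact frequencies, so candidates found this way would still need exact counting, for example with the local counters mentioned above.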
• In Meraculous, the de Bruijn graph is represented as a hash table
• K-mers are both nodes in the graph & keys in the hash table
• An edge in the graph connects two nodes that overlap in k-1 bases
• The edges in the graph are put in the hash table by storing the extensions of the k-mers as their corresponding values
GAT ATC TCT CTG TGA
AAC
ACC
CCG
AAT
ATG
TGC
Parallel De Bruijn Graph Construction
• Challenge 1: The hash table that represents the de Bruijn graph is large (100s of GBs up to 10s of TBs!)
  – Solution: Distribute the graph over multiple processors. The global address space of UPC is handy!
• Challenge 2: Parallel hash table construction introduces communication and synchronization costs
  – Solution: Aggregate messages to reduce the number of messages and synchronization → 10x-20x performance improvement.
Aggregating stores optimization
[Figure: processor Pi keeps a local buffer for each of P0, P1, …, Pn. The contents of its buffer for P0 are transferred to a buffer local to P0, and from there into the buckets of the distributed hash table that are local to P0.]
P0 stores the k-mers & extensions in its local buckets in a lock-free & communication-free fashion.
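The effect of aggregation on message counts can be simulated in one process. Everything here is illustrative: the `AggregatingSender` class, the CRC32 owner function, and the buffer sizes are assumptions, and a "message" is just a counter increment where a real code would perform a remote transfer.

```python
import itertools
import zlib

class AggregatingSender:
    """Buffers (k-mer, value) pairs per destination and ships them in batches."""

    def __init__(self, tables, buffer_size):
        self.tables = tables                      # tables[p]: dict owned by processor p
        self.buffers = [[] for _ in tables]       # one local buffer per destination
        self.buffer_size = buffer_size
        self.messages_sent = 0

    def owner(self, kmer):
        # Deterministic hash so the owner of a k-mer is the same on every run.
        return zlib.crc32(kmer.encode()) % len(self.tables)

    def store(self, kmer, value):
        dest = self.owner(kmer)
        self.buffers[dest].append((kmer, value))
        if len(self.buffers[dest]) >= self.buffer_size:
            self._flush(dest)

    def _flush(self, dest):
        if self.buffers[dest]:
            self.tables[dest].update(self.buffers[dest])  # one aggregated transfer
            self.messages_sent += 1
            self.buffers[dest] = []

    def flush_all(self):
        for dest in range(len(self.buffers)):
            self._flush(dest)

def run(kmers, num_procs, buffer_size):
    tables = [dict() for _ in range(num_procs)]
    sender = AggregatingSender(tables, buffer_size)
    for km in kmers:
        sender.store(km, "ext")   # "ext" is a placeholder value
    sender.flush_all()
    return tables, sender.messages_sent

all_kmers = ["".join(p) for p in itertools.product("ACGT", repeat=4)]
tables_1, msgs_1 = run(all_kmers, num_procs=4, buffer_size=1)
tables_64, msgs_64 = run(all_kmers, num_procs=4, buffer_size=64)
```

Both runs end with identical tables, but the unbuffered run sends one message per k-mer while the buffered run sends only a handful.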
Goal:
• Traverse the de Bruijn graph and find UU contigs (chains of UU nodes), or alternatively
• find the connected components which consist of the UU contigs.
• Main idea:
  – Pick a seed
  – Iteratively extend it by consecutive lookups in the distributed hash table
GAT ATC TCT CTG TGA
AAC
ACC
CCG
AAT
ATG
TGC
Contig 1: GATCTGA
Contig 2: AACCG
Contig 3: AATGC
Parallel De Bruijn Graph Traversal
Assume one of the UU contigs to be assembled is:
CGTATTGCCAATGCAACGTATCATGGCCAATCCGAT
Processor Pi picks a random k-mer from the distributed hash table as seed (here GCAACGTATC, so k = 10).
Pi knows that the forward extension is A.
Pi uses the last k-1 bases and the forward extension and forms: CAACGTATCA
Pi does a lookup in the distributed hash table for CAACGTATCA.
Pi iterates this process until it reaches the "right" endpoint of the UU contig.
Pi also iterates this process backwards until it reaches the "left" endpoint of the UU contig.
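The walk described above can be reproduced serially on the example contig. In this sketch an ordinary dict stands in for the distributed hash table, the helper names are hypothetical, and the table is built from the known contig only so that there is something to traverse. It assumes a linear chain with no repeated k-mers, which holds for this 36-base example at k = 10.

```python
def build_table(contig, k):
    """Map each k-mer to (left extension, right extension); 'X' marks an endpoint."""
    table = {}
    for i in range(len(contig) - k + 1):
        left = contig[i - 1] if i > 0 else "X"
        right = contig[i + k] if i + k < len(contig) else "X"
        table[contig[i:i + k]] = (left, right)
    return table

def extend(table, seed, direction):
    """Collect bases added in one direction (1 = forward, 0 = backward)."""
    added = []
    kmer = seed
    while True:
        base = table[kmer][direction]
        if base == "X":
            break
        # Forward: last k-1 bases + extension. Backward: extension + first k-1 bases.
        nxt = kmer[1:] + base if direction == 1 else base + kmer[:-1]
        if nxt not in table:
            break
        added.append(base)
        kmer = nxt
    return "".join(added)

def traverse(table, seed):
    forward = extend(table, seed, 1)
    backward = extend(table, seed, 0)
    # Backward bases were collected nearest-first, so reverse them.
    return backward[::-1] + seed + forward

contig = "CGTATTGCCAATGCAACGTATCATGGCCAATCCGAT"
table = build_table(contig, k=10)
assembled = traverse(table, "GCAACGTATC")
```

In the parallel setting each `table[...]` access is a lookup that may land on a remote processor, which is why this phase is latency bound.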
Multiple processors on the same UU contig
CGTATTGCCAATGCAACGTATCATGGCCAATCCGAT
However, processors Pi, Pj and Pt might have picked initial seeds from the same UU contig.
• Processors Pi, Pj and Pt have to collaborate and concatenate subcontigs in order to avoid redundant work.
• Solution: lightweight synchronization scheme based on a state machine
Pipeline: reads → (1) k-mers → (2) contigs → (3) scaffolds
PARALLELIZING SCAFFOLDING
Alignment for De novo Genome Assembly
To be reused for metagenomics w/ modifications

Read ID   start-pos  end-pos  Contig ID    start-pos  end-pos
--------------------------------------------------------------
Read 42   1          4        Contig 101   152        155
Read 42   130        150      Contig 61    1          21
Read 90   1          150      Contig 500   101        250

Aligning queries to target sequences
Queries (Reads), Targets (Contigs)
• A query and a target should match in at least k bases in order to be aligned
• We call seed a substring of a sequence (query or target) with length equal to k
Building seed index
Target bases shown in the figure: A C T G G G G C A, labeled Target 0 and Target 1
Seed index after construction: ACT, CTG, TGG, GGC, GCA
Lookup seed index
Query sequence: G C T G
Seed GCT is not found in the seed index.
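Seed index construction and lookup can be sketched as below. The split of the figure's bases into Target 0 = ACTGG and Target 1 = GGCA is an assumption, chosen because it yields exactly the five seeds listed for k = 3. The function names are hypothetical and a plain dict replaces the distributed hash table.

```python
from collections import defaultdict

def build_seed_index(targets, k):
    """Map each length-k seed to the (target id, position) pairs where it occurs."""
    index = defaultdict(list)
    for tid, seq in enumerate(targets):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((tid, pos))
    return index

def lookup(index, query, k):
    """Return (seed, query position, target id, target position) for every hit."""
    hits = []
    for qpos in range(len(query) - k + 1):
        seed = query[qpos:qpos + k]
        for tid, tpos in index.get(seed, []):
            hits.append((seed, qpos, tid, tpos))
    return hits

targets = ["ACTGG", "GGCA"]          # assumed split of the figure's bases
index = build_seed_index(targets, k=3)
hits = lookup(index, "GCTG", k=3)
```

The query's first seed GCT misses, as on the slide; its second seed CTG hits Target 0, and a full aligner would then extend that hit into an alignment.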
• In de novo assembly, billions of reads must be aligned to contigs
• First aligner to parallelize the seed index construction ("fully" parallel)
Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, and Katherine Yelick. meraligner: A fully parallel sequence aligner. In Proceedings of the IPDPS, 2015.
[Figure: runtime in seconds (log scale, 128 to 16384) versus number of cores (480 to 15360). Series: merAligner-wheat, ideal-wheat, merAligner-human, ideal-human, BWAmem-human, Bowtie2-human.]
Parallel Genome Alignment for De novo Assembly
Scaffolding beyond alignment
• High-level idea: Leverage read-to-contig information to generate links among contigs.
• Distributed hash tables to index the link information.
• Form a contig graph by using the links and traverse it to form scaffolds.
LINK: < contig i → contig j, link info >
contig 1 → contig 2 → contig 3 → contig 4, with ties 1→2, 2→3, 3→4
Scaffold: Link 1→2, Link 2→3, Link 3→4
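Turning links into scaffolds amounts to walking chains in the contig graph. The sketch below is deliberately minimal and rests on assumptions: each contig has at most one outgoing link, orientation and gap sizes are ignored, and the function name is hypothetical.

```python
def build_scaffolds(contigs, links):
    """Chain contigs into scaffolds by following contig i -> contig j links."""
    nxt = dict(links)                      # assumes at most one outgoing link per contig
    has_incoming = set(nxt.values())
    scaffolds = []
    for c in contigs:
        if c in has_incoming:
            continue                       # not a chain start
        chain = [c]
        while chain[-1] in nxt:
            chain.append(nxt[chain[-1]])
        scaffolds.append(chain)
    return scaffolds

# Links from the figure, plus an unlinked contig 5 added for illustration.
scaffolds = build_scaffolds([1, 2, 3, 4, 5], [(1, 2), (2, 3), (3, 4)])
```

A chain that closes into a cycle has no start and would be skipped by this sketch; a real scaffolder has to break such cases explicitly.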
Scaffolding beyond alignment
• Computing contig depths and termination states
• Contig bubble identification (for diploid genomes)
• Ordering and orientation of contigs (inherently serial as implemented)
• Insert size estimation
• Locating splints and spans
• Contig link generation
• Gap closing
  Conceptually: Mini/local assembly, embarrassingly parallelizable
Tools employed:
- Distributed hash tables
- Speculative execution
[Figure: (a) SPLINT: one read, contig k, contig m. (b) SPAN: read 1, read 2, contig i, contig j.]
[Figure: Gap closing. Gap 1, Gap 2, Gap 3 in the assembled genome are handled as Task 1, Task 2, Task 3.]
Strong scaling (wheat genome) on Cray XC30
[Figure: runtime in seconds (log scale, 32 to 16384) versus number of cores (960 to 15360). Series: overall time, kmer analysis, contig generation, scaffolding, ideal overall time.]
• Complete assembly of wheat genome in 39 minutes (15K cores).
• Original Meraculous would require (projected time) a week (~300x slower) and a shared memory machine with 1 TB memory.
Strong scaling (human genome) on Cray XC30 @ SC'15
• Complete assembly of human genome in 8.4 minutes (15K cores).
• 350x speedup over original Meraculous (took 2,880 minutes and a large shared memory machine).
[Figure: runtime in seconds (log scale, 32 to 8192) versus number of cores (480 to 15360). Series: overall time, kmer analysis, contig generation, scaffolding, ideal overall time.]
Strong scaling (human genome) on Cray XC30 @ Now
• Complete assembly of human genome in 4 minutes using 23K cores.
• 700x speedup over original Meraculous (took 2,880 minutes and a large shared memory machine).
[Figure: runtime in seconds (log scale, 16 to 8192) versus number of cores (1920 to 23040). Series: ideal IO cached, overall No IO cached, overall IO cached, kmer analysis, contig generation, scaffolding, IO.]
• Instead of multiple copies of the same book, now you have the whole library to assemble!
How about metagenomes?
• 1984 is the "dominant species" in this sample.
• Too easy: there are also previously unknown books in the mix.
Option 1: Bin the reads to genomes; run single genome assembly
  ☹ Reads are too short to contain enough information for binning.
Option 2: Generate contigs and bin the contigs
  ☹ How do you eliminate errors for low-coverage organisms?
Metagenome challenges
- Why does metagenome assembly benefit from explicit error correction when single genome assembly does not?
+ Uneven sequencing depth: Errors in high-depth regions might be more frequent than true k-mers in low-depth regions.
- Who cares about low-depth regions?
+ In fact, that's what matters most.
[Figure panels: Single genome, Metagenome]
Adapting HipMer to metagenomes
[Figure: HipMer pipeline data flow. Stages: k-mer analysis, contig generation, alignment, scaffolding, gap closing, with an "increase k and iterate" loop. Data items A to G are placed on a data-size scale of 100 GB, 1 TB, 10 TB; the figure also carries a "persistent" label.]
Inputs: (A) Short reads and (F) long insert (mate paired) libraries
Intermediate: (B) Error-free k-mers w/ extensions (a.k.a. UFX), (C) contigs …
Output: (G) final genome scaffolds
Secret metagenome sauce #1
DEGAS Berkeley meeting
Iterative contig generation
1. k-mer analysis
2. De Bruijn graph traversal
3. Bubble merging and hair removal
4. Iterative graph pruning
5. Local assembly
6. Iterate for larger k
[Figure labels: bubble, hair]
HipMer on a MOCK community
– Mix of 25 bacteria with different abundances (JGI dataset)
– Dataset size ~100 GBytes

Statistics                 metaHipMer    metaSPAdes
----------------------------------------------------
# contigs                  2,670         5,100
Total length > 10 kbp      92,245,198    92,152,728
Total length > 50 kbp      69,493,125    77,292,121
Misassemblies              58            134
Mismatches per 100 kbp     3.48          77.05
Genome fraction (%)        92.10         91.10
Summary
• HipMer's core algorithms scale to tens of thousands of cores and yield performance improvements from days/weeks down to minutes.
• HipMer breaks the hardware limitations by enabling distributed-memory scaling
• Use of de novo assembly in time sensitive applications like precision medicine is no longer formidable!
• Source release of HipMer: https://sourceforge.net/projects/hipmer/
• Ongoing work: Adapt HipMer for metagenomic analysis (some hints and preliminary results given in this talk).