parallel de novo assembly of complex (meta) genomes via …aydin/hicomb.pdfparallel de novo assembly...

47
Parallel de novo Assembly of Complex (Meta) Genomes via HipMer Aydın Buluç Computational Research Division, LBNL May 23, 2016 Invited Talk at HiCOMB 2016

Upload: others

Post on 18-Mar-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

ParalleldenovoAssemblyofComplex(Meta)GenomesviaHipMer

AydınBuluçComputationalResearchDivision,LBNL

May23,2016InvitedTalkatHiCOMB 2016

Page 2: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Jointwork(alphabetical)withChaitanyaAluru,JarrodChapman,RobEgan,Evangelos Georganas,StevenHofmeyr,LennyOliker,DanRokhsar,KathyYelick

1. ParallelDeBruijn GraphConstructionandTraversalfordenovoGenomeAssembly,SC’14

2. Awhole-genomeshotgunapproachforassemblingandanchoringthehexaploid breadwheatgenome.GenomeBiology,16(26),2015.

3. meraligner:Afullyparallelsequencealigner,IPDPS’154. HipMer:AnExtreme-ScaleDeNovoGenomeAssembler,SC’15

OutlineandAcknowledgments

Workisfundedby

Page 3: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

DenovoGenomeAssembly

1.Threecopiesofthesamenovel.

For all men tragically great are made so through a certain morbidness… all mortal greatness is but disease.

2.Sometextfromthenovel.Allpageswillberandomlycutintostripsofcharacters.Randomtypos(errors) throughouteachnovel.

For a

all men tragically g

ally great

great are made so

3.Afewstripsofcharactersfromonepage.

4.Allofthestripsofcharactersfromthe3novels.

5.Everystripmustbeassembledasshownheretocreateasinglecopyofthenovel.

For all men tragically great are made so

For a ally great

great are made so all men tragically g

1.ThreecopiesofthesameDNA.

2.SomepartoftheDNAsequence.Itwillbereadintostrips.Therearerandomerrors throughoutthesequence.

ACCGTAGCAAAACCGGGTAGTCATACTACTACGTACTCATCT

3.Thesequenceisreadintosmallerpieces(reads).CannotreadwholeDNAsequenceinonego.

ACCGTAGCAA

AAACCGGGTA TAGTCATACT

AAACCGGGTA

ACTACGTACT

4.Allreads

5.ReconstructoriginalDNAsequencefromthereadset.

ACCGTAGCAAAACCGGGTAGTCATACTACTACGTACTCATCT

ACCGTAGCAA

AAACCGGGTA

GTAGTCATACT

CTACTACGTAC

CGTACTCATCT

Page 4: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

• There is no genome reference! – In principle we want to reconstruct unknown genome sequence.

• Reads are significantly shorter than whole genome.– Reads consist of 20 to 30K bases– Genomes vary in length and complexity – up to 30G bases

• Reads include errors.

• Genomes have repetitive regions.– Repetitive regions increase genome complexity.

DenovoGenomeAssemblyishard

Page 5: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

• Switchgrass:1.4Giga-base pairs(Gbp)• Maize:2.4Gbp• Miscanthus:2.5Gbp• Humangenome:3Gbp• Barleygenome:7Gbp• Wheatgenome:17Gbp• Pinegenome:20Gbp• Salamander:20-30Gbp

Genomesvaryinsize

Page 6: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

reads

contigs

scaffolds

k-mers

1

2

3

Input:Readsthatmaycontainerrors

Chopreadsintok-mers,processk-mers toexcludeerrors

Construct&traversedeBruijn graphofk-mers,generatecontigs

Leveragereadinformationtolinkcontigs andgeneratescaffolds.

GenomeAssemblyalaMeraculous

I/O,bandwidth,andmemoryintensive

Latencybound(irregularaccesses)

ComputeandI/Ointensive

Page 7: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

MeraculousparallelizationwithUPC

• UPC is a PGAS (Partitioned Global Address Space) parallel language with one-sided communication.

• Core graph algorithms implemented in UPC (graph ó hash table).

• We need the notion of a huge global distributed hash table.• Irregular access pattern in the the distributed hash table

– One-sided communication is handy!• Portable implementation: can run any machine without change!• Result of this work: Complete assembly of human genome in

8.4 minutes using 15K cores

• Original code required 2 days and a large memory machine.

Page 8: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Reads

Parallelk-mer analysis:pass1

P0

P1

Pn

Parsetok-mers

Hyperloglog

Hyperloglog

Hyperloglog

k-mer cardinalityestimation

G=Globalcardinalityestimation

ParallelReduction

BloomFilterofsize(G/n)

BloomFilterofsize(G/n)

BloomFilterofsize(G/n)

CreatelocalBloomfilters

Alsoidentifyhigh-frequencyk-mers

atthisstep

Page 9: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Reads

Parallelk-mer analysis:pass2

P0

P1

Pn

Parsetok-mers

Hashk-mers

Hashk-mers

Hashk-mers

Hashk-mers &findowners

Receivedk-mers

Receivedk-mers

Receivedk-mers

All-to-allcommunication

ofk-mers

BloomFilter

BloomFilter

BloomFilter

Localset

Localset

Localset

Storeak-mer inthelocalsetonlyifitwas

seenbefore

Thelocalsetdoestheactualcounting

ofthek-mers

Page 10: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Bloomfilterisaprobabilistic datastructureusedformembershipqueries• Givenabloomfilter,wecanask:“Haveweseenthisk-mer before?”• Nofalsenegatives.• Mayhavefalsepositives(inpractice5%falsepositiverate)

k-mers withfrequency=1areuseless(eithererrororcannotbedistinguishedfromerror),andcansafelybeeliminated.

WhyuseaBloomfilter?

Page 11: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Reads

Parallelk-mer analysis:pass3

P0

P1

Pn

Parsetok-mers

Hashk-mers

Hashk-mers

Hashk-mers

Findextensionsofk-mers,hashthem&findtheirowners

Receivedk-mers &extensions

Receivedk-mers &extensions

Receivedk-mers &extensions

All-to-allcommunicationofk-mers andextensions

Keeptrackofthenumberofoccurrencesofeachextensionfor

eachk-mer

Localset

Localset

Localset

ACCCACTCTTAGCFAACCTTGCGCATXA

AGGCAATGGTAGFFAAAATTGCCCATXX

TTCCAGTTTTGCCAAACTTGGCTTTTCA

Useathreshold&findthehigh

qualityextensionsofk-mers

Page 12: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

High-frequencyk-mers

Long-taileddistributionforgenomeswithrepetitivecontent:- Themaximumcountforanyk-mer inthewheatdatasetis451million- Ouroriginalscheme(SC’14)was“ownercounts”,afteranall-to-all- Countinganitemw/451millionoccurrencesaloneisloadimbalanced

Solution:Quicklyidentifyhigh-frequencyk-mers usingminimalcommunicationduringthe“cardinalityestimation”stepandtreatthemspeciallybyusinglocalcounters.

Page 13: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

• InMeraculous,thedeBruijn graphisrepresentedasahashtable

• K-mers arebothnodesinthegraph&keysinthehashtable

• Anedgeinthegraphconnectstwonodesthatoverlapink-1bases

• Theedgesinthegraphareputinthehashtablebystoringtheextensionsofthek-mers astheircorrespondingvalues

GAT ATC TCT CTG TGA

AAC

ACC

CCG

AAT

ATG

TGC

ParallelDeBruijn GraphConstruction

Page 14: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

ParallelDeBruijn GraphConstruction

• Challenge 1: The hash table that represents the de Bruijngraph is large (100s of GBs up to 10s of TBs !)– Solution: Distribute the graph over multiple processors.

The global address space of UPC is handy!

• Challenge 2: Parallel hash tables construction introduces communication and synchronization costs– Solution: Aggregate messages to reduce number of

messages and synchronization à 10x-20x performance improvement.

Page 15: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Pi

LocalbufferforP0

LocalbufferforP1

LocalbufferforPn

BufferlocaltoP0

DistributedHashtable

LocaltoP0

LocaltoP0

LocaltoP0

LocaltoP0

Aggregatingstoresoptimization

P0

P0 storesthek-mers &extensionsinitslocalbucketsinalock-free&communication-freefashion

Page 16: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Goal:• TraversethedeBruijn graphandfindUUcontigs (chainsofUU

nodes),oralternatively• findtheconnectedcomponentswhichconsistoftheUUcontigs.

• Mainidea:– Pickaseed– Iterativelyextenditbyconsecutivelookupsinthedistributedhash

table

GAT ATC TCT CTG TGA

AAC

ACC

CCG

AAT

ATG

TGC

Contig 1:GATCTGAContig 2:AACCG

Contig 3:AATGC

ParallelDeBruijn GraphTraversal

Page 17: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

CGTATTGCCAATGCAACGTATCATGGCCAATCCGAT

Assumeone oftheUUcontigs tobeassembledis:

ParallelDeBruijn GraphTraversal

Page 18: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

ParallelDeBruijn GraphTraversal

CGTATTGCCAATGCAACGTATCATGGCCAATCCGAT

ProcessorPi picksarandomk-mer fromthedistributedhashtableasseed:

Pi knowsthatforwardextensionisA

Pi usesthelastk-1basesandtheforwardextensionandforms:CAACGTATCA

Pi doesalookupinthedistributedhashtableforCAACGTATCA

Pi iteratesthisprocessuntilitreachesthe“right”endpointoftheUUcontig

Pi alsoiteratesthisprocessbackwardsuntilitreachesthe“left”endpointoftheUUcontig

Page 19: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

MultipleprocessorsonthesameUUcontig

CGTATTGCCAATGCAACGTATCATGGCCAATCCGAT

However,processorsPi,Pj andPt mighthavepickedinitialseedsfromthesameUUcontig

Pi Pj Pt

• ProcessorsPi,Pj andPt havetocollaborateandconcatenatesubcontigsinordertoavoidredundantwork.

• Solution:lightweightsynchronizationschemebasedonastatemachine

Page 20: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

reads

contigs

scaffolds

k-mers

1

2

3PARALLELIZINGSCAFFOLDING

AlignmentforDenovο GenomeAssembly

Tobereusedformetagenomicsw/modifications

Page 21: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Read 42

Contig 101 Contig 61

Read ID start-pos end-pos Contig ID start-pos end-pos--------------------------------------------------------------------------------------------------Read 42 1 4 Contig 101 152 155Read 42 130 150 Contig 61 1 21Read 90 1 150 Contig 500 101 250

Read 90

Contig 500

Queries(Reads)

Targets(Contigs)

• Aqueryandatargetshouldmatchinatleastkbasesinordertobealigned• Wecallseed asubstringofasequence(queryortarget)withlengthequaltok

Aligningqueriestotargetsequences

Page 22: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

A C T G G G G C A Target 0: Target 1:

Seed Index

Buildingseedindex

Page 23: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Seed: ACT

A C T G G G G C A Target 0: Target 1:

Seed Index

Buildingseedindex

Page 24: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Seed: ACT

�Seed: CTG

A C T G G G G C A Target 0: Target 1:

Seed Index

Buildingseedindex

Page 25: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Seed: ACT

�Seed: CTG

�Seed: TGG

A C T G G G G C A Target 0: Target 1:

Seed Index

Buildingseedindex

Page 26: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Seed: ACT

�Seed: GGC

�Seed: CTG

�Seed: TGG

A C T G G G G C A Target 0: Target 1:

Seed Index

Buildingseedindex

Page 27: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Seed: ACT

�Seed: GGC

�Seed: CTG

�Seed: TGG

�Seed: GCA

A C T G G G G C A Target 0: Target 1:

Seed Index

Buildingseedindex

Page 28: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Seed: ACT

�Seed: GGC

�Seed: CTG

�Seed: TGG

�Seed: GCA

A C T G G G G C A Target 0: Target 1:

Seed Index

query sequence : G C T G

query’s seed

Lookupseedindex

SeedGCTisnotfoundinseedindex

Page 29: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Seed: ACT

�Seed: GGC

�Seed: CTG

�Seed: TGG

�Seed: GCA

A C T G G G G C A Target 0: Target 1:

Seed Index

query sequence : G C T G

query’s seed

Lookupseedindex

Page 30: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

• Indenovoassembly,billionsofreadsmustbealignedtocontigs• Firstalignertoparallelizetheseedindexconstruction(“fully”parallel)

Evangelos Georganas, Aydın Buluç,JarrodChapman,LeonidOliker,DanielRokhsar,andKatherineYelick.meraligner:Afullyparallelsequencealigner.InProceedingsoftheIPDPS,2015.

128

256

512

1024

2048

4096

8192

16384

480 960 1920 3840 7680 15360

Seco

nds

Number of Cores

merAligner-wheatideal-wheatmerAligner-humanideal-humanBWAmem-humanBowtie2-human

ParallelGenomeAlignmentforDenovoAssembly

Page 31: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Scaffoldingbeyondalignment

• High-level idea: Leverage read-to-contig information to generate links among contigs.• Distributed hash tables to index the link information.

• Formacontig graphbyusingthelinksandtraverseittoformscaffolds.

contig i contig j

LINK : < contig i ! contig j , link info >

contig 1 contig 2 contig 3 contig 4

tie 1!2 tie 2!3 tie 3!4

Scaffold Link 1!2 Link 2!3 Link 3!4

Page 32: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Scaffoldingbeyondalignment

• Computingcontig depthsandterminationstates• Contig bubbleidentification(fordiploidgenomes)• Orderingandorientationofcontigs(inherentlyserialasimplemented)• Insertsizeestimation• Locatingsplints andspans• Contig linkgeneration• GapclosingConceptually:Mini/localassembly– embarrassinglyparallelizable-

Toolsemployed:- Distributedhashtables- Speculativeexecution

read contig k

contig m

SPLINT

read 1 read 2

contig i

contig j

SPAN (a) (b)

Gap 1 Gap 2 Gap 3

Task 1 Task 2 Task 3

Gap closing

Assembled genome

Page 33: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Strongscaling(wheatgenome)onCrayXC30

32

64

128

256

512

1024

2048

4096

8192

16384

960 1920 3840 7680 15360

Seco

nds

Number of Cores

overall timekmer analysis

contig generationscaffolding

ideal overall time

• Completeassemblyofwheatgenomein39minutes(15Kcores).• OriginalMeraculous wouldrequire(projectedtime)aweek(~300x

slower) andasharedmemorymachinewith1TBmemory.

Page 34: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Strongscaling(humangenome)onCrayXC30@SC’15

• Completeassemblyofhumangenomein8.4minutes(15Kcores).• 350xspeedupoveroriginalMeraculous(took2,880minutesanda

largesharedmemorymachine).

32

64

128

256

512

1024

2048

4096

8192

480 960 1920 3840 7680 15360

Seco

nds

Number of Cores

overall timekmer analysis

contig generationscaffolding

ideal overall time

Page 35: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Strongscaling(humangenome)onCrayXC30@Now

• Completeassemblyofhumangenomein4 minutesusing23Kcores.• 700xspeedupoveroriginalMeraculous (took2,880minutesandalarge

sharedmemorymachine).

16

32

64

128

256

512

1024

2048

4096

8192

1920 3840 7680 15360 23040

Seco

nds

Number of Cores

ideal IO cached overall No IO cached

overall IO cachedkmer analysis

contig generationscaffolding

IO

Page 36: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

• Insteadofmultiplecopiesofthesamebook,howyouhavethewholelibrarytoassemble!

Howaboutmetagenomes?

• 1984isthe“dominantspecies”inthissample.• Tooeasy: therearealsopreviouslyunknownbooksinthemix.

Option1:Binthereadstogenomes;runsinglegenomeassemblyL Readsaretooshorttocontainenoughinformationforbinning.Option2:Generatecontigs andbinthecontigsL Howdoyoueliminateerrorsforlow-coverageorganisms?

Page 37: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Metagenomechallenges

- Whydoesmetagenomeassemblybenefitsfromexpliciterrorcorrectionwhensinglegenomeassemblydoesnot?+Unevensequencingdepth:Errorsinhigh-depthregionsmightbemorefrequentthantruek-mers inlow-depthregions.- Whocaresaboutlow-depthregions?+Infact,that’swhatmattersmost.

Singlegenome Metagenome

Page 38: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

AdaptingHipMertometagenomes

persistent

contiggeneration

alignment scaffolding

B

Datasizes

gapclosing

B

E

G

E

k-meranalysis Increasek

anditerate

A10TB

1TB

100GB

A

CCC

D

C

A

C

D

F

D

Inputs:(A) Shortreadsand(F)longinsert(matepaired)librariesIntermediate:(B)Error-freek-mers w/extensions(a.k.a.UFX),(C)contigs…Output:(G)finalgenomescaffolds

Secretmetagenomesauce#1

Page 39: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

DEGASBerkeleymeeting

k-mer analysis

De Bruijn graph traversal

Bubble merging and hair removal

Iterative graph pruning

Local Assembly

Iterate for larger k

bubble

hair

Iterativecontig generation

Page 40: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

DEGASBerkeleymeeting

k-mer analysis

De Bruijn graph traversal

Bubble merging and hair removal

Iterative graph pruning

Local Assembly

Iterate for larger k

bubble

hair

Iterativecontig generation

Page 41: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

k-mer analysis

De Bruijn graph traversal

Bubble merging and hair removal

Iterative graph pruning

Local Assembly

Iterate for larger k

bubble

hair

Iterativecontig generation

Page 42: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

k-mer analysis

De Bruijn graph traversal

Bubble merging and hair removal

Iterative graph pruning

Local Assembly

Iterate for larger k

bubble

hair

Iterativecontig generation

Page 43: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

k-mer analysis

De Bruijn graph traversal

Bubble merging and hair removal

Iterative graph pruning

Local Assembly

Iterate for larger k

bubble

hair

Iterativecontig generation

Page 44: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

k-mer analysis

De Bruijn graph traversal

Bubble merging and hair removal

Iterative graph pruning

Local Assembly

Iterate for larger k

bubble

hair

Iterativecontig generation

Page 45: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

k-mer analysis

De Bruijn graph traversal

Bubble merging and hair removal

Iterative graph pruning

Local Assembly

Iterate for larger k

bubble

hair

Iterativecontig generation

Page 46: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

HipMeronaMOCKcommunity

– Mixof25bacteriawithdifferentabundancies (JGIdataset)– Datasetsize~100GBytes

Statistics metaHipMer metaSPAdes

#contigs 2,670 5,100

Totallength> 10kbp 92,245,198 92,152,728

Totallength>50kbp 69,493,125 77,292,121

Misassemblies 58 134

Mismatchesper100kbp 3.48 77.05

Genomesfraction(%) 92.10 91.10

Page 47: Parallel de novo Assembly of Complex (Meta) Genomes via …aydin/HiCOMB.pdfParallel de novo Assembly of Complex (Meta) Genomes via HipMer ... EvangelosGeorganas, Steven Hofmeyr, Lenny

Summary

• HipMer’s corealgorithmsscaletotensofthousandsofcoresandyieldperformanceimprovementsfromdays/weeksdowntominutes.

• HipMerbreaksthehardwarelimitationsbyenablingdistributed-memoryscaling

• Useofdenovoassemblyintimesensitiveapplicationslikeprecisionmedicineisnomoreformidable!

• SourcereleaseofHipMer:https://sourceforge.net/projects/hipmer/

• Ongoingwork:AdaptHipMerformetagenomicanalysis(somehintsandpreliminaryresultsgiveninthistalk).