problems of genome assembly james yorke and aleksey zimin university of maryland, college park 1

Problems of Genome Assembly

James Yorke and Aleksey Zimin
University of Maryland, College Park

WGS (whole-genome shotgun) sequencing

Multiple copies of DNA
Fragments of 200 - 200,000 bases

No information is retained on which part of the DNA the fragments came from.

WGS sequencing: fragments

The sequencing machine reads 100-1000 bases on the ends of the fragments, producing pairs of reads. The fragment sizes are known up to ± 10-20%.

[Figure: a pair of reads, CAAGCTGAT... and …GTTTGGAAC, at the two ends of a fragment, with unknown sequence between them]

The mathematical problem

We start with millions of pairs of reads, 100 - 1000 bases each

Multiple copies of DNA provide multiple coverage by reads

The problem of genome assembly is to recover the original sequence of bases of the genome (as much as possible…).

Assembling a jigsaw puzzle 1

The task of assembly becomes the task of assembling a giant jigsaw puzzle

We look for reads whose sequences suggest that they came from the same place in the genome:

AGTGATTAGATGATAGTAGA
        ||||||||||||
        GATGATAGTAGAGGATAGATTTA


Then we put “overlapping” reads together

Reads:

AGTGATTAGATGATAGTAGA
       AGATGATAGTAGAGATAGATAGACC
                     ATAGATAGACCACTCATCATAC

This yields a “contig”:

AGTGATTAGATGATAGTAGAGATAGATAGACCACTCATCATAC
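The overlap-and-merge step on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the reads are error-free, they are already given in left-to-right order, and the function names `overlap` and `merge_in_order` and the `min_len` parameter are made up for this example. Real assemblers must also discover the order and tolerate errors.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_in_order(reads, min_len=3):
    """Merge reads that are already ordered left to right into one contig."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap(contig, read, min_len)
        if k == 0:
            raise ValueError("read does not overlap the contig built so far")
        contig += read[k:]  # append only the part that extends the contig
    return contig


# The three reads shown on the slide (assumed order: left to right).
reads = [
    "AGTGATTAGATGATAGTAGA",
    "AGATGATAGTAGAGATAGATAGACC",
    "ATAGATAGACCACTCATCATAC",
]
contig = merge_in_order(reads)
print(contig)  # AGTGATTAGATGATAGTAGAGATAGATAGACCACTCATCATAC
```

The first two reads share 13 bases and the last two share 11, so the 43-base contig from the slide is recovered.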


We use read pairing information to order and orient contigs to produce scaffolds, the final product of assembly

[Figure: two contigs linked by pairs of reads belonging to the same fragment of DNA]

Difficulties in assembly

Sequencing errors: two reads that came from the same place in the genome often have mismatching sequences

AGTGATTAGATCATAGTAGAG
         || |||||||||
         ATGATAGTAGAGGATAGAT

Repetitive DNA (~ 5-20% of human DNA is repetitive):

TTAGGGTTAGGGTTAGGGTTAGGGTTAGGG
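One way to cope with the sequencing errors described above is to accept overlaps that contain a small number of mismatches. The sketch below is a hypothetical illustration, not the method of any particular assembler: the function name and the `min_len` and `max_mismatches` parameters are assumptions, and only substitution errors are handled (no insertions or deletions).

```python
def overlap_with_mismatches(a, b, min_len=8, max_mismatches=1):
    """Longest suffix of a matching a prefix of b with few substitutions.

    Returns (overlap_length, mismatches), or (0, 0) if none qualifies.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-k:], b[:k]))
        if mismatches <= max_mismatches:
            return k, mismatches
    return 0, 0


# The two reads from the slide: they agree on 11 of 12 overlapping bases.
read_a = "AGTGATTAGATCATAGTAGAG"
read_b = "ATGATAGTAGAGGATAGAT"
print(overlap_with_mismatches(read_a, read_b))                    # (12, 1)
print(overlap_with_mismatches(read_a, read_b, max_mismatches=0))  # (0, 0)
```

Allowing mismatches recovers the true overlap, but it also makes it easier to join reads from different copies of a repeat, which is why the two difficulties on this slide interact.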

Repeat regions may cause omissions

A R B R C   may be assembled as   A R C

Erroneous duplications

Two recently published assemblies of the cow genome: UMD2 and BosTau4

Segmental duplications were a central theme in the BosTau4 genome paper

The UMD2 assembly had many fewer duplications

We examined the duplications with > 99.5% identity and > 5000 bp that have one copy in the UMD2 assembly and two copies in BosTau4

Each base in the genome is covered by 6 reads, on average. A way to judge which assembly is correct is to compute the average read coverage for these regions.

Examining read coverage reveals errors

The thick solid vertical line is placed at the coverage at which it is as likely to have two copies as it is to have one.
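The position of such a line can be estimated with a simple model. The sketch below assumes, purely for illustration, that the read coverage of a region is Poisson distributed with mean 6 if the genome really has one copy and mean 12 if two true copies were collapsed into one, and that both cases are equally likely beforehand. None of these modeling choices is stated on the slides. Setting the two Poisson probabilities equal gives a break-even coverage of mean / ln 2.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing exactly k reads under a Poisson model."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def break_even_coverage(per_copy_coverage):
    """Coverage at which one copy and two copies are equally likely.

    Solves mean**k * exp(-mean) == (2*mean)**k * exp(-2*mean) for k.
    """
    return per_copy_coverage / math.log(2)


threshold = break_even_coverage(6.0)
print(round(threshold, 2))  # 8.66

# Below the threshold one copy is more likely, above it two copies are.
for k in (8, 9):
    ratio = poisson_pmf(k, 12.0) / poisson_pmf(k, 6.0)
    print(k, "two copies more likely" if ratio > 1 else "one copy more likely")
```

Under this model a region with coverage near 6 supports a single copy, while coverage near 12 supports a true duplication.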

Next Gen vs. Sanger Sequencing

Sanger sequencing for a mammalian (~ 3 Gbp) genome
  Expensive: $50M for a mammalian genome
  Large amount of DNA required
  We get 700-1000 bp reads, all with mate pairs

Illumina and 454 sequencing for the same genome
  Inexpensive: as low as $25K (Illumina), or $1M (454) for a mammalian genome
  Small amount of DNA required (e.g. one insect)
  Only 100 or 400 bp reads, some with mate pairs

Assembly is a much harder problem now

Difficulties in de novo Assembly of Illumina and 454 data

Reads are short: high coverage is needed, imposing demanding requirements on the software and computer hardware

Error patterns in the reads:
  substitution errors in Illumina reads
  homopolymer errors (unable to tell AAAA from AAA) in 454 reads

Biased coverage by Illumina reads depending on the GC content

Unreliable mate pairs: [Figure: what a reported mate pair could actually be]

Assembly techniques have a much larger impact now

NGS Assemblers

New assemblers developed for different kinds of NGS data:
  Newbler for 454 data
  SOAPdenovo, Velvet, ABySS, ALLPATHS, and others for Illumina data

We use the open-source Celera Assembler (CA), currently supported by the J. Craig Venter Institute bioinformatics team

CA is capable of assembling mixed data sets

Assembly quality varies significantly with the software used

Example 1: Argentine ant assembly comparison. Both assemblies used the same 75 bp Illumina reads, unmated and in 3 kb and 8 kb mate pairs

                       SOAPdenovo   CA 5.4       Improvement
Sequence in assembly   137 Mbp      171 Mbp      25%
N50 scaffold size      139 bp       386,149 bp   3000 times
N50 contig size        139 bp       3,367 bp     24 times
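For readers unfamiliar with the N50 statistic used in these comparisons: it is the length L such that contigs (or scaffolds) of length at least L together contain at least half of the assembled sequence. A minimal sketch follows; the function name `n50` and the toy lengths are illustrative and are not taken from the assemblies above.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no lengths given")


# Toy example: total length is 54, and 10 + 9 + 8 = 27 reaches half of it.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

A larger N50 means the assembly is made of longer pieces, which is why it is reported alongside the total amount of sequence.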

Assembly quality varies significantly with the software used

Example 2: Pogonomyrmex barbatus, the Red Harvester Ant, assembly comparison (454 data). Both assemblies used the same 454 data in 3 kb mate pairs, 8 kb mate pairs and shotgun reads

                       Newbler      CA 5.3       Improvement
Sequence in assembly   194 Mbp      220 Mbp      13%
N50 scaffold size      47 Kbp       794 Kbp      17 times
N50 contig size        2 Kbp        12 Kbp       6 times

Post-assembly steps

Assemblers output scaffolds: ordered and oriented collections of contigs.

Scaffolds typically are much smaller than chromosomes and may contain large-scale errors.

Some mate pair linking information remains unused by assemblers.

Marker maps, i.e. collections of short sequences whose positions on the chromosomes are known, can be used to position the contigs on the chromosomes.

UMD Chromosome builder

Uses contigs, mate pairs and markers, discarding unreliable scaffold information

Mapping steps:
  Use mate pairs to orient contigs
  Use markers and mate pairs to assign oriented contigs to the chromosomes
  Compute the position of each contig on the chromosome as the best least-squares fit to the available mate pair and marker data
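The least-squares positioning step can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not the UMD Chromosome builder itself: each contig is reduced to a single coordinate, a marker observation says "contig i is at position p", a mate-pair observation says "contig j lies d bases to the right of contig i", and all observations are weighted equally. The function names and the data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("positions are not determined by the data")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - rest) / m[r][r]
    return x


def least_squares_positions(n, marker_obs, pair_obs):
    """Minimize the sum of squared residuals of all observations.

    marker_obs: list of (i, position), meaning x[i] ~ position
    pair_obs:   list of (i, j, distance), meaning x[j] - x[i] ~ distance
    """
    ata = [[0.0] * n for _ in range(n)]  # normal-equation matrix A^T A
    atb = [0.0] * n                      # right-hand side A^T b
    for i, position in marker_obs:
        ata[i][i] += 1.0
        atb[i] += position
    for i, j, distance in pair_obs:
        ata[i][i] += 1.0
        ata[j][j] += 1.0
        ata[i][j] -= 1.0
        ata[j][i] -= 1.0
        atb[i] -= distance
        atb[j] += distance
    return solve_linear(ata, atb)


# Hypothetical data: markers place contig 0 at 1000 and contig 2 at 9000,
# while mate pairs suggest gaps of 3000 and 5200 (200 bases too many).
positions = least_squares_positions(
    3, [(0, 1000.0), (2, 9000.0)], [(0, 1, 3000.0), (1, 2, 5200.0)]
)
print([round(p, 1) for p in positions])  # [950.0, 3900.0, 9050.0]
```

The 200-base disagreement between markers and mate pairs is spread evenly over the four observations, which is the behaviour one expects from an equally weighted least-squares fit.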

Computing contig orientations

An orientation problem:

[Figure: contigs A, B and C]



Matrix form (rows and columns are indexed by the contigs A, B, C):

$$
M = \begin{pmatrix} 0 & 1 & -1 \\ 1 & 0 & -1 \\ -1 & -1 & 0 \end{pmatrix}
$$

Compute y, the eigenvector corresponding to the largest eigenvalue of M. The signs of the eigenvector components provide a recipe for flipping the contigs to achieve consistent orientations


The eigenvector of M corresponding to the largest eigenvalue (the Frobenius-Perron eigenvalue), λ = 2, is y = (0.5774, 0.5774, -0.5774).

sign(y) = (1, 1, -1), that is, the solution is to flip contig C

Final matrix of orientations = diag(sign(y)) * M * diag(sign(y)):

$$
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -1 \end{pmatrix}
\begin{pmatrix} 0 & 1 & -1 \\ 1 & 0 & -1 \\ -1 & -1 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -1 \end{pmatrix}
=
\begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
$$

Flipping C is the correct solution!
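The eigenvector computation for this 3 × 3 example can be reproduced with a short power iteration. This is a sketch under stated assumptions: an entry of +1 is taken to mean that two contigs appear in the same relative orientation and -1 that they appear in opposite orientations, the matrix is shifted by a multiple of the identity so that the iteration converges to the largest (not merely the largest-magnitude) eigenvalue, and the function name `orient_contigs` is made up for the illustration. A production implementation would more likely call a linear-algebra library.

```python
import math


def orient_contigs(m, iterations=200):
    """Leading eigenvector of symmetric m and the signs of its components."""
    n = len(m)
    # Shifting by more than the largest absolute row sum makes every
    # eigenvalue of (m + shift*I) positive without changing eigenvectors.
    shift = max(sum(abs(v) for v in row) for row in m) + 1.0
    y = [1.0 + 0.1 * i for i in range(n)]  # arbitrary starting vector
    for _ in range(iterations):
        z = [sum(m[i][j] * y[j] for j in range(n)) + shift * y[i]
             for i in range(n)]
        norm = math.sqrt(sum(v * v for v in z))
        y = [v / norm for v in z]
    if y[0] < 0:  # fix the overall sign so the first contig is not flipped
        y = [-v for v in y]
    signs = [1 if v >= 0 else -1 for v in y]
    return y, signs


M = [[0, 1, -1],
     [1, 0, -1],
     [-1, -1, 0]]
y, signs = orient_contigs(M)
print([round(v, 4) for v in y])  # [0.5774, 0.5774, -0.5774]
print(signs)                     # [1, 1, -1]

# Apply the flips: entry (i, j) becomes signs[i] * M[i][j] * signs[j].
flipped = [[signs[i] * M[i][j] * signs[j] for j in range(3)] for i in range(3)]
print(flipped)  # [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
```

After flipping contig C every off-diagonal entry is +1, meaning all pairwise orientation constraints are satisfied.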

Conclusions

Genome assembly is a difficult problem that has gotten harder because of Next Gen Sequencing data

Assembly techniques have a large impact on the quality of the assembly

Output of the assembler is not the final assembly; extensive post-processing is required to produce chromosome sequences
