Next generation sequence data and de novo assembly for human genetics, by Jaap van der Heijden
Next generation sequence data and de novo assembly
For human genetics
By Jaap van der Heijden
De novo assembly
Overall idea
Overall idea
Repeats and non-random shearing
Scaffolding
• Multiple libraries
• Contigs are ordered and oriented by mate pairs -> scaffolding
4 types of assemblers
Greedy algorithms
Overlap-layout-consensus
Align-layout-consensus
BAC-by-BAC sequencing
Types of assemblers I
Greedy algorithms join the most similar reads first
Easily confused by repeats
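A minimal sketch of the greedy idea, assuming error-free reads and exact suffix/prefix overlaps. The function names `overlap` and `greedy_assemble` and the toy reads are illustrative, not taken from any of the assemblers discussed in this talk.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest overlap until none overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return seqs  # remaining sequences are the contigs
        merged = seqs[best_i] + seqs[best_j][best_n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)] + [merged]


# Toy reads (made up for illustration)
contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])
print(contigs)
```

Because the best overlap is always taken first, two reads that end in the same repeat can be joined even when they come from different places in the genome, which is the weakness named on the slide.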
Types of assemblers II
Overlap-layout-consensus assembler
Nodes represent the ends of reads
Lines (edges) represent similarity between reads (overlap)
The layout step removes redundant information
The consensus step builds the genome sequence
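The three steps can be sketched as follows. This is a simplification under stated assumptions: whole reads are used as nodes (rather than read ends), overlaps are exact, and the reduced graph is assumed to be a single linear path. All function names and the toy reads are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Longest suffix of a matching a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads: list[str], min_len: int = 3) -> dict:
    """Overlap step: edge (i, j) -> overlap length for each ordered pair of reads."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap(a, b, min_len)
                if n:
                    graph[(i, j)] = n
    return graph


def remove_transitive_edges(graph: dict) -> dict:
    """Layout step: drop edge i->k when edges i->j and j->k already imply it."""
    redundant = {
        (i, k)
        for (i, j) in graph
        for (j2, k) in graph
        if j == j2 and (i, k) in graph
    }
    return {edge: n for edge, n in graph.items() if edge not in redundant}


def consensus(reads: list[str], graph: dict) -> str:
    """Consensus step: walk the path from the read without a predecessor."""
    successors = {i: (j, n) for (i, j), n in graph.items()}
    targets = {j for (_, j) in graph}
    node = next(i for i in range(len(reads)) if i not in targets)
    seq = reads[node]
    while node in successors:
        node, n = successors[node]
        seq += reads[node][n:]
    return seq


reads = ["ATGGCGT", "GGCGTAC", "CGTACCT"]  # made-up reads
layout = remove_transitive_edges(overlap_graph(reads))
print(layout, consensus(reads, layout))
```

Real assemblers build the consensus from a multiple alignment of all reads covering a position; simple concatenation only works here because the toy reads contain no errors.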
Types of assemblers III
Align-layout-consensus: a process called comparative assembly.
The overlap stage of assembly is replaced by an alignment step.
The layout stage is also greatly simplified due to the additional constraints provided by the alignment to the reference.
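A toy illustration of comparative assembly, assuming gap-free alignment and a majority vote per reference position. The `align` function is a deliberately naive stand-in for a real aligner, and the reference and reads are made up.

```python
from collections import Counter


def align(read: str, reference: str) -> int:
    """Toy alignment step: leftmost position with the fewest mismatches (no gaps)."""
    best_pos, best_mismatches = 0, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(r != g for r, g in zip(read, window))
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos


def comparative_assembly(reads: list[str], reference: str) -> str:
    """Layout comes from the alignment; consensus is a majority vote per column."""
    columns = [Counter() for _ in reference]
    for read in reads:
        pos = align(read, reference)
        for offset, base in enumerate(read):
            columns[pos + offset][base] += 1
    # Positions without any read are reported as N
    return "".join(c.most_common(1)[0][0] if c else "N" for c in columns)


reference = "ATGGCGTACCTGA"
reads = ["ATGGCAT", "GCATACC", "TACCTGA"]  # sample differs from the reference at one base
print(comparative_assembly(reads, reference))
```

The reads never have to be compared with each other: each one is placed independently on the reference, which is why the layout stage becomes so much simpler.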
Types of assemblers IV
BAC-by-BAC sequencing
Genome is broken into fragments
Each BAC's location is determined in the lab
Minimum tiling path (whole genome is covered by at least one BAC)
BACs are sequenced
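Choosing a minimum tiling path can be sketched as a greedy interval cover, assuming each BAC has already been mapped to a start and end coordinate. The BAC names and coordinates below are hypothetical.

```python
def minimum_tiling_path(bacs: dict, genome_length: int) -> list:
    """Pick few BACs so that every position in [0, genome_length) is covered.

    bacs maps a BAC name to a half-open (start, end) interval.
    Greedy rule: among BACs covering the first uncovered position,
    take the one that reaches furthest.
    """
    path, covered = [], 0
    while covered < genome_length:
        candidates = [
            (end, name)
            for name, (start, end) in bacs.items()
            if start <= covered < end
        ]
        if not candidates:
            raise ValueError(f"gap in BAC coverage at position {covered}")
        end, name = max(candidates)
        path.append(name)
        covered = end
    return path


# Hypothetical BAC map (coordinates in kb)
bacs = {"A": (0, 150), "B": (100, 250), "C": (120, 300), "D": (240, 400), "E": (280, 400)}
print(minimum_tiling_path(bacs, 400))
```

Only the BACs on the path need to be sequenced, which is the point of determining their locations first.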
Lander-Waterman equation
“Rain drops” to cover a tile
8-10 fold coverage: 5 contigs for a 1 Mb genome
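The figure above can be reproduced approximately with the standard Lander-Waterman estimate: with N reads of length L on a genome of size G, coverage is c = N·L/G and the expected number of contigs is N·exp(-c·σ), where σ = 1 - T/L and T is the minimum detectable overlap. The read length of 500 nucleotides and T = 0 used below are assumptions for illustration; the slide does not state them.

```python
import math


def expected_contigs(genome_size: float, read_length: float,
                     coverage: float, min_overlap: float = 0.0) -> float:
    """Lander-Waterman expected number of contigs."""
    n_reads = coverage * genome_size / read_length
    sigma = 1.0 - min_overlap / read_length
    return n_reads * math.exp(-coverage * sigma)


for c in (8, 10):
    # Assumed: 1 Mb genome, 500 nt reads, no minimum overlap
    print(c, round(expected_contigs(1_000_000, 500, c), 2))
```

Under these assumptions, 8-fold coverage leaves roughly five contigs and 10-fold coverage leaves about one, which is the "rain drops" intuition: the last uncovered gaps take disproportionately many extra reads to close.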
Timeline
1975 Sanger sequencing
1990 First shotgun/EST assemblers (overlap-layout-consensus approach)
2000 Human shotgun assembly
2001 Mouse shotgun assembly
2005 454 (Roche) available
2006 Solexa available
2007 Short-read assemblers (de Bruijn graphs)
The complexity of sequence assembly
• Long reads
  – better identification
  – much slower
• Short reads
  – faster to align
  – more difficult with repeats
• Number of reads
• Length of reads
• Mismatches
• Algorithms can show quadratic or even exponential complexity
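To give a feeling for the quadratic case: a naive overlap step compares every read with every other read, so the work grows with the square of the number of reads. The helper below is only an illustration of that count, not a description of how any particular assembler works.

```python
def pairwise_comparisons(n_reads: int) -> int:
    """Number of unordered read pairs in an all-against-all comparison."""
    return n_reads * (n_reads - 1) // 2


for n in (1_000, 1_000_000):
    print(f"{n:>9,} reads -> {pairwise_comparisons(n):,} comparisons")
```

A thousand times more reads means about a million times more comparisons, which is why short-read assemblers avoid explicit all-against-all overlaps.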
| Name | Type | Technologies | Author |
|---|---|---|---|
| ABySS | genomes | Solexa, SOLiD | Simpson, J. et al. |
| AMOS | genomes | Sanger, 454 | Salzberg, S. et al. |
| Arachne WGA | (large) genomes | Sanger | Batzoglou, S. et al. |
| CAP3, PCAP | genomes | Sanger | Huang, X. et al. |
| Celera WGA Assembler / CABOG | (large) genomes | Sanger, 454, Solexa | Myers, G. et al.; Miller G. et al. |
| CLC Genomics Workbench | genomes | Sanger, 454, Solexa, SOLiD | CLC bio |
| CodonCode Aligner | BACs (, genomes?) | Sanger | CodonCode Corporation |
| Euler | genomes | Sanger, 454 (, Solexa?) | Pevzner, P. et al. |
| Euler-sr | genomes | 454, Solexa | Chaisson, MJ. et al. |
| MIRA, miraEST | genomes, ESTs | Sanger, 454, Solexa | Chevreux, B. |
| NextGENe | (small genomes?) | 454, Solexa, SOLiD | Softgenetics |
| Newbler | genomes | 454, Sanger | 454/Roche |
| Phrap | genomes | Sanger, 454 | Green, P. |
| TIGR Assembler | genomic | Sanger | - |
| Sequencher | (small) genomes | Sanger | Gene Codes Corporation |
| SeqMan NGen | (small) genomes | Sanger, 454, Solexa | DNASTAR |
| SHARCGS | (small) genomes | Solexa | Dohm et al. |
| SSAKE | (small) genomes | Solexa (SOLiD? Helicos?) | Warren, R. et al. |
| Staden gap4 package | BACs (, small genomes?) | Sanger | Staden et al. |
| VCAKE | (small) genomes | Solexa (SOLiD?, Helicos?) | Jeck, W. et al. |
| Phusion assembler | (large) genomes | Sanger | Mullikin JC, et al. |
| Quality Value Guided SRA (QSRA) | genomes | Sanger, Solexa | Bryant DW, et al. |
| Velvet (algorithm) | (small) genomes | Sanger, 454, Solexa, SOLiD | Zerbino, D. et al. |
3 NGS projects
Dragonfly
Medicinal maggots
EST comparison
Dragonfly (libelle)
Order Odonata
3000 species, 90 in Europe
Undergo a morphological change (metamorphosis)
Pilot study for an African dragonfly
Morphological change
Some migrate, others don't
Genetically divergent
Contain lots of introns in their genome
Project questions
What are the homologies with other species?
How big is the genome?
Are there already sequences in GenBank, and are they present in the data?
Dragonfly project data
Genomic, single end
1 x 1,147,762 reads
Trimmed to 34/51 nucleotides
39,023,908 nucleotides sequenced
cDNA, paired end
2 x 1,291,901 reads
Read length = 51
131,773,902 nucleotides sequenced
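The totals on this slide are read count times read length. A quick check, assuming every genomic read is counted at 34 nucleotides (the stated genomic total only matches that length) and that the cDNA total counts both reads of each pair:

```python
# Numbers taken from the slide; variable names are illustrative
genomic_reads = 1_147_762
genomic_read_length = 34   # assumption: trimmed length used for the total
cdna_pairs = 1_291_901
cdna_read_length = 51

genomic_total = genomic_reads * genomic_read_length
cdna_total = 2 * cdna_pairs * cdna_read_length

print(f"genomic: {genomic_total:,} nt")
print(f"cDNA:    {cdna_total:,} nt")
```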
Dragonfly methods
Assemble cDNA
BLAST the resulting contigs to determine homologies
Align genomic DNA to contigs
Calculate genome size
Dragonfly assembly results
Total contigs: 3898
Average length of contigs: 176
Average coverage of contigs: 24
Contigs larger than 300 nucleotides: 800
Average length of contigs larger than 300: 508
Average coverage of contigs larger than 300: 15
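Summary figures of this kind can be derived from a list of contig lengths and coverages. The sketch below assumes each contig is available as a (length, coverage) pair; the function name and the example values are made up and are not the dragonfly data.

```python
def summarize(contigs: list, min_length: int = 300) -> dict:
    """Count, mean length and mean coverage for all contigs and for long contigs."""
    def stats(items):
        if not items:
            return {"count": 0, "mean_length": 0.0, "mean_coverage": 0.0}
        return {
            "count": len(items),
            "mean_length": sum(length for length, _ in items) / len(items),
            "mean_coverage": sum(cov for _, cov in items) / len(items),
        }

    long_contigs = [c for c in contigs if c[0] > min_length]
    return {"all": stats(contigs), "long": stats(long_contigs)}


# Made-up contigs: (length in nucleotides, average coverage)
example = [(100, 30.0), (200, 20.0), (600, 10.0)]
print(summarize(example))
```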
Dragonfly genes and homologies
Libellula pulchella
Enallagma aspersum
Erythromma najas
Ischnura verticalis
many Drosophila species
The criteria used in this analysis were an e-value of less than 1*10^-40 and a score of more than 200.
COII gene with accession number GQ256052.1 (partial)
COI gene with accession number GQ256032.1 (partial)
NDI gene with accession number GQ255994.1 (partial)
These genes were found in the cDNA contigs.
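The e-value and score cut-offs could be applied to BLAST results along these lines. This is a sketch that assumes the hits were saved in BLAST's 12-column tabular format (e-value in column 11, bit score in column 12) and that "score" means bit score; the slides do not say which output format or score was used. The example rows and identifiers are invented.

```python
import csv
import io


def filter_hits(tabular_text: str, max_evalue: float = 1e-40,
                min_score: float = 200.0) -> list:
    """Keep hits with e-value below max_evalue and bit score above min_score."""
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue < max_evalue and bitscore > min_score:
            hits.append((row[0], row[1], evalue, bitscore))
    return hits


# Invented rows: query, subject, %identity, length, mismatches, gap opens,
# query start/end, subject start/end, e-value, bit score
rows = [
    ["contig_1", "subject_A", "98.5", "400", "6", "0", "1", "400", "10", "409", "1e-60", "230"],
    ["contig_2", "subject_B", "85.0", "120", "18", "0", "1", "120", "5", "124", "1e-10", "80"],
    ["contig_3", "subject_C", "95.0", "300", "15", "0", "1", "300", "1", "300", "3e-45", "190"],
]
example = "\n".join("\t".join(r) for r in rows)
print(filter_hits(example))
```

Requiring both conditions removes short, high-identity matches (good e-value is unlikely, score is low) as well as long, weak ones.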
Dragonfly genome size
30 genomic genes selected after BLASTing
Size 300-1500
Alignment with Bowtie
“calculation”
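The slide does not spell out the calculation. One common approach, shown here purely as an assumption and not necessarily the method used in this project, is to estimate the average sequencing depth on the selected genes and divide the total number of sequenced genomic nucleotides by that depth. The per-gene numbers below are invented; only the genomic total comes from the data slide.

```python
def estimate_genome_size(total_bases: int, genes: list) -> float:
    """Genome size ~ total sequenced bases / mean depth on single-copy genes.

    genes is a list of (gene_length, aligned_bases) pairs, where aligned_bases
    is the number of read nucleotides that aligned to that gene.
    """
    total_length = sum(length for length, _ in genes)
    total_aligned = sum(aligned for _, aligned in genes)
    depth = total_aligned / total_length
    if depth == 0:
        raise ValueError("no reads aligned; depth is zero")
    return total_bases / depth


# Invented example: two genes with an overall depth of 0.05x
genes = [(1000, 50), (500, 25)]
size = estimate_genome_size(39_023_908, genes)
print(f"estimated genome size: {size:,.0f} nt")
```

The estimate is only as good as the assumption that the selected genes are single copy and sequenced without bias, which may explain the quotation marks around "calculation".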
Medicinal maggots
Used to treat non-healing wounds
Genes revealed:
Signaling proteins
Inhibitor of apoptosis protein 2
Digestive enzymes: lipases, proteinases
Antimicrobial peptides (AMPs): Lucilia defensin, diptericin
Medicinal maggots data
5 degenerate peptide sequences
36 peptides
cDNA: 8,199,983 reads
Read length: 32
2,623,994,560
Medicinal maggots question
Have we sequenced (pieces of) the genes corresponding to the peptides?
Medicinal maggots methods
Build local library of peptides
Assemble contigs with: CLCbio, NextGENe, Velvet
BLAST contigs against peptides
Find hits
Make coverage plot
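The data behind a coverage plot is a per-position depth profile. A minimal sketch, assuming each hit or aligned read is given as a 0-based, half-open (start, end) interval on the contig; the intervals below are invented.

```python
def coverage_profile(length: int, intervals: list) -> list:
    """Number of intervals covering each position of a sequence of the given length."""
    depth = [0] * length
    for start, end in intervals:
        # Clip intervals to the sequence boundaries
        for pos in range(max(start, 0), min(end, length)):
            depth[pos] += 1
    return depth


profile = coverage_profile(10, [(0, 4), (2, 6), (5, 10)])
print(profile)
```

Plotting `profile` against position with any plotting tool gives the coverage plot; dips to zero show the parts of a gene that were not sequenced.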
NextGENe assembly maggots
number of contigs = 59048
average length = 59
average coverage = 11
number of contigs >300 = 719
average length >300 = 661
average coverage >300 = 64
CLC assembly
number of contigs = 78
average length = 2282
average coverage = 514
Velvet assembly
total contigs: 586
length of contigs: 168
coverage of contigs: 55
contigs larger than 300 nucleotides: 62
length of contigs larger than 300: 779
coverage of contigs larger than 300: 63
Genes found in maggots
C. vicina mRNA for arylphorin subunit A4 (Velvet)
Drosophila willistoni GK21455 (Dwil\GK21455) mRNA (NextGENe)
Lucilia cuprina clone sbsp9 serine proteinase mRNA (NextGENe)
EST comparison
Traditional EST sequencing
Known library
Assemblers: CLCbio, NextGENe, Velvet
EST comparison method
Assemble cDNA and match with known ESTs
EST results
Conclusions
Big differences between assemblers in:
coverage
length
number of nodes
sequence
x performs best on EST test
Questions?