Genome sequence assembly concepts and methods, Shih-Jon Wang, May 13, 2008
![Page 1: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/1.jpg)
Genome sequence assembly
concepts and methods
Shih-Jon Wang
May 13, 2008
![Page 2: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/2.jpg)
OUTLINE
• Assembly Process Overview
• Assembly algorithms
• Repeats
• Scaffolding
• Phred/Phrap/Consed
• Assembly pipelines
![Page 3: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/3.jpg)
Assembly process overview
![Page 4: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/4.jpg)
A Genome Sequencing Project
![Page 5: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/5.jpg)
Building a Library
• Break DNA into random fragments (8-10x coverage)
![Page 6: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/6.jpg)
SHOTGUNs
• Whole Genome Shotgun
• BAC-by-BAC Shotgun
• Size of inserts:
-- BAC insert: ~150 kb
-- Fosmid insert: ~30 kb
-- Normal insert: ~3 kb
![Page 7: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/7.jpg)
Clone and scaffold
(a) Clone inserts are sequenced from both ends, yielding mated sequence reads. (b) A scaffold uses linking information provided by the clone-pairing data to order and orient contiguous sequences, or contigs, in the genome under assembly.
Computer 35 (7):47-54
![Page 8: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/8.jpg)
Building a Library
• Break DNA into random fragments (~10x coverage)
-- Amplify the fragments in a vector
-- Sequence 800-1000 bases at each end
![Page 9: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/9.jpg)
Assembling the fragments
![Page 10: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/10.jpg)
Assembling the fragments
• Break DNA into random fragments
• Sequence the ends of the fragments
• Assemble the sequenced ends
![Page 11: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/11.jpg)
Forward-reverse constraints
• The sequenced ends are facing towards each other
• The distance between the two sequenced ends is known approximately (from the insert size)
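The two forward-reverse constraints can be sketched as a simple check on a pair of mated reads placed in the same contig. This is a minimal illustration, not taken from any assembler: the `PlacedRead` type, the function name, and the tolerance of three standard deviations around the mean insert size are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PlacedRead:
    start: int     # leftmost coordinate of the read in the contig
    end: int       # rightmost coordinate (exclusive)
    forward: bool  # True if the read lies on the forward strand


def mates_consistent(a, b, insert_mean, insert_sd, n_sd=3.0):
    """Return True if two mated reads satisfy both forward-reverse constraints."""
    # Order the reads by position so that 'left' starts first.
    left, right = (a, b) if a.start <= b.start else (b, a)
    # Constraint 1: the reads face each other (left is forward, right is reverse).
    facing = left.forward and not right.forward
    # Constraint 2: the outer distance is close to the expected insert size.
    span = right.end - left.start
    return facing and abs(span - insert_mean) <= n_sd * insert_sd
```

For a library with ~3 kb inserts, a pair spanning 3,000 bases and facing inward passes the check, while a pair spanning 6,000 bases or facing the same way does not.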
![Page 12: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/12.jpg)
Building Scaffolds
![Page 13: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/13.jpg)
Assembly Gaps
-- sequencing gap: the order and orientation of the contigs are known, and at least one clone spans the gap
-- physical gap: no information about adjacent contigs, nor about the DNA spanning the gap
![Page 14: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/14.jpg)
Finishing the Project
![Page 15: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/15.jpg)
Unifying View of Assembly
![Page 16: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/16.jpg)
Assembly Algorithms
![Page 17: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/17.jpg)
Assembly Methods
• Overlap-layout-consensus
– greedy (Phrap, CAP3, TIGR...)
– graph-based (Euler)
![Page 18: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/18.jpg)
Phrap/CAP3
Greedy
• Build a rough map of fragment overlaps
• Pick the largest scoring overlap
• Merge the two fragments
• Repeat until no more merges can be done
!!! IDEAL CASE !!!
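The greedy loop can be sketched in a few lines for the ideal case only. This is a toy illustration, not how Phrap or CAP3 are implemented: it assumes error-free reads from a single strand, scores an overlap by its exact suffix-prefix length, and ignores quality values, reverse complements, and contained reads. The function names and the `min_len` parameter are made up for the example.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        # Build the overlap map by trying every ordered pair.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # no more merges can be done
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
```

With the three fragments `ACGTAC`, `TACGGA` and `GGATTT`, the loop merges twice and returns the single contig `ACGTACGGATTT`.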
![Page 19: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/19.jpg)
Real World Problems
• Sequencing errors
• Chimera
• Repeats
• Contaminants
• Polymorphism
• Orientation
![Page 20: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/20.jpg)
Error Correction
![Page 21: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/21.jpg)
Overlap between two sequences
![Page 22: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/22.jpg)
All pairs alignment
• Try all pairs – must consider ~n^2 pairs
• Smarter solution: only n x coverage (e.g. 8) pairs are possible
– Build a table of k-mers contained in sequences (single pass through the genome)
– Generate the pairs from k-mer table (single pass through k-mer table)
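The k-mer table idea can be sketched as follows. The function name, the dictionary input format, and the default `k` are assumptions for illustration; a real assembler would also handle reverse complements and, as noted in the repeat slides, ignore very frequent k-mers.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=12):
    """reads: dict mapping read name -> sequence. Returns pairs sharing a k-mer."""
    # Pass 1: one pass over the reads to record which reads contain each k-mer.
    table = defaultdict(set)
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].add(name)
    # Pass 2: one pass over the table to emit every pair of reads sharing a k-mer.
    pairs = set()
    for names in table.values():
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))
    return pairs
```

Only the pairs returned here need a full alignment, which avoids aligning all ~n^2 pairs.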
![Page 23: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/23.jpg)
Repeats
![Page 24: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/24.jpg)
Repeat sequence
The top represents the correct layout of three DNA sequences. The bottom shows a repeat collapsed in a misassembly.
Computer 35 (7):47-54
![Page 25: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/25.jpg)
Repetitive sequences
■ By repeat frequency
Interspersed repeats
– Short interspersed element (SINE), e.g. Alu, <300 bp
– Long interspersed element (LINE), ca. 5 kb
Tandem repeats
– Satellite DNA
– Minisatellites & variable number of tandem repeats
– Microsatellites: mono-, di-, tri-, tetra-nucleotide
■ By repeat orientation
– Direct repeats
– Inverted repeats
![Page 26: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/26.jpg)
Repeat detection
Pre-assembly: find fragments that belong to repeats
• statistically (Reps)
• repeat database (RepeatMasker)
![Page 27: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/27.jpg)
Statistical repeat detection
• Significant deviations from average coverage are flagged as repeats.
- frequent k-mers are ignored
- "arrival" rate of reads in contigs is compared with the theoretical value (e.g., 800 bp reads & 8x coverage: reads "arrive" every 100 bp)
• Problem 1: the assumption of a uniform distribution of fragments leads to false positives
- non-random libraries
- poor clonability regions
• Problem 2: repeats with low copy number are missed
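The arrival-rate arithmetic and the flagging step can be sketched as below. This is an illustration under stated assumptions: the function names are invented, contigs are summarised only by length and read count, and the cutoff of twice the expected read density is an arbitrary choice rather than the statistic used by any particular tool.

```python
def expected_spacing(read_len, coverage):
    """Expected distance in bp between consecutive read starts."""
    return read_len / coverage


def flag_repeat_contigs(contigs, read_len, coverage, factor=2.0):
    """contigs: dict name -> (length_bp, n_reads). Returns names with excess reads."""
    expected_density = coverage / read_len  # expected read starts per bp
    flagged = []
    for name, (length, n_reads) in contigs.items():
        # A contig whose reads arrive much faster than expected may be a collapsed repeat.
        if n_reads / length > factor * expected_density:
            flagged.append(name)
    return flagged
```

With 800 bp reads at 8x coverage, `expected_spacing` gives 100 bp, matching the slide. A 10 kb contig holding 310 reads instead of the expected ~100 would be flagged. A two-copy repeat sits right at the cutoff, which illustrates Problem 2.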
![Page 28: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/28.jpg)
Scaffolding
![Page 29: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/29.jpg)
Sequencing hierarchy
• Random sequencing – unrelated reads, ~700 base pairs
• Assembly – unrelated contigs, 5K-10K base pairs
• Scaffolding – unrelated scaffolds, 30K-50K base pairs
• Finishing/gap closure – completed genomes, millions to billions of base pairs
![Page 30: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/30.jpg)
Definition
![Page 31: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/31.jpg)
Scaffolder output
• order and orientation of contigs
• size of gaps between contigs
• linking evidence: mate-pairs spanning gaps
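One way a gap size could be estimated from a single spanning mate-pair is sketched below. It assumes that contig A lies to the left of contig B, that both are in forward orientation, and that the insert size is known exactly. Real scaffolders combine many mate-pairs and account for insert size variation. The function and parameter names are hypothetical.

```python
def estimate_gap(insert_size, contig_a_len, read_a_start, read_b_end):
    """Estimate the gap between contig A (left) and contig B (right).

    read_a_start: start coordinate of the forward read on contig A
    read_b_end:   end coordinate (exclusive) of the reverse read on contig B
    """
    covered_on_a = contig_a_len - read_a_start  # insert bases lying on contig A
    covered_on_b = read_b_end                   # insert bases lying on contig B
    return insert_size - covered_on_a - covered_on_b
```

For a 3,000 bp insert whose forward read starts 1,000 bp before the end of contig A and whose reverse read ends 800 bp into contig B, the estimated gap is 1,200 bp. A negative result would suggest that the contigs overlap.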
![Page 32: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/32.jpg)
Clone-mates
![Page 33: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/33.jpg)
Linking information
![Page 34: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/34.jpg)
Hierarchical scaffolding
![Page 35: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/35.jpg)
Ambiguous scaffold
![Page 36: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/36.jpg)
Phred/Phrap/Consed Analysis
![Page 37: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/37.jpg)
What is Phred/Phrap/Consed ?
Phred/Phrap/Consed is a worldwide distributed package for:
a. Trace file (chromatogram) reading;
b. Quality (confidence) assignment to each individual base;
c. Vector & repeat sequence identification and masking;
d. Sequence assembly;
e. Assembly visualization and editing;
f. Automatic finishing.
![Page 38: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/38.jpg)
How to deal with the enormous amount of reads generated by
the high throughput DNA sequencers?
![Page 39: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/39.jpg)
Phred Genome Research 8: 175-194
![Page 40: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/40.jpg)
Phred
Phred is a program that performs several tasks:
a. Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI, ESD (MegaBACE) and LI-COR.
b. Calls bases – attributes a base to each identified peak with a lower error rate than the standard base calling programs.
![Page 41: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/41.jpg)
Phred
c. Assigns quality values to the bases – a “Phred value” based on an error rate estimation calculated for each individual base.
d. Creates output files – base calls and quality values are written to output files.
![Page 42: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/42.jpg)
File Directories
• chromat_dir/
• edit_dir/
• phd_dir/
![Page 43: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/43.jpg)
Trace File High quality region – no ambiguities (Ns)
![Page 44: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/44.jpg)
Trace File Medium quality region – some ambiguities (Ns)
![Page 45: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/45.jpg)
Trace File Poor quality region – low confidence
![Page 46: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/46.jpg)
Phred value formula
$q = -10 \times \log_{10}(p)$
where
q - quality value
p - estimated error probability for a base call
Examples:
q = 20 means $p = 10^{-2}$ (1 error in 100 bases)
q = 40 means $p = 10^{-4}$ (1 error in 10,000 bases)
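The formula and its inverse translate directly into code. This is a small sketch with illustrative function names, not part of Phred itself.

```python
import math


def error_prob_to_phred(p):
    """Quality value q for an estimated base-call error probability p."""
    return -10 * math.log10(p)


def phred_to_error_prob(q):
    """Estimated error probability p for a quality value q."""
    return 10 ** (-q / 10)
```

Converting p = 0.01 gives q = 20, and converting q = 40 gives p = 0.0001, as in the two examples on the slide.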
![Page 47: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/47.jpg)
Base Calling
• phred -id . -p -pd ../phd_dir
• phred -view pf84c05.s1
![Page 48: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/48.jpg)
The structure of a phd file
BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: 0.990722.g
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Thu May 24 00:18:58 2001
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
...
t 24 2221
a 24 2232
a 22 2245
a 27 2261
g 25 2272
c 19 2286
c 12 2302
t 19 2314
g 12 2324
g 15 2331
g 19 2346
g 23 2363
t 33 2378
g 36 2390
c 44 2404
c 44 2419
t 39 2433
a 39 2446
a 34 2460
t 35 2470
g 34 2482
...
t 16 8191
g 19 8200
t 13 8211
c 13 8229
g 4 8241
n 4 8253
c 4 8263
t 10 8276
t 9 8286
c 12 8301
t 16 8313
c 12 8329
c 12 8336
c 15 8343
t 19 8356
c 9 8371
g 13 8386
g 14 8397
a 7 8417
g 9 8427
g 4 8445
...
t 6 11908
a 6 11921
g 6 11927
t 6 11947
c 6 11953
a 6 11964
g 6 11981
c 4 11994
n 4 12015
c 4 12037
n 4 12044
n 4 12058
n 4 12071
n 4 12085
n 4 12098
n 4 12111
n 4 12124
c 4 12144
n 4 12151
END_DNA
END_SEQUENCE
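A phd file laid out like the one on this slide (a comment block of `KEY: value` lines, then one line per base holding the base call, its quality value and its peak position in the trace) can be read with a short parser. This is a minimal sketch that assumes exactly that layout. The function name and return shape are choices made for the example, and any fields or sections beyond those shown on the slide are not handled.

```python
def parse_phd(text):
    """Parse phd-formatted text into (name, comments, calls).

    calls is a list of (base, quality, trace_position) tuples.
    """
    name = None
    comments = {}
    calls = []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("BEGIN_SEQUENCE"):
            parts = line.split(None, 1)
            name = parts[1] if len(parts) > 1 else None
        elif line in ("BEGIN_COMMENT", "BEGIN_DNA"):
            section = line
        elif line in ("END_COMMENT", "END_DNA"):
            section = None
        elif line == "END_SEQUENCE":
            break
        elif section == "BEGIN_COMMENT":
            # Split on the first colon only, so timestamps keep their colons.
            key, _, value = line.partition(":")
            comments[key.strip()] = value.strip()
        elif section == "BEGIN_DNA":
            base, qual, pos = line.split()[:3]
            calls.append((base, int(qual), int(pos)))
    return name, comments, calls
```

The parsed quality values make the pattern on the slide easy to see: low values at both ends of the read and values in the 30s and 40s in the middle.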
![Page 49: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/49.jpg)
phd2fasta
• phd2fasta program
– converts .phd files to sequence in multifasta format
– writes .qual file (quality scores) for each trace file
– phd2fasta -id ../phd_dir -os CLONE.fasta -oq CLONE.fasta.qual
• Output:
– fasta.seq contains fasta sequences
– fasta.seq.qual contains quality scores
![Page 50: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/50.jpg)
Vector Sequence Cleaning (1)
• DNA sequence cleaning: quality trimming and vector removal with Lucy
• Lucy steps:
– Read input seq#, seq info, and quality info
– Chop off splice sites
– Remove vector insert
– Produce output seq for fragment assembly
![Page 51: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/51.jpg)
Vector Sequence Cleaning (2)
• Restriction on file name: it can't contain any symbol, e.g. "–", ".", "_"
• Lucy major parameters to set up:
-vector vector_completeSeq splice_site_file
(splice_site_file: 2 splice-site sequences, before and after the insertion point on the vector)
• Lucy output:
– identified locations of the good/clean region
– trimmed sequence without vector, linker, or Ns (<3% Ns)
![Page 52: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/52.jpg)
splice_site_file
• ~150 bases, 50 bases overlap around the splice site
>PUCsplice.for.begin
gattaagttgggtaacgccagggttttcccagtcacgacgttgtaaaacgacggccagtgccaagcttgcatgcctgcaggtcgactctagaggatccccgggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtga
>PUCsplice.for.end
acggccagtgccaagcttgcatgcctgcaggtcgactctagaggatccccgggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtgaaattgttatccgctcacaattccacacaacatacgagccggaagcataaa
>PUCsplice.rev.begin
tttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatttcacacaggaaacagctatgaccatgattacgaattcgagctcggtacccggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt
>PUCsplice.rev.end
tcacacaggaaacagctatgaccatgattacgaattcgagctcggtacccggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgtcgttttacaacgtcgtgactgggaaaaccctggcgttacccaacttaatc
![Page 53: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/53.jpg)
Cross_match
![Page 54: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/54.jpg)
Cross_match
• cross_match -minmatch 12 -penalty -2 -minscore 20 -screen CLONE.fasta /net/share/sequence_pipeline/vector.fasta
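With `-screen`, cross_match writes a `.screen` file in which the vector-matching parts of each read are replaced by `X`. The masking step alone can be sketched as below. The match intervals are passed in as hypothetical input here, because finding them is the alignment work that cross_match itself does and this sketch does not attempt it.

```python
def mask_intervals(seq, intervals, mask_char="X"):
    """Replace each (start, end) interval of seq with mask_char.

    Intervals are 0-based with an exclusive end; out-of-range parts are ignored.
    """
    chars = list(seq)
    for start, end in intervals:
        for i in range(max(0, start), min(len(chars), end)):
            chars[i] = mask_char
    return "".join(chars)
```

Masking rather than deleting the vector bases keeps read coordinates unchanged, so positions still line up with the quality file.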
![Page 55: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/55.jpg)
Phrap
-- Phragment Assembly Program (or Phil's Revised Assembly Program)
• Phrap is a program for assembling shotgun DNA sequence data
• Key Features:
• a. Uses the entire read content – no need for trimming.
• b. User-supplied (e.g. Repbase) + internally computed data – better accuracy of assembly in the presence of repeats.
• c. Contig sequence is constituted by a mosaic of the highest quality parts of the reads – it's not a consensus!
![Page 56: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/56.jpg)
Phrap
-- Phrap is a program for assembling shotgun DNA sequence data
• d. Provides extensive information about the assembly – contained in phrap.out, *.ace and *.screen.contigs.qual files
• e. Handles very large datasets – hundreds of thousands of reads are easily manipulated.
• f. Generates output files – contain some important data and enable visualization by other programs
![Page 57: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/57.jpg)
Banded Search
![Page 58: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/58.jpg)
K-mers
• >GL1234.b1
gattaagttgggtaacgccagggttttcccagtcac…
gattaagttgggta
attaagttgggtaa
ttaagttgggtaac
taagttgggtaacg
...
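The windows listed on this slide are the successive 14-base k-mers of the read, each shifted by one position. A sliding-window helper reproduces them. The function name is illustrative, and the value k = 14 is read off the slide's examples.

```python
def kmers(seq, k):
    """Return every length-k window of seq, in order, stepping one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# The first four 14-mers of the read prefix shown on the slide.
first_four = kmers("gattaagttgggtaacgccagggttttcccagtcac", 14)[:4]
```

A read of length L yields L - k + 1 k-mers, so the 36-base prefix used here gives 23 windows.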
![Page 59: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/59.jpg)
Phrap output files
• *.contigs – fasta file containing the contigs
– Contigs with more than one read
– Singletons (single reads with a match to some other contig but that couldn't be merged consistently with it)
• *.singlets – fasta file of the singlet reads
– Reads with no match to any other read
• *.ace – allows for viewing the assembly using Consed
• *.view – required for viewing the assembly using Phrapview
![Page 60: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/60.jpg)
Phrap parameters
• phrap -new_ace CLONE.fasta.screen >outfile
Options (default; function):
• -penalty (default -2): ↑ => ↑ stringency
• -gap_init (default penalty-2)
• -gap_ext (default penalty-1)
• -minmatch* (default 14): ↑ => ↓ time, ↓ matches
• -bandwidth (default 14): ↓ => ↓ time, ↓ stringency
• -minscore (default 30): ↑ => ↑ stringency
* highly sensitive! Bigger genomes need a bigger value
![Page 61: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/61.jpg)
Phrap parameters
Options (default; function):
• -forcelevel (0~10): ↓ => ↑ stringency
• -repeat_stringency (default 0.95, 0<x<1): ↑ => ↑ stringency
• -force_high*: ↑ => ↑ stringency
• -revise_greedy**: ↓ misassembly
• -shatter_greedy**: ↓ contig length
* Ignore edited high-quality discrepancies
** Break assembly at weak joins
![Page 62: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/62.jpg)
Phrap parameters
Options (default; function):
• -max_subclone_size (default 5000): forward-reverse check
• -default_qual (default 15)
• -preassemble*
• -group_delim* (default _)
* used together
![Page 63: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/63.jpg)
Consed Genome Research 8: 195-202, 1998
![Page 64: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/64.jpg)
Consed
A program for viewing and editing assemblies produced by Phrap
Key Features:
a. Assembly viewer – allows for visualization of contigs, assembly (aligned reads), quality values of reads and final sequence.
b. Trace file viewer – single and multiple trace files can be visualized, allowing for comparison of a given sequence in several reads.
![Page 65: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/65.jpg)
Consed
A program for viewing and editing assemblies produced by Phrap
Key Features:
c. Navigation – identify and list regions which are below a given quality threshold, contain high quality discrepancies, single-strand coverage, etc.
d. Autofinish – automatic set of functions for: gap closure, improvement of sequence quality, determination of relative orientation of contigs, identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.
![Page 66: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/66.jpg)
![Page 67: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/67.jpg)
![Page 68: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/68.jpg)
![Page 69: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/69.jpg)
![Page 70: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/70.jpg)
![Page 71: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/71.jpg)
![Page 72: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/72.jpg)
Phred/Phrap/Consed Pipeline
Directories: Chromat_dir, Phd_dir, Edit_dir
• Input: chromatogram files
• Quality (confidence) values assignment: Phred
– phd files - *.phd
• Conversion - phd to fasta: phd2fasta.pl
– nucleotide sequences - seqs_fasta
– quality values - seqs_fasta.screen.qual
• Vector screening and masking: Cross_Match (local alignment program) x vector.seq
– screened/masked file - seqs_fasta.screen
• Assembly: Phrap
– assembled contigs - seqs_fasta.screen.contigs
– assembly file - seqs_fasta.screen.ace#
• Assembly viewing/editing: Consed
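The command-line steps shown on the earlier slides can be collected into one ordered list, as sketched below. The function only builds the argument lists and does not run anything. The flags are copied from the slides, while the function name, the `clone` prefix and the `vector_fasta` default are placeholders. Choosing the working directory for each step and redirecting Phrap's standard output to a file are left to the caller.

```python
def pipeline_commands(clone="CLONE", vector_fasta="vector.fasta"):
    """Return the pipeline steps as argument lists, in execution order."""
    fasta = f"{clone}.fasta"
    return [
        # 1. Base calling and quality assignment: chromatograms -> phd files
        ["phred", "-id", ".", "-p", "-pd", "../phd_dir"],
        # 2. Conversion: phd files -> fasta sequences plus quality file
        ["phd2fasta", "-id", "../phd_dir", "-os", fasta, "-oq", f"{fasta}.qual"],
        # 3. Vector screening and masking: writes <fasta>.screen
        ["cross_match", "-minmatch", "12", "-penalty", "-2", "-minscore", "20",
         "-screen", fasta, vector_fasta],
        # 4. Assembly of the screened reads; the .ace output is viewed in Consed
        ["phrap", "-new_ace", f"{fasta}.screen"],
    ]
```

Each list could be handed to `subprocess.run` in turn, stopping at the first failing step.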
![Page 73: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/73.jpg)
Comparison of shotgun sequence data from Wolbachia genome Project
Computer 35 (7):47-54
![Page 74: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/74.jpg)
CAP3 3XA 6189 57 443
PHRAP 3XA 6396 54 529
CAP3 3XB 12,368 44 71
PHRAP 3XB 13,116 47 228
CAP3 3XC 10,709 49 227
PHRAP 3XC 11,406 45 332
CAP3 3XD 11,408 43 115
PHRAP 3XD 11,350 49 240
CAP3 5XA 10,582 42 249
PHRAP 5XA 18,268 31 252
CAP3 5XB 26,034 17 100
PHRAP 5XB 33,693 18 115
CAP3 5XC 20,939 29 172
PHRAP 5XC 20,912 27 261
CAP3 5XD 14,219 35 46
PHRAP 5XD 14,696 33 129
![Page 75: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/75.jpg)
CAP3 8XA 71,025 12 83
PHRAP 8XA 71,395 8 80
CAP3 8XB 53,127 8 59
PHRAP 8XB 53,078 7 36
CAP3 8XC 52,134 8 4
PHRAP 8XC 76,922 6 6
CAP3 8XD 72,690 7 35
PHRAP 8XD 102,523 6 60
CAP3 10XA 91,380 4 28
PHRAP 10XA 91,329 3 11
CAP3 10XB 167,655 1 5
PHRAP 10XB 138,551 2 7
CAP3 10XC 106,631 5 44
PHRAP 10XC 77,747 4 12
CAP3 10XD 79,900 4 2
PHRAP 10XD 79,978 3 2
![Page 76: Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008](https://reader036.vdocuments.us/reader036/viewer/2022062422/5681369e550346895d9e42ff/html5/thumbnails/76.jpg)
Software
• CAP3 (for EST): http://genome.cs.mtu.edu/cap/cap3.html
• Phrap (for large genome): http://www.phrap.org
-- Similar algorithm
-- Insufficient documentation and support
-- Always have to write scripts to parse outputs
-- NO PERFECT PROGRAM!!!