![Page 1: Computational Biology, Part 6 Sequence File Formats and Sequence Assembly Robert F. Murphy Copyright 1996-2001. All rights reserved](https://reader035.vdocuments.us/reader035/viewer/2022062308/56649ea35503460f94ba7f4d/html5/thumbnails/1.jpg)
Computational Biology, Part 6
Sequence File Formats and Sequence Assembly
Robert F. Murphy
Copyright 1996-2001. All rights reserved.
Sequence file formats
- Two characteristics of file formats
  - text or binary
  - minimal or annotated
- Text files use IUB codes and are readable by a word processor (e.g., SimpleText, Microsoft Word) or text editor (e.g., emacs)
- Binary files are usually readable only by the program that created them (e.g., MacVector)
- Annotated files preserve information known about the sequence (coding region start/stop, protein features, literature references, etc.)
Sequence file formats
- ASCII (text)
  - Minimal
    - Line, Plain Text
    - Staden
    - FASTA
    - Bionet (allows comments)
  - Annotated
    - GenBank
    - GCG
- Binary (usually annotated)
  - MacVector
Examples of ASCII sequence file formats
- Line (MacVector), Plain Text (AssemblyLIGN)
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
ATTGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAGGCAGAGGGTCACCGGTTTGG
ACTTCATTCCCGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACA
GATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCATGACCTGGAGAACCTGCGAGACCTC
CTCCATCTGCTGGCCTTCTCCAAGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGAGCC
TGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGGTGGCTCTGAGCAGGCTGCAGGGCTCTCT
GCAGGACATTCTTCAACAGTTGGACCTTAGCCCTGAATGCTGAGGTTTC
Examples of ASCII sequence file formats
- Fasta (Entrez)
>gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
ATTGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAGGCAGAGGGTCACCGGTTTGG
ACTTCATTCCCGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACA
GATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCATGACCTGGAGAACCTGCGAGACCTC
CTCCATCTGCTGGCCTTCTCCAAGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGAGCC
TGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGGTGGCTCTGAGCAGGCTGCAGGGCTCTCT
GCAGGACATTCTTCAACAGTTGGACCTTAGCCCTGAATGCTGAGGTTTC
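The FASTA layout above (one `>` header line followed by sequence lines) is simple enough to read with a few lines of code. The following is a minimal sketch, not the parser used by Entrez or MacVector; the function name `parse_fasta` is hypothetical, and the example record is shortened to the first 70 bases of the rat sequence.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened example: only the first 70 bases of the record shown on the slide.
example = """>gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAG
ACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
"""

header, sequence = parse_fasta(example)[0]
identifier, _, description = header.partition(" ")
print(identifier.split("|"))   # the pipe-separated identifier fields
print(description, len(sequence))
```

Splitting the header at the first space separates the identifier fields from the free-text description, which is the only annotation a minimal format like this carries.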
Examples of ASCII sequence file formats
- GenBank (Entrez, MacVector)
LOCUS       RATOBESE      539 bp ss-mRNA            ROD       23-SEP-1995
DEFINITION  Rat mRNA for obese.
ACCESSION   D49653
KEYWORDS    .
SOURCE      Rattus norvegicus (strain OLETF, LETO and Zucker, ) differentiated
            adipose cDNA to mRNA.
  ORGANISM  Rattus norvegicus
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Sarcopterygii; Mammalia; Eutheria; Rodentia;
            Sciurognathi; Myomorpha; Muridae; Murinae; Rattus.
REFERENCE   1  (bases 1 to 539)
  AUTHORS   Murakami,T. and Shima,K.
  TITLE     Cloning of rat obese cDNA and its expression in obese rats
  JOURNAL   Biochem. Biophys. Res. Commun. 209, 944-952 (1995)
  STANDARD  full automatic
COMMENT     Submitted (10-Mar-1995) to DDBJ by:
            Takashi Murakami
            Department of Laboratory Medicine
            School of Medicine
            University of Tokushima
            Kuramotocho 3-chome
            Tokushima 770
            Japan
            Phone: +81-886-33-7184
            Fax: +81-886-31-9495.
[continued]
Examples of ASCII sequence file formats
- GenBank [continued]
            NCBI gi: 995614
FEATURES             Location/Qualifiers
     source          1..539
                     /organism="Rattus norvegicus"
                     /strain="OLETF, LETO and Zucker"
                     /dev_stage="differentiated"
                     /sequenced_mol="cDNA to mRNA"
                     /tissue_type="adipose"
     CDS             30..533
                     /partial
                     /note="NCBI gi: 995615"
                     /codon_start=1
                     /product="obese"
                     /translation="MCWRPLCRFLWLWSYLSYVQAVPIHKVQDDTKTLIKTIVTRIND
                     ISHTQSVSARQRVTGLDFIPGLHPILSLSKMDQTLAVYQQILTSLPSQNVLQIAHDLE
                     NLRDLLHLLAFSKSCSLPQTRGLQKPESLDGVLEASLYSTEVVALSRLQGSLQDILQQ
                     LDLSPEC"
BASE COUNT      121 a    167 c    133 g    118 t
ORIGIN
        1 ccaagaagaa gaagacccca gcgaggaaaa tgtgctggag acccctgtgc cggttcctgt
       61 ggctttggtc ctatctgtcc tatgttcaag ctgtgcctat ccacaaagtc caggatgaca
      121 ccaaaaccct catcaagacc attgtcacca ggatcaatga catttcacac acgcagtcgg
      181 tatccgccag gcagagggtc accggtttgg acttcattcc cgggcttcac cccattctga
      241 gtttgtccaa gatggaccag accctggcag tctatcaaca gatcctcacc agcttgcctt
      301 cccaaaacgt gctgcagata gctcatgacc tggagaacct gcgagacctc ctccatctgc
      361 tggccttctc caagagctgc tccctgccgc agacccgtgg cctgcagaag ccagagagcc
      421 tggatggcgt cctggaagcc tcgctctact ccacagaggt ggtggctctg agcaggctgc
      481 agggctctct gcaggacatt cttcaacagt tggaccttag ccctgaatgc tgaggtttc
//
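An annotated format carries redundant information that can be checked against the sequence itself. As a rough sketch (the helper names `read_origin` and `read_base_count` are hypothetical, and the parsing assumes only the `BASE COUNT`, `ORIGIN` and `//` lines shown in the record above), the following recounts the bases in the ORIGIN block and compares them with the BASE COUNT line. A toy record is used so that the result is easy to verify by eye.

```python
from collections import Counter


def read_origin(record_text):
    """Extract the sequence between the ORIGIN line and the terminating '//'."""
    bases = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Drop the leading position number, keep the blocks of bases.
            bases.extend(line.split()[1:])
    return "".join(bases)


def read_base_count(record_text):
    """Parse a line such as 'BASE COUNT 121 a 167 c ...' into {'a': 121, ...}."""
    for line in record_text.splitlines():
        if line.startswith("BASE COUNT"):
            fields = line.split()[2:]
            return {base: int(n) for n, base in zip(fields[0::2], fields[1::2])}
    return {}


# Toy record (not the rat sequence) in the same layout as the slide.
record = """BASE COUNT        3 a      3 c      2 g      2 t
ORIGIN
        1 acgtacgtac
//
"""

sequence = read_origin(record)
print(sequence, dict(Counter(sequence)) == read_base_count(record))
```

Run on the full RATOBESE record, the same comparison would test whether the stated counts (121 a, 167 c, 133 g, 118 t) agree with the 539 bases listed.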
Examples of ASCII sequence file formats
- GCG (MacVector, GCG)
LOCUS       RATOBESE.G    539 BP  SS-RNA    ENTERED 09/23/95
DEFINITION  Rat mRNA for obese.
ACCESSION   -
KEYWORDS    -
SOURCE      Rattus norvegicus; Norway rat
  ORGANISM  Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata; Vertebrata;
            Sarcopterygii; Mammalia; Eutheria; Rodentia; Sciurognathi;
            Myomorpha; Muridae; Murinae; Rattus
REFERENCE   [1]
  AUTHORS   Murakami, T. & Shima, K.
  TITLE     Cloning of rat obese cDNA and its expression in obese rats.
  JOURNAL   Biochem. Biophys. Res. Commun., 209, 3, 944-952, (1995)
COMMENT     Database Reference:
            DDBJ RATOBESE
            Accession: D49653
            ------------
            Submitted (10-Mar-1995) to DDBJ by:
            Takashi Murakami
            Department of Laboratory Medicine
            School of Medicine
            University of Tokushima
            Kuramotocho 3-chome
            Tokushima 770
            Japan
            Phone: +81-886-33-7184
            Fax: +81-886-31-9495
[continued]
Examples of ASCII sequence file formats
- GCG [continued]
FEATURES    From  To/Span   Description
  pept        30     533    obese
  ????         1     539    source; /organism=Rattus norvegicus;
                            /strain=OLETF, LETO and Zucker;
                            /dev_stage=differentiated; /sequenced_mol=cDNA
                            to mRNA; /tissue_type=adipose
BASE COUNT  121 A  167 C  133 G  118 T  0 OTHER
ORIGIN      ?
 RATOBESE.G  Length: 539  Jan 30, 1996 - 05:32 PM  Check: 5797 ..
        1 CCAAGAAGAA GAAGACCCCA GCGAGGAAAA TGTGCTGGAG ACCCCTGTGC CGGTTCCTGT
       61 GGCTTTGGTC CTATCTGTCC TATGTTCAAG CTGTGCCTAT CCACAAAGTC CAGGATGACA
      121 CCAAAACCCT CATCAAGACC ATTGTCACCA GGATCAATGA CATTTCACAC ACGCAGTCGG
      181 TATCCGCCAG GCAGAGGGTC ACCGGTTTGG ACTTCATTCC CGGGCTTCAC CCCATTCTGA
      241 GTTTGTCCAA GATGGACCAG ACCCTGGCAG TCTATCAACA GATCCTCACC AGCTTGCCTT
      301 CCCAAAACGT GCTGCAGATA GCTCATGACC TGGAGAACCT GCGAGACCTC CTCCATCTGC
      361 TGGCCTTCTC CAAGAGCTGC TCCCTGCCGC AGACCCGTGG CCTGCAGAAG CCAGAGAGCC
      421 TGGATGGCGT CCTGGAAGCC TCGCTCTACT CCACAGAGGT GGTGGCTCTG AGCAGGCTGC
      481 AGGGCTCTCT GCAGGACATT CTTCAACAGT TGGACCTTAG CCCTGAATGC TGAGGTTTC
//
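The `Check:` field on the GCG header line is a checksum over the sequence. The slides do not define it; the sketch below follows the position-weighted sum commonly described for GCG files (weights cycling from 1 to 57, result taken modulo 10000), which should be treated as an assumption rather than the vendor's specification. The function name `gcg_checksum` is hypothetical.

```python
def gcg_checksum(sequence):
    """Position-weighted checksum, assumed here to match the GCG 'Check:' field.

    Each character's uppercase ASCII code is multiplied by a weight that
    cycles 1, 2, ..., 57, and the total is reduced modulo 10000.
    """
    check = 0
    for i, base in enumerate(sequence):
        check += ((i % 57) + 1) * ord(base.upper())
    return check % 10000


# Worked by hand: 1*65 (A) + 2*67 (C) + 3*71 (G) + 4*84 (T) = 748
print(gcg_checksum("ACGT"))
print(gcg_checksum("acgt") == gcg_checksum("ACGT"))  # case does not matter
```

A checksum like this lets a reading program detect a sequence that was truncated or edited after the header was written; the value for the full 539-base record is not recomputed here.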
Sequence file format tips
- When saving a sequence for use in an email message or pasting into a web page, use an unannotated text format such as FASTA
- When retrieving from a database or exchanging between programs, use an annotated text format such as GCG or GenBank
- When using a sequence again with the same program, use that program’s annotated binary format (or annotated text if binary is not available)
Sequence assembly
- Goal: Assemble pieces of sequence into a single, continuous sequence
- An early commercial system for sequence assembly was the GCG GelOverlap/GelAssemble suite (VMS, Unix)
- We will use AssemblyLIGN (Macintosh), a companion to MacVector
Sequence assembly: Terms
- project
  - collection of fragments, templates and contigs
- fragments
  - pieces of sequence entered by user or read from files
- contigs
  - lists of aligned fragments generated (normally) by program
Sequence assembly: Terms
- templates
  - any sequence to be searched for
    - can be entered by user
    - can be read from system files
  - most common example is sequence of vector
  - sequences in templates are NOT included in assembled sequences unless they are ALSO present in a fragment (and have not been removed)
Sequence assembly: File organization
- AssemblyLIGN keeps all information (including sequences) in a single project document
- GCG keeps all information in a directory (and its subdirectories), with each fragment in a separate file
Sequence assembly: Steps
- Data entry/import (fragments, templates)
- Removal of unwanted sequence
- Automated creation of contigs
- Manual editing/confirmation
- Export
Automated creation of contigs
- Steps
  1. Finding pairwise overlaps
  2. Resolving overlaps
  3. Improving alignment
  4. Final assembly and consensus generation
1. Finding pairwise overlaps
- Compare each fragment (and its complement) with each other fragment
- Generate list of regions of similarity that meet criteria below
- Parameters
  - minimum overlap length (default 20)
  - stringency (% of bases that must match, default 70)
  - minimum repeat length (default 30)
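The comparison step can be sketched as follows. This is a simplified illustration, not AssemblyLIGN's algorithm: it only looks for ungapped overlaps where the end of one fragment matches the start of another, tries both strands of the second fragment, and applies the minimum overlap length and stringency parameters. The minimum repeat length parameter is not modeled, and the names `best_overlap` and `find_overlaps` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base and reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]


def best_overlap(left, right, min_length=20, stringency=70.0):
    """Longest ungapped overlap of a suffix of `left` with a prefix of `right`.

    Returns (length, percent_identity), or None if no overlap of at least
    `min_length` bases reaches `stringency` percent matching bases.
    """
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        suffix, prefix = left[-length:], right[:length]
        matches = sum(a == b for a, b in zip(suffix, prefix))
        identity = 100.0 * matches / length
        if identity >= stringency:
            return length, identity
    return None


def find_overlaps(fragments, min_length=20, stringency=70.0):
    """Compare every fragment with every other fragment on both strands."""
    hits = []
    for name_a, seq_a in fragments.items():
        for name_b, seq_b in fragments.items():
            if name_a == name_b:
                continue
            for strand, candidate in (("+", seq_b), ("-", reverse_complement(seq_b))):
                hit = best_overlap(seq_a, candidate, min_length, stringency)
                if hit:
                    hits.append((name_a, name_b, strand, *hit))
    return hits


# Tiny fragments, so the parameters are lowered far below the defaults.
fragments = {"a": "AAAACGG", "b": "TTTCCGT"}
for hit in find_overlaps(fragments, min_length=4, stringency=100.0):
    print(hit)
```

In the toy data the two fragments overlap only when one of them is reverse-complemented, which is why each fragment's complement has to be included in the comparison.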
1. Finding pairwise overlaps
- Each fragment may appear in more than one overlap
[Figure: pairwise overlaps between fragments 1 and 8, 3 and 5, 3 and 6, 8 and 2, 7 and 9, 5 and 4]
2. Resolving overlaps
- Build larger pieces by combining overlaps
[Figure: animation over seven slides. Starting from the pairwise overlaps 1-8, 3-5, 3-6, 8-2, 7-9 and 5-4, overlaps 1-8 and 8-2 are combined into a piece containing fragments 1, 8 and 2; overlaps 3-5 and 3-6 are combined into a piece containing fragments 5, 3 and 6; overlap 5-4 then adds fragment 4 to that piece; overlap 7-9 remains on its own]
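Combining overlaps into larger pieces amounts to grouping fragments that are connected through a chain of pairwise overlaps. The sketch below does this with a small union-find structure; it is an illustration under assumptions, not the program's actual method. It takes the pairs in the figure to be 1-8, 3-5, 3-6, 8-2, 7-9 and 5-4, and it ignores the order and orientation of fragments within a piece. The function name `group_fragments` is hypothetical.

```python
def group_fragments(overlaps):
    """Group fragment ids that are linked by any chain of pairwise overlaps."""
    parent = {}

    def find(x):
        # Follow parent links to the representative, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in overlaps:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for fragment in list(parent):
        groups.setdefault(find(fragment), set()).add(fragment)
    return sorted(sorted(group) for group in groups.values())


# Pairs as read from the figure (an assumption about the garbled diagram).
pairs = [(1, 8), (3, 5), (3, 6), (8, 2), (7, 9), (5, 4)]
print(group_fragments(pairs))
```

A real assembler also has to decide where each fragment sits within its piece, which is what the alignment step that follows is for.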
3. Improve alignment
- Introduce gaps in sequences if they will improve overlaps
- alignment parameters
  - gap creation penalty (default 2.0)
  - gap extension penalty (default 0.1)
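The two penalties describe an affine gap cost: opening a gap is expensive, lengthening it is cheap. The slides do not say whether the extension penalty also applies to the first gapped position; the sketch below assumes it does (creation + extension × length), so the exact numbers are illustrative only. The function name `gap_penalty` is hypothetical.

```python
def gap_penalty(length, creation=2.0, extension=0.1):
    """Cost of one gap spanning `length` positions under an affine model.

    Assumption: the extension penalty is charged for every gapped position,
    including the first one.
    """
    if length <= 0:
        return 0.0
    return creation + extension * length


one_long_gap = gap_penalty(2)
two_short_gaps = gap_penalty(1) + gap_penalty(1)
print(one_long_gap, two_short_gaps)
print(one_long_gap < two_short_gaps)  # a single longer gap is cheaper
```

With the default values the creation penalty dominates, so an alignment that needs two inserted positions is scored better when they are adjacent than when they are placed in separate gaps.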
4. Final assembly and consensus generation
- Mark fragments that are now part of a contig (no longer appear by themselves)
- Form consensus for each contig by “reading” along aligned sequences and converting to IUB codes by consensus rules
- consensus parameter
  - base designation threshold (% of all bases at a given position that must be the same for that base to be assigned to the consensus; otherwise, less specific IUB code used; default 80%)
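A consensus call with a base designation threshold can be sketched as below. The slides say only that a less specific IUB code is used when no base reaches the threshold; the rule used here (add bases in order of decreasing frequency until their combined share reaches the threshold, then emit the IUB code for that set) is one plausible reading and is an assumption, as are the function names. Gap characters (`-`) are ignored, and the aligned sequences are assumed to have equal length.

```python
from collections import Counter

# IUB/IUPAC nucleotide codes keyed by the set of bases they stand for.
IUB = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus_base(column, threshold=80.0):
    """Call one consensus position from the bases aligned in that column."""
    counts = Counter(b.upper() for b in column if b.upper() in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return "N"
    chosen, covered = set(), 0
    # Most frequent base first; ties are broken alphabetically.
    for base, n in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        chosen.add(base)
        covered += n
        if 100.0 * covered / total >= threshold:
            break
    return IUB[frozenset(chosen)]


def consensus(aligned, threshold=80.0):
    """Read along equal-length aligned sequences, column by column."""
    return "".join(consensus_base(column, threshold) for column in zip(*aligned))


aligned = ["ACGTA", "ACGTA", "ACGTA", "ACTTA"]
print(consensus(aligned, threshold=80.0))  # G is 75% at position 3: ambiguity code
print(consensus(aligned, threshold=70.0))  # lower threshold: G is called
```

The second call illustrates the trade-off in the threshold: lowering it produces fewer ambiguity codes, at the cost of calling a base on weaker agreement.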
Manual consensus editing
- Crucial to verify alignment and resolve ambiguities (e.g., sequencing errors)
- Program keeps an “edit history” that tracks all changes made to the original sequences: essential to be able to “retrace your steps” from original sequencing gels (e.g., in case of conflicts with sequences in database)
AssemblyLIGN Tutorial
- Open “demo π” project
AssemblyLIGN Tutorial
- Goal: Eliminate vector sequence
- Double-click vector
- Select all fragments
- Drop on vector
AssemblyLIGN Tutorial
- “vector Alignments” window shows that frag8 contains vector sequence
- Click on ‘shadow’ to edit frag8 and display highlighted vector sequence
AssemblyLIGN Tutorial
- Highlighted sequence doesn’t look like sequence in “vector” window
AssemblyLIGN Tutorial
- Click on “vector” window
- Choose Select All (Edit Menu)
- Choose Reverse & Complement (Edit Menu)
- Now highlighted sequence in frag8 matches that in “vector” window
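The reason the highlighted region only looks like the vector after Reverse & Complement is that the fragment was read from the opposite strand. A minimal sketch of the operation follows; the sequences are made up for illustration (they are not the tutorial's vector or frag8), and ambiguity codes are not handled.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse, giving the opposite strand 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


# Hypothetical data: a short piece of vector and a fragment read from the other strand.
vector = "GGATCCTCTAGA"
fragment = "ACGT" + reverse_complement(vector) + "TTGCA"

print(vector in fragment)                      # not found as written
print(reverse_complement(vector) in fragment)  # found on the opposite strand
```

Applying the operation twice returns the original sequence, so no information is lost by flipping the vector for display.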
AssemblyLIGN Tutorial
- Click on “frag8” window
- Delete highlighted sequence
- Then close “frag8” window
AssemblyLIGN Tutorial
- Choose Select All (Edit Menu)
- Choose Assemble (Project Menu)
AssemblyLIGN Tutorial
- All but one fragment (frag14) combined into Untitled Contig 1
AssemblyLIGN Tutorial
- Goal: Try relaxing assembly parameters to merge frag14 into the contig
- Choose Assembly Options (Project Menu)
- Reduce minimum overlap length to 5
AssemblyLIGN Tutorial
- Now all fragments are merged
- Double-click Untitled Contig 2 to see alignment and consensus
AssemblyLIGN Tutorial
- Map shows gross alignments of fragments
- Click on Magnifying glass ‘A’ and select region of map to view
AssemblyLIGN Tutorial
- Positions that do not match at/above the Base Designation Threshold are highlighted in the consensus or the original sequences
- Can decrease the Base Designation Threshold to reduce ‘uncalled’ bases
Reading for Next Class
- Baxevanis & Ouellette, Chapter 7
- Sequence Analysis Primer, pp. 90-124
- “Similarity versus Homology” and “Dot Matrix Methods” (on web page)
- (03-510) Durbin et al., pp. 12-17
Summary, Part 6
- A variety of sequence file formats are currently in use. Files can be either text or binary, and can consist only of sequence or also include annotation information.
- The choice of file format is dictated by the requirements of the analysis desired and the subset of formats compatible between the “writing” and “reading” program.
- Sequence assembly requires the ability to compare sequences to find regions of high homology.
- Pieces of sequence are assembled by “connecting” them via regions of overlap.
- A consensus sequence can be generated from the connected pieces (using user-specified rules to resolve ambiguity).
Sequence comparisons using BLAST server Web Page
- Main BLAST web page URL
  - http://www.ncbi.nlm.nih.gov/BLAST/
- Links to Basic and Advanced Search Pages
- Two main BLAST programs
  - blastn - compares user nucleotide sequence to nucleotide sequences in database
  - blastp - compares user peptide sequence to peptide sequences in database