![Page 1: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/1.jpg)
Sequence information and file formats
An Introduction to Bioinformatics
![Page 2: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/2.jpg)
AIMS
OBJECTIVES
To understand the conventions regarding the presentation of DNA and protein sequence information
To understand the logic underlying these conventions
To become familiar with the commonly used sequence file formats
To become familiar with the READSEQ programme for the interconversion of file formats
Present a nucleotide or protein sequence according to accepted conventions
Recognize different sequence files formats
Interconvert files between formats
![Page 3: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/3.jpg)
Virtually all the information one deals with in computational molecular biology is either in the form of DNA or protein sequences
There are conventions applying to the presentation/storage of sequence information
The way in which sequence information is stored, retrieved and manipulated varies
There are different computer file types for sequence information
INTRODUCTION
![Page 4: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/4.jpg)
DNA
The DNA of living organisms is normally double stranded
It is the convention to show only one strand of the DNA
Which strand do you show?
Which way round do you show it?
![Page 5: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/5.jpg)
It is usually the case that either strand can be the template (or coding) strand at any particular point
Given that the two strands are anti-parallel, the genes on the two strands will face in opposite directions
The orientation of a DNA strand is determined by which end has a 5'-phosphate group and which has a 3'-hydroxyl group
![Page 6: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/6.jpg)
RNA polymerase in all organisms moves along the template strand of the DNA in the 3'-5' direction producing RNA that grows in the 5'-3' direction
The RNA sequence will be identical to that of the non-template strand, except for the presence of uracil instead of thymine
The convention is to show the non-template strand of the DNA because it resembles the RNA
3’ GGCATAGCAGGTACGTTATGCCAGCATTG 5’ template
5’ CCGTATCGTCCATGCAATACGGTCGTAAC 3’ non-template
5’ CCGUAUCGUCCAUGCAAUACGGUCGUAAC 3’ mRNA
![Page 7: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/7.jpg)
For purely cultural reasons the sequence is shown running from right to left on the page, with the 5' end of the sequence on the right
Sometimes when sequencing projects are in the draft state there are still ambiguities in the sequence
IUPAC have defined a standard table for the nucleotide ambiguity codes
R = A or G K = G or T S = G or C
Y = C or T M = A or C W = A or T
B = not A H = not G N = any
D = not C V = not T
![Page 8: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/8.jpg)
PROTEIN
Polypeptides have a polarity, with an N-terminal and C-terminal ends possessing a free amino group and carboxyl group respectively
A polypeptide is presented with its N-terminus on the left of and its C-terminus on the right.
To keep the polypeptide information in a form that can be conveniently handled by computers the amino acids are each given a single letter code
H2N-Methionine-Valine-Tyrosine-Glycine-Isoleucine-Lysine-COOH
![Page 9: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/9.jpg)
Myoglobin
![Page 10: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/10.jpg)
Glycine G Isoleucine I Cysteine C Tryptophan W Arginine R
Alanine A Phenylalanine F Threonine T Proline P Lysine K
Valine V Tyrosine Y Methionine M Aspartate D Histidine H
Leucine L Serine S Asparagine N Glutamate G Glutamine Q
H2N-Methionine-Valine-Tyrosine-Glycine-Isoleucine-Lysine-COOH
MVYGIK
![Page 11: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/11.jpg)
FILE TYPES
Many software packages have been developed for the analysis of DNA and protein sequences
A variety of different file formats have been developed to store/analyse DNA and protein sequence information
The various software packages will usually only accept a specific file format
The situation is made worse by the fact that different databases hold the information in different file formats
An essential skill is be able to recognize the different formatsand to be able to interconvert files between formats
![Page 12: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/12.jpg)
IG/Stanford Fitch Plain/Raw
GenBank/GB Fasta/Pearson PIR/CODATA
NBRF Zuker MSF
EMBL Olsen ASN 1.8
GCG Phylip 3.2 PAUP/NEXUS
DNAStrider Phylip Pretty
Common File Formats
![Page 13: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/13.jpg)
PLAIN SEQUENCE FORMAT
A sequence in plain format may contain only IUPAC characters and
spaces (no numbers!).
Note: A file in plain sequence format may only contain one sequence,
while most other formats accept several sequences in one file.
An example sequence in plain format is:
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCCTGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCT
![Page 14: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/14.jpg)
FASTA FORMAT
A sequence file in FASTA format can contain several sequences.
One sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line must begin with a greater-than (">") symbol in the first column.
An example sequence in FASTA format is: >U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCCTGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCTCTTGGTTCCGGC
![Page 15: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/15.jpg)
EMBL FORMAT
A sequence file in EMBL format can contain several sequences.
One sequence entry starts with an identifier line ("ID "), followed by further annotation lines. The start of the sequence is marked by a line starting with "SQ"
and the end of the sequence is marked by two slashes ("//").
An example sequence in EMBL format is:
ID AA03518 standard; DNA; FUN; 237 BP.XXAC U03518;XXDE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18SDE rRNA and 5.8S rRNA genes, partial sequence.XXSQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other; aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237//
![Page 16: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/16.jpg)
GENBANK FORMAT
A sequence file in GenBank format can contain several sequences.
One sequence in GenBank format starts with a line containing the word LOCUS and a number of annotation lines. The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").
An example sequence in GenBank format is:
LOCUS AAU03518 237 bp DNA PLN 04-FEB-1995DEFINITION Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S rRNA and 5.8S rRNA genes, partial sequence.ACCESSION U03518BASE COUNT 41 a 77 c 67 g 52 tORIGIN 1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc//
![Page 17: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/17.jpg)
![Page 18: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/18.jpg)
![Page 19: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/19.jpg)
![Page 20: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/20.jpg)
![Page 21: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/21.jpg)
![Page 22: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/22.jpg)
![Page 23: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/23.jpg)
![Page 24: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/24.jpg)
![Page 25: Sequence information and file formats An Introduction to Bioinformatics](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649e4f5503460f94b46a28/html5/thumbnails/25.jpg)