Introduction to Bioinformatics - University of Ottawa
Molecular Biology-2019
INTRODUCTION TO BIOINFORMATICS
In this section, we provide a simple introduction to using the web site of the National
Center for Biotechnology Information (NCBI) to obtain sequence information.
Link to NCBI web site: http://www.ncbi.nlm.nih.gov/
GENERAL SEARCH
1. The first tool we will explore is the basic search engine. As with Google, you can enter
any combination of search terms, or the specific accession number of the sequence of
interest, in the search box. You can also specify which database to search from the
drop-down menu to the left of the search box.
2. Let us say we are interested in finding information about myosin, a muscle protein.
Enter the word myosin in the search box and then click on “Search”. A new page will be
displayed, as shown on the next page, showing the number of records found within the
different databases.
3. The databases most frequently used in this course are the nucleotide and the protein
databases. Click on the nucleotide database to obtain the following page:
4. To refine your search, you may then choose the species and molecule type from the menus
on the left, or the specific taxon from the top organisms on the menu on the right. For this
example, we will first choose mRNA from the menu for molecule type. Then from the new
window that is displayed, we will choose records specific to zebrafish (Danio rerio) from
the top taxa menu.
5. A list of records corresponding to your search criteria will then be displayed. From there,
you can then search and access the specific record of interest. Information that can be
obtained from these records is explained further on in this exercise.
6. For your assignment, use this approach to find the protein accession number for the
restriction enzyme BglII. Which organism does this protein come from?
7. Use the general search engine to obtain the record with the accession number M68489.
8. Once you’ve obtained the record, answer the following questions for your assignment.
Is this a nucleotide or a protein record?
From which organism was this sequence obtained?
What is the name of the gene corresponding to this sequence?
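The same kind of search can also be expressed as a URL for NCBI's E-utilities, the programmatic counterpart of the search box. The following is only a minimal sketch: it builds the address without contacting the server. The function name and its parameters are our own, "nuccore" is the E-utilities name of the nucleotide database, and the filter tag biomol_mrna[PROP] is assumed to correspond to the "mRNA" molecule type filter on the web page.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search service (ESearch).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, database="nuccore", organism=None,
                     mrna_only=False, max_records=20):
    """Return an ESearch URL for a search term with optional filters."""
    parts = [term]
    if organism:
        # Restrict the results to one organism, as the taxon menu does.
        parts.append(f'"{organism}"[Organism]')
    if mrna_only:
        # Assumed filter tag for the "mRNA" molecule type.
        parts.append("biomol_mrna[PROP]")
    query = {"db": database, "term": " AND ".join(parts), "retmax": max_records}
    return f"{ESEARCH}?{urlencode(query)}"


# The myosin example: mRNA records from zebrafish in the nucleotide database.
url = build_search_url("myosin", organism="Danio rerio", mrna_only=True)
print(url)
```

Pasting the printed address into a browser should return the matching record identifiers as XML rather than as a formatted web page.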
SEARCHING WITH A NUCLEOTIDE SEQUENCE
1. The most common search engine used with either nucleotide or protein sequences is the
Basic Local Alignment Search Tool (BLAST). You can access this search engine either
from the popular resources menu on the right, or through the “Resource List (A-Z)” menu
on the left.
2. “Resource List (A-Z)”: This page contains most of the links you will be using
throughout the year.
3. Let’s explore BLAST. Click on the BLAST link. You should obtain the following page.
BLAST is a set of similarity search engines designed to explore all of the available sequence
databases regardless of whether the query is protein or DNA.
“Nucleotide blast” compares a nucleotide sequence against a nucleotide sequence database.
“Protein blast” compares an amino acid query sequence against a protein sequence database.
“Blastx” compares a nucleotide query sequence translated in all reading frames against a
protein sequence database. You could use this option to find potential translation products of
an unknown nucleotide sequence.
“Tblastn” compares a protein query sequence against a nucleotide sequence database
dynamically translated in all reading frames.
“Tblastx” compares a translated nucleotide sequence against a nucleotide sequence database
dynamically translated in all reading frames.
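The five descriptions above amount to a small lookup table: the type of the query, the type of the database, and whether a translation step is involved determine the program. Here is a minimal sketch of that table. The function name and the three arguments are our own convention and are not part of BLAST itself.

```python
# (query type, database type, translated search?) -> BLAST program
PROGRAMS = {
    ("nucleotide", "nucleotide", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("nucleotide", "protein", True): "blastx",      # query is translated
    ("protein", "nucleotide", True): "tblastn",     # database is translated
    ("nucleotide", "nucleotide", True): "tblastx",  # both are translated
}


def choose_program(query_type, database_type, translated=False):
    """Return the BLAST program for a query/database combination."""
    key = (query_type, database_type, translated)
    if key not in PROGRAMS:
        raise ValueError(f"no BLAST program for combination {key}")
    return PROGRAMS[key]


# An unknown nucleotide sequence searched against protein records:
print(choose_program("nucleotide", "protein", translated=True))  # blastx
```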
We will first use this program to gain information on different sequences that you will be
working with. Note that one of these sequences represents the plasmid insert which you
must verify in lab exercise 2.
4. Click on the nucleotide BLAST (Blastn) option. You should obtain the following page:
5. Before we can enter a query sequence, we must make sure that its format is compatible
with the program. Most sequence analysis software can handle a format called FASTA. A
FASTA file is a plain text file in which the sequence, without any numbers or other
annotation, is preceded by a single descriptive line of text. Here is an example:
>John’s sequence123 (Press enter after this line)
AACGTCGGATTCAGGTACCCAGGAAAACTACATCTC
The first line of your file must begin with the symbol ">". This symbol informs
the program that this line of text is for descriptive purposes only and that the sequence
information starts on the next line. You can write anything on this line to identify the
sequence. The following lines contain the actual sequence.
6. Obtain the text document of unknown sequences available on the BIO3151 web page, by
following the link: Sequences>Unknown genes. This document contains five sequences
numbered 1-5. Convert each of these to FASTA format. You can do this in “Notepad”.
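If you prefer to let a short script do the conversion, the following sketch shows the idea: remove everything that is not a letter (position numbers, spaces, line breaks) and put a ">" description line in front. The function name and the line width of 70 are our own choices. The raw text below is simply the example sequence from step 5 with numbers and spaces added by hand.

```python
import re


def to_fasta(name, raw_sequence, width=70):
    """Format a raw sequence as FASTA: a '>' line, then the sequence lines."""
    # Keep letters only; this drops position numbers, spaces and line breaks.
    sequence = re.sub(r"[^A-Za-z]", "", raw_sequence).upper()
    if not sequence:
        raise ValueError("no sequence letters found")
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


raw = """
 1 aacgtcggat tcaggtaccc
21 aggaaaacta catctc
"""
record = to_fasta("John's sequence123", raw)
print(record)
```

The printed text can be pasted directly into the BLAST query box or saved as a text file.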
7. Copy and paste the first sequence into the nucleotide blast query box. Choose the database
on which the search will be performed in the “Choose Search Set” menu. Choose “other”
and "nucleotide collection (nr/nt)" from the drop down menu.
8. Now choose the program to do the search from the “Program Selection” menu. Choose:
“Somewhat similar sequences (blastn)”. Check the box "Show results in a new page"
to display the results in a new browser window.
9. Click on “BLAST”. A new page will appear asking you to wait for the completion of your
request. This may be fast or slow, depending on how heavy the demand on the NCBI server
is.
10. Once your request has been completed a new page will appear, as shown below, indicating
the results of your search.
11. Before analyzing the results, we will change the formatting options. Click on “Formatting
options” at the top of the page. A new menu will appear as shown below. Choose the
option “Old view” and then click on “Reformat”.
12. The potential matches to your sequence will now be presented in three formats.
A graphical format such as the following:
If you scroll down, a textual format such as this one:
And further down, the actual sequence alignments:
For this exercise, the format we are interested in is the list of different records representing
matches.
Amongst the information that can be obtained are the following values:
Query coverage: This value indicates what percentage of your input sequence (the original
query) aligns with the sequence record found. For instance, if the original query is 631
nucleotides long and BLAST can align all 631 nucleotides of this query against a hit, then
that would be 100% coverage. Remember, query coverage does not take into account the length
of the hit, only the percentage of the query that aligns with the hit.
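The arithmetic behind query coverage can be sketched in a few lines. This is only an illustration of the definition above, not BLAST's own code: the function name is ours, the aligned regions are given as hypothetical (start, end) positions on the query, and overlapping regions are counted once.

```python
def query_coverage(query_length, aligned_ranges):
    """Percentage of query positions covered by at least one aligned range.

    aligned_ranges holds (start, end) query positions, 1-based and inclusive.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(aligned_ranges):
        start = max(start, last_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / query_length


# All 631 nucleotides of the query align with the hit.
print(query_coverage(631, [(1, 631)]))

# Hypothetical partial hit: two overlapping regions covering positions 1-265.
print(round(query_coverage(631, [(1, 200), (150, 265)])))
```

Note that the length of the hit never appears in the calculation, only the length of the query.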
The Expect value (E) represents the number of matches (hits) with a score at least as good
as the one observed that you would expect to find by chance if you were to search a database
of random sequences of the same size. When E values are well below 1, they are approximately
equal to the probability of finding at least one such match by chance. This would mean that
if we have an “E value” of 0.01, then there is about a 1% chance that we would find an
equally good match in a database of random sequences. Often E values are very low.
In fact, if we have a perfect match to a long query, the “E value” might be given as zero.
Two additional factors have a strong influence on E values. These are the length of the
sequence and the size of the database. This is because it is easier to find a perfect match
to a shorter sequence by chance. It is also easier to find a chance match in a larger
database.
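To make these relationships concrete, here is a small sketch based on two standard formulas from BLAST statistics: E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the database length, and P = 1 - e^(-E) for the probability of at least one chance match. The function names and the numbers in the example are ours. BLAST itself uses corrected ("effective") lengths, so the values it reports will not match this simplified calculation exactly.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2**(-S'): expected number of chance hits scoring >= S'."""
    return query_length * database_length * 2.0 ** (-bit_score)


def probability_of_chance_hit(e_value):
    """P = 1 - exp(-E): probability of at least one chance hit."""
    return -math.expm1(-e_value)


# For a small E value, the probability is almost the same number.
print(probability_of_chance_hit(0.01))

# Hypothetical numbers: same score and query, database twice as large.
small = expect_value(50, 631, 1_000_000)
large = expect_value(50, 631, 2_000_000)
print(large / small)  # the E value doubles with the database size
```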
“Ident.”: BLAST calculates the percentage identity between the query and the hit in a
nucleotide-to-nucleotide alignment. How do you explain the fact that more than one sequence
possesses an identity of 100%?
Note that some of the records represent whole genome sequences, for example the first one
from this search. For this exercise you wish to obtain the sequence of the gene, not the
genome. Gene records are sometimes followed by the letter “G” (there may be other letters,
but ignore these). Notice in the above example that the record followed by a “G” shows 100%
identity but only 42% coverage. What does that mean?
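When thinking about these questions, it may help to see how a percentage identity is computed. The sketch below counts identical columns in an alignment and divides by the number of aligned columns. The function name and the short aligned strings are invented for illustration, with "-" standing for a gap, and BLAST's own implementation may differ in detail. Notice that only the aligned region enters the calculation.

```python
def percent_identity(aligned_query, aligned_hit):
    """Percentage of alignment columns in which both sequences are identical."""
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for q, h in zip(aligned_query, aligned_hit) if q == h and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


print(percent_identity("AACGTCGG", "AACGTCGG"))  # identical over 8 columns
print(percent_identity("AACGTCGG", "AACGACGG"))  # one mismatch in 8 columns
print(percent_identity("AACG-CGG", "AACGTCGG"))  # one gap in 8 columns
```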
13. Click on the accession number to view the record. You should obtain a record similar to
the one shown below:
[Figure: sample nucleotide record with callouts numbered 1 to 8 and the label “To convert to FASTA”]
14. Information that can be obtained from a nucleotide sequence record:
The definition (#1): Provides a brief description of the sequence; includes information
such as the source organism, the gene or protein name, or some description of the
sequence's function.
The accession number (#2): The unique identifier for a sequence record.
Organism (#3): The formal scientific name for the source organism (genus and
species).
Source (#4): Information including an abbreviated form of the organism name,
sometimes followed by a molecule type.
CDS (#5): Coding sequence; the region of nucleotides that corresponds to the sequence
of amino acids in a protein (the location includes the start and stop codons). By clicking
on this link you may obtain the nucleotide sequence from the start codon to the stop codon.
o Gene = (#6): The name of the gene.
o Product = (#7): The name of the gene’s protein product.
o Protein_id (#8): This is the protein’s accession number. By clicking on this
link, you can obtain the protein record.
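The fields listed above sit at fixed places in the plain text version of a record, so they can also be pulled out with a short script. The following is a rough sketch only. The record is invented for illustration (the accession XX000000, the gene exmA and the protein_id YYY00000.1 do not exist), the function and dictionary names are ours, and the patterns assume that each field fits on a single line, which is not always true of real records.

```python
import re

# Invented record for illustration only; it is not a real database entry.
SAMPLE_RECORD = """\
LOCUS       XX000000                 120 bp    mRNA    linear   VRT 01-JAN-2000
DEFINITION  Hypothetical example gene (exmA) mRNA, complete cds.
ACCESSION   XX000000
SOURCE      Danio rerio (zebrafish)
  ORGANISM  Danio rerio
FEATURES             Location/Qualifiers
     CDS             1..120
                     /gene="exmA"
                     /product="example protein"
                     /protein_id="YYY00000.1"
"""

# One pattern per field described in the list above.
FIELD_PATTERNS = {
    "definition": r"^DEFINITION\s+(.+)$",
    "accession": r"^ACCESSION\s+(\S+)",
    "source": r"^SOURCE\s+(.+)$",
    "organism": r"^\s+ORGANISM\s+(.+)$",
    "cds_location": r"^\s+CDS\s+(\S+)",
    "gene": r'/gene="([^"]+)"',
    "product": r'/product="([^"]+)"',
    "protein_id": r'/protein_id="([^"]+)"',
}


def extract_fields(record_text):
    """Return a dictionary of the fields found in a record (None if absent)."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, record_text, flags=re.MULTILINE)
        fields[name] = match.group(1).strip() if match else None
    return fields


for name, value in extract_fields(SAMPLE_RECORD).items():
    print(f"{name}: {value}")
```

For real work, a dedicated sequence-file parsing library would be more robust than hand-written patterns.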
15. In several of the future exercises, you will be required to obtain and save these sequences
in FASTA format. To change the format to FASTA, choose “FASTA” at the top of the
sequence record. You should be redirected to a page like the following one:
16. You can now select and copy the description line that is preceded by the symbol “>”,
together with the sequence, and paste them into a program such as “Notepad” if you wish
to save the sequence in this format.
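A record can also be requested directly in FASTA format through NCBI's E-utilities instead of copying it from the page. The sketch below only builds the address and does not contact the server. The function name is ours, "nuccore" is the E-utilities name of the nucleotide database, and XX000000 is a placeholder to be replaced by a real accession number.

```python
from urllib.parse import urlencode

# Base address of the E-utilities retrieval service (EFetch).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fasta_url(accession, database="nuccore"):
    """Return an EFetch URL that asks for one record as FASTA text."""
    query = {
        "db": database,
        "id": accession,
        "rettype": "fasta",  # return the record in FASTA format
        "retmode": "text",   # as plain text rather than XML
    }
    return f"{EFETCH}?{urlencode(query)}"


# XX000000 is a placeholder accession, not a real record.
fasta_url = build_fasta_url("XX000000")
print(fasta_url)
```

Opening the printed address in a browser with a real accession number should display the same “>” description line and sequence as the FASTA view of the record.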
17. For your assignment, obtain the following information for each of the unknown
sequences on this course’s web site (Sequences > Unknown genes):
Accession number
Coverage
Ident.
E value
The definition
The organism from which this sequence was obtained
The gene name
The gene’s product name
The protein’s accession number