
Molecular Biology-2019


INTRODUCTION TO BIOINFORMATICS

In this section, we provide a simple introduction to using the web site of the National Center for Biotechnology Information (NCBI) to obtain sequence information.

Link to NCBI web site: http://www.ncbi.nlm.nih.gov/

GENERAL SEARCH

1. The first tool we will explore is the basic search engine. As with Google, you can enter any combination of search terms, or the specific accession number of the sequence of interest, in the search box. You can also specify which database to search from the drop-down menu to the left of the search box.

2. Let us say we are interested in finding information about myosin, a muscle protein. Enter the word myosin in the search box and then click on “Search”. A new page will be displayed, as shown on the next page, showing the number of records found within the different databases.


3. The databases most frequently used in this course are the nucleotide and the protein databases. Click on the nucleotide database to obtain the following page:

4. To refine your search, you may then choose the species and molecule type from the menus on the left, or a specific taxon from the top organisms menu on the right. For this example, we will first choose mRNA from the molecule type menu. Then, from the new window that is displayed, we will choose records specific to zebrafish (Danio rerio) from the top organisms menu.

5. A list of records corresponding to your search criteria will then be displayed. From there, you can search for and access the specific record of interest. The information that can be obtained from these records is explained later in this exercise.

6. For your assignment, use this approach to find the protein accession number for the restriction enzyme BglII. What organism does this protein come from?

7. Use the general search engine to obtain the record with the accession number M68489.

8. Once you’ve obtained the record, answer the following questions for your assignment.

Is this a nucleotide or a protein record?

From which organism was this sequence obtained?

What is the name of the gene corresponding to this sequence?
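The search described in steps 2 to 5, and the accession lookup in step 7, can also be written as query strings. The sketch below is one possible way to build the matching request URL for NCBI's E-utilities `esearch` service; it is not part of the original exercise. The helper name `build_esearch_url`, the default `retmax`, and the use of the `[Organism]`, `biomol_mrna[PROP]` and `[ACCN]` query qualifiers are assumptions made for illustration. The code only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities esearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nuccore", organism=None, mrna_only=False, retmax=20):
    """Build (but do not send) an esearch URL.

    Hypothetical helper: "nuccore" is the E-utilities name of the
    Nucleotide database; the filters mirror the menu choices in step 4.
    """
    parts = [term]
    if organism:
        # Restrict the search to one organism, like the top organisms menu.
        parts.append(f'"{organism}"[Organism]')
    if mrna_only:
        # Restrict the search to mRNA records, like the molecule type menu.
        parts.append("biomol_mrna[PROP]")
    query = " AND ".join(parts)
    params = {"db": db, "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Steps 2 to 4: myosin mRNA records from zebrafish.
myosin_url = build_esearch_url("myosin", organism="Danio rerio", mrna_only=True)
print(myosin_url)

# Step 7: look up a record directly by its accession number.
accession_url = build_esearch_url("M68489[ACCN]")
print(accession_url)
```

Opening either URL in a browser should return a list of matching record identifiers rather than the formatted pages shown on the web site.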



SEARCHING WITH A NUCLEOTIDE SEQUENCE

1. The most common search engine used with either nucleotide or protein sequences is the Basic Local Alignment Search Tool (BLAST). You can access this search engine either from the popular resources menu on the right, or through the “Resource List (A-Z)” menu on the left.


2. “Resource List (A-Z)”: This page contains most of the links you will be using throughout the year.



3. Let’s explore BLAST. Click on the BLAST link. You should obtain the following page.

BLAST is a set of similarity search engines designed to explore all of the available sequence databases regardless of whether the query is protein or DNA.

“Nucleotide blast” compares a nucleotide query sequence against a nucleotide sequence database.

“Protein blast” compares an amino acid query sequence against a protein sequence database.

“Blastx” compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.

“Tblastn” compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.

“Tblastx” compares a nucleotide query sequence translated in all reading frames against a nucleotide sequence database dynamically translated in all reading frames.
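To make “translated in all reading frames” concrete, here is a minimal sketch that splits a nucleotide sequence into the codons of its six reading frames (three on the given strand, three on the reverse complement). The function names and the short test sequence are made up for illustration, and this is not how BLAST itself is implemented. Translating each codon into an amino acid with the genetic code is the remaining step and is left out.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def codons(seq, offset):
    """Split seq into complete codons, starting at position offset (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq):
    """Return the codons of all six reading frames, keyed +1..+3 and -1..-3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(seq, offset)  # forward strand
        frames[f"-{offset + 1}"] = codons(rc, offset)   # reverse strand
    return frames


# Hypothetical 7-nucleotide sequence, for illustration only.
for name, frame in six_frames("ATGGCCT").items():
    print(name, frame)
```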



We will first use this program to gain information on different sequences that you will be working with. Note that one of these sequences represents the plasmid insert that you must verify in lab exercise 2.

4. Click on the nucleotide BLAST (Blastn) option. You should obtain the following page:

5. Before we can enter a sequence query, we must make sure that its format is compatible with the program. Most sequence analysis software can handle a format called FASTA. A FASTA file is a text file containing the sequence, without any numbers or other annotation, preceded by a descriptive line of text. Here is an example:

>John’s sequence123 (Press enter after this line)

AACGTCGGATTCAGGTACCCAGGAAAACTACATCTC

The first line of your file must begin with the symbol “>”. This symbol informs the program that this line of text is for descriptive purposes only and that the sequence information starts on the next line. You can write anything on this line to identify the sequence. The next line contains the actual sequence.
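The conversion described above can also be done with a few lines of code instead of a text editor. The following is a minimal sketch, not a full FASTA implementation: the function names `to_fasta` and `parse_fasta`, the 60-character line width, and the numbered lower-case input are assumptions chosen for illustration. It keeps every alphabetic character, so it does not check that the letters are valid nucleotides.

```python
def to_fasta(description, sequence, width=60):
    """Format a raw sequence as a FASTA record.

    Digits, spaces and line breaks in the raw sequence are dropped,
    since FASTA sequence lines must not contain numbering.
    """
    cleaned = "".join(ch for ch in sequence if ch.isalpha()).upper()
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + description.strip() + "\n" + "\n".join(lines) + "\n"


def parse_fasta(text):
    """Read FASTA text and return a list of (description, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# The example sequence from the text, written with position numbers
# the way sequences often appear in a record.
raw = "1 aacgtcggat tcaggtaccc 21 aggaaaacta catctc"
record = to_fasta("John's sequence123", raw)
print(record)
```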

6. Obtain the text document of unknown sequences available on the BIO3151 web page by following the link: Sequences > Unknown genes. This document contains five sequences numbered 1-5. Convert each of these to FASTA format. You can do this in “NOTEPAD”.



7. Copy and paste the first sequence into the nucleotide BLAST query box. Choose the database on which the search will be performed in the “Choose Search Set” menu. Choose “other” and “nucleotide collection (nr/nt)” from the drop-down menu.

8. Now choose the program to do the search from the “Program Selection” menu. Choose: “Somewhat similar sequences (blastn)”. Check the box “Show results in a new page” to display the results in a new browser window.

9. Click on “BLAST”. A new page will appear asking you to wait for the completion of your request. This may be quite fast or slow depending on how heavy the demand on the NCBI server is.



10. Once your request has been completed, a new page will appear, as shown below, indicating the results of your search.

11. Before analyzing the results, we will change the formatting options. Click on “Formatting options” at the top of the page. A new menu will appear as shown below. Choose the option “Old view” and then click on “Reformat”.



12. The potential matches to your sequence will now be presented in three formats.

A graphical format such as the following:

If you scroll down, a textual format such as this one:



And further down, the actual sequence alignments:

For this exercise, the format we are interested in is the list of different records representing matches.

Amongst the information that can be obtained are the following values:

Query coverage: This value indicates what percentage of your input sequence (the original query) aligns with the sequence record found. For instance, if the original query is 631 nucleotides long and BLAST can align all 631 nucleotides of this query against a hit, then that would be 100% coverage. Remember, query coverage does not take into account the length of the hit, only the percentage of the query that aligns with the hit.

The Expect value (E) represents the number of sequence matches (hits) with an equally good or better score that you would expect to find by chance if you were to search a database of random sequences of the same size. When E values are well below 1, they are approximately equal to the probability of finding such a match by chance. This means that if we have an “E value” of 0.01, then there is about a 1% chance that we would find an equally good match in a database of random sequences. Often E values are very low. In fact, if we have a perfect match over a long sequence, the “E value” might be given as zero. Two additional factors have a strong influence on E values: the length of the sequence and the size of the database. This is because it’s easier to find a perfect match to a shorter sequence by chance, and it’s also easier to find a chance match in a larger database.
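The relationship between E values, probabilities and database size can be sketched numerically. The code below assumes the commonly quoted simplified forms E = m × n × 2^(−S′), where m is the query length, n is the total database length and S′ is the bit score, and P = 1 − e^(−E) for the chance of at least one such hit. It ignores the length corrections that BLAST applies in practice, and the function names and input numbers are hypothetical.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """Simplified E value: E = m * n * 2**(-S'), with no length corrections."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# For small E, the probability is almost the same number as E itself.
print(p_value(0.01))

# Hypothetical numbers: the same score in a database twice as large
# gives an E value twice as large.
small_db = expect_value(bit_score=40, query_length=631, database_length=1e9)
large_db = expect_value(bit_score=40, query_length=631, database_length=2e9)
print(small_db, large_db)
```

A higher bit score lowers E exponentially. A long perfect match has a very high score, so its E value can fall below what the results page displays and be reported as zero.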



“Ident.”: BLAST calculates the percentage identity between the query and the hit in a nucleotide-to-nucleotide alignment. How do you explain the fact that more than one sequence possesses an identity of 100%?

Note that some of the sequences represent whole genome sequences, for example the first one from this search. For this exercise you wish to obtain the sequence of the gene, not the genome. Records corresponding to genes are sometimes followed by the letter “G” (there may be other letters, but ignore these). Notice in the above example that the record followed by a “G” shows 100% identity but only 42% coverage. What does that mean?
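Query coverage and identity are calculated over different things, and a short sketch can make the difference concrete. The functions below are a simplified illustration for a single aligned region and do not reproduce BLAST's own calculation, which can combine several aligned regions. The function names, the 265-nucleotide aligned region and the short aligned strings are hypothetical values chosen for illustration.

```python
def query_coverage(query_start, query_end, query_length):
    """Percentage of the query covered by one aligned region (1-based, inclusive)."""
    aligned = query_end - query_start + 1
    return 100.0 * aligned / query_length


def percent_identity(aligned_query, aligned_hit):
    """Percentage of identical positions in an alignment.

    Both strings must have the same length; "-" marks a gap and
    never counts as a match.
    """
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(
        1 for a, b in zip(aligned_query, aligned_hit) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_query)


# The whole 631-nucleotide query aligns with the hit.
print(query_coverage(1, 631, 631))

# Only the first 265 nucleotides of the query align with the hit.
print(round(query_coverage(1, 265, 631)))

# Identity is measured only inside the aligned region.
print(percent_identity("ACGT", "ACGT"))
print(round(percent_identity("ACGT-A", "ACGTTA"), 1))
```

Identity only looks inside the aligned region, while coverage reports how much of the query that region represents.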

13. Click on the accession number to view the record. You should obtain a record similar to the one shown below:

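[Figure: example nucleotide record, with the “FASTA” link labelled “To convert to FASTA” and callouts 1 to 8 marking the fields described in step 14.]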



14. Information that can be obtained from a nucleotide sequence record:

The definition (#1): Provides a brief description of the sequence; includes information such as source organism, gene name/protein name, or some description of the sequence's function.

The accession number (#2): The unique identifier for a sequence record.

Organism (#3): The formal scientific name for the source organism (genus and species).

Source (#4): Information including an abbreviated form of the organism name, sometimes followed by a molecule type.

CDS (#5): Coding sequence; the region of nucleotides that corresponds to the sequence of amino acids in a protein (the location includes the start and stop codons). By clicking on this link you may obtain the mRNA sequence from the start codon to the stop codon.

o Gene = (#6): The name of the gene.

o Product = (#7): The name of the gene’s protein product.

o Protein_id (#8): This is the protein’s accession number. By clicking on this link, you can obtain the protein record.

15. In several of the future exercises, you will be required to obtain and save these sequences in FASTA format. To change the format to FASTA, choose “FASTA” at the top of the sequence record. You should be redirected to a page like the following one:



16. You can now select and copy the description line that is preceded by the symbol “>”, together with the sequence, and paste them into a program such as “Notepad” if you wish to save the sequence in this format.
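As an alternative to copying and pasting from the record page, a record can be requested directly in FASTA format through NCBI's E-utilities `efetch` service. The sketch below only builds the request URL and does not download anything. The helper name `build_fasta_url` is hypothetical, the default database `nuccore` is the E-utilities name for the Nucleotide database, and the accession number is the one used earlier in this exercise.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fasta_url(accession, db="nuccore"):
    """Build (but do not send) an efetch URL that asks for a FASTA record.

    Hypothetical helper; use db="protein" for a protein accession number.
    """
    params = {
        "db": db,
        "id": accession,
        "rettype": "fasta",  # ask for FASTA rather than the full record
        "retmode": "text",   # plain text, ready to save in a text file
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


fasta_url = build_fasta_url("M68489")
print(fasta_url)
```

Opening the printed URL in a browser should display the record as plain FASTA text, which can then be saved as described in step 16.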

17. For your assignment, obtain the following information for each of the unknown sequences on this course’s web site (Sequences > Unknown genes):

Accession number

Coverage

Ident.

E value

The definition

The organism from which this sequence was obtained

The gene name

The gene’s product name

The protein’s accession number
