
Upload: junjunaru9435

Post on 10-Apr-2015

Bioinformatics and Homology Modeling: A Student-Tested Tutorial for Beginners

Exploring Human Visual Pigments

Introduction

This tutorial allows you to explore opsins -- the proteins that catch light for our eyes -- and the genes that code for opsins. But the real subject of this exercise is bioinformatics -- the use of computers to search for, explore, and use information about genes, nucleic acids, and proteins. While learning about the human opsins, you will use some of today's most powerful bioinformatics tools, and you will even build a model of a protein whose detailed structure is unknown (called homology modeling). You can follow up this tutorial with a study of opsins from other organisms, or by exploring any class of biomolecules that interest you.

Please realize that this tutorial merely scratches the surface of what you need to know in order to use bioinformatics wisely in your research. If you want to learn more, including vital guidance in judging the quality of your results, I recommend you turn next to Bioinformatics for Dummies, by Claverie and Notredame, Wiley Publishing, Inc., 2007.

I assume that you are conversant with biochemistry and molecular biology. If you see unfamiliar terms pertaining to the genes, mRNAs, and proteins used as examples here, break out your biochemistry text, head for the index, and review, review, review.

For more information about each database or tool, go to its home page and read, read, read. These tools come with plenty of help.

History


This web page was originally composed of somewhat sketchy procedures that I devised by playing* with bioinformatics tools on the web. For five years or so, my biochemistry students carried out the tutorial, and their suggestions led to many improvements, as have emails from users around the world.

*My play with bioinformatics tools started with the book Bioinformatics for Dummies, by Claverie and Notredame, Wiley Publishing, Inc., 2003. Not considering myself a dummy in most of my areas of interest, I had never looked very hard at Wiley's "Dummies" books. I'm so glad I looked at this one. The authors are on the frontiers of the field, and they have produced a serious, high quality book. If you want to work through lots of clear tutorials in all areas of bioinformatics, buy it. It was the best $30 I had spent on a book in quite a few years. Just click the title above to learn more about the latest edition, 2007 (only $20 now), which guided my October 2008 revisions of this tutorial.

Many thanks to Professors Claverie and Notredame for this friendly and powerful resource.


Cast of Characters

You will encounter these databases and software tools one by one as you follow this tutorial. Use this page for reference if you can't remember the meaning of an acronym or program name.

I. The Databases

GenBank, operated by NCBI (National Center for Biotechnology Information): Contains all publicly available sequences of DNA, with annotations, which are constantly being extended and updated. Annotations include identification of a gene, its gene product(s) (if known), and extensive links to all kinds of information about the gene in other databases. GenBank contains the same DNA sequence content as EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Data Bank of Japan).

OMIM (Online Mendelian Inheritance in Man -- woman, too): An encyclopedia of human genes and genetic disorders, linked to gene entries in GenBank and to scientific literature in PubMed. Gives complete and up-to-the-minute information about many human genes.

PDB (Protein Data Bank): Contains all publicly available experimentally determined (by x-ray crystallography and NMR) structural models of proteins and nucleic acids. Does not contain homology models or other types of theoretical models.


PubMed: Described in Wikipedia as "a free search engine for accessing the MEDLINE database of citations and abstracts of biomedical research articles. The core subject is medicine, and PubMed covers fields related to medicine, such as nursing and other allied health disciplines. It also provides very full coverage of the related biomedical sciences, such as biochemistry and cell biology. It is offered by the United States National Library of Medicine at the National Institutes of Health as part of the Entrez information retrieval system."

UniProt Knowledgebase (Swiss-Prot and TrEMBL), operated by SIB (Swiss Institute of Bioinformatics) and EBI (European Bioinformatics Institute): Contains most of the publicly available sequences of proteins (not DNA or RNA). Sequences in Swiss-Prot are annotated manually, and provide or link you to just about all published information about the sequence. Sequences in TrEMBL are collected and annotated automatically from sequence databases, and will make their way to Swiss-Prot, but only after they are manually annotated to meet Swiss-Prot standards.

II. The Tools

BLAST (Basic Local Alignment Search Tool): For searching databases to find genes or proteins with sequences similar to yours

ClustalW: For comparing your sequence with others, or lots of sequences with each other

DeepView (also known as Swiss-PdbViewer): For seeing and exploring macromolecular models in three dimensions, and for manual and semiautomated homology modeling

ExPASy (Expert Protein Analysis System): Not so much a tool as a tool box -- a very complete set of protein-analysis tools

NCBI Map Viewer: For finding genes and gene products (RNAs and proteins) that interest you, and for seeing where they lie on the set of chromosomes for each organism

PubMed: For searching ALL the literature of the life sciences

Phylip: For making rigorous phylogenetic trees when you want to control all the parameters

Phylodendron: For printing phylogenetic trees using data

PhyML: For making rigorous phylogenetic trees automatically by a maximum-likelihood method -- probably the best, but the slowest

Swiss-Model and the Swiss-Model Workspace: For automated building of theoretical structural models of your sequence based on known structures (homology modeling)


Tcoffee: Like ClustalW, a tool for sequence comparisons, but more powerful, and can use known structures to improve the comparisons


Start Finding Genes

Preview

In this section, you will use the NCBI Map Viewer and the keyword opsin to get a list of opsin or opsin-related genes in the human genome.

Human Opsins

The subject of this tutorial is human opsins, which are found in the cells of your retina. Opsins catch light and begin the sequence of signals that result in vision. We will proceed by asking questions about opsins and opsin genes, and then using bioinformatics to answer them.

When I provide a web address, I'll also make it a link -- just click it to go to the site in a new browser window. Then make it a bookmark so you can find it again. This tutorial will still be open in the window behind the new one.

WARNING: Bioinformatics tools evolve rapidly, faster than I can make changes to this tutorial. So if a page does not look exactly like I say it should, or if its title is different, look around and try to do what the tutorial says. You should find the same links, but names may be slightly different, or many new links may have been added (bioinformatics pages never get simpler). If the differences are so great that you can't proceed, send me email (see contact link at top of Tutorial Contents), and I'll adapt the instructions to the changes as soon as I learn about them.

Where are the opsin genes in the human genome?

Point your browser to http://www.ncbi.nlm.nih.gov/mapview/. You find a list of organisms for which genome information is available. In the right-hand columns beside each organism are links to tools. Hold your mouse pointer on each tool symbol for a brief description of what it does.

Find Homo sapiens (human), and click on the magnifier tool beside the lowest-numbered Build (a build is an assembly of the genome, which is done repeatedly). We will use the older build because sometimes not all searching and viewing tools are connected to the newest build, which is in progress. The magnifier tool takes you to the Search Page for the organism, which shows a chromosome diagram, and provides input boxes (at top of page) for searches.

In the box next to Search for, enter opsin.

Click Find.

You see the diagram again, with red marks at your "hits", the locations of genes whose entries contain "opsin" as a whole or partial word. Below the diagram is a list of the indicated genes.

If the list is very long, simplify it using the Quick Filter box on the right at the top of the list; check the box marked Gene, and then click Filter. If you are already seeing the filtered list, the Quick Filter box will not be present.

In the list of genes related to the search term opsin, there are the rhodopsin gene (RHO), and three cone pigments, short-, medium-, and long-wavelength sensitive opsins (for blue, green, and red light detection). Four hits look like visual pigments, which should not surprise you. To the left of each entry is the chromosome number, allowing you to tell which red mark corresponds to each entry. Note that several hits are on the X chromosome, one of the sex-determining chromosomes.

NOTE: In the human genome lists, you will often see duplicates marked reference or Celera, referring to the results from two major efforts to sequence the human genome. At first, these two efforts were separate, but eventually they came together. When you have a choice, choose "reference," so you will be following the same path I followed in setting up the tutorial.

You can get more details on multiple hits on the same chromosome with the all matches link for that chromosome. Click all matches next to X. Be patient: the next page may load slowly--it's packed with information.

You see a very complicated display (don't sweat -- we're going to use only a part of this). On the left is a diagram of the X chromosome, with red marks at the positions of the gene(s) you've followed to this page -- in our case, the two opsins, medium- and long-wave, which are located near the bottom tip of the X chromosome. To the right are various representations of the X chromosome, with listings of annotated areas. The two opsin genes are highlighted in pink. If you pass your cursor over this page without clicking, you will find that some symbols provide brief information, mostly about regions that are not yet characterized well enough to have a full entry.

As you can see, there is a tremendous amount of information on this page, with links to much more. If you want full information about the meanings of abbreviations and symbols on this page, as well as the kinds of information linked to the page, you can use Map Viewer Help at the top of the page. You will find abundant information about the Map Viewer, explanations of all symbols and links, and even tutorials about how to ask and answer all kinds of questions about the genome. The Map Viewer is like the Google Earth of the genome, and as with Google Earth, the amount of information is sometimes daunting.

For now, note the information provided for the opsin gene OPN1LW (called the gene symbol). You see that this is the long-wavelength-sensitive (red) opsin, and that it's a gene involved in color blindness (a sex-linked trait -- no surprise, because we find the gene on the X chromosome).


All About a Gene

Preview

In this section, you will explore a few links to extensive information about specific genes.

What do scientists know about the opsins?

On the MapViewer page, click OPN1LW.

You have entered the OPN1LW opsin 1 page of Entrez Gene, which is a sort of highway interchange with routing to all sorts of information about this gene. Scan down the page. Some of the information is very plain and understandable, while some is very cryptic. One of the most accessible links is to OMIM (Online Mendelian Inheritance in Man), a catalog of human genes and genetic disorders. Despite the name, the database includes genes of women, too.

Look down the page and find the Phenotypes section, and notice the links marked MIM. These are links to OMIM entries. Click one of them.

Each OMIM entry tells you about this gene and types of colorblindness, genetic disorders associated with mutations in this gene. Read as much as your interest dictates. Follow links to other information. For more information about OMIM itself, click the OMIM logo at the top of the page. Through OMIM, a wealth of information is available for countless genes in the human genome, and all information is backed up by references to the latest research articles.

Once you've satisfied your appetite, return to the Entrez Gene page (use the Back button of your browser or your browser's history list).

Next to the Display button, pull down the menu and select PubMed (calculated) Links.


You have entered PubMed, a free database of scientific literature, and you are looking at the results of a complete search for articles directly associated with this gene locus. By clicking on the authors of each article, you can see the abstract of the article. If you are on a university campus where there is online access to specific journals, you might also see links to full articles. PubMed is your entry point to a wide variety of scientific literature in the life sciences. On the left side of any PubMed page, you will find links to a description of the database, help, and tutorials on searching.

Now return to the Entrez Gene page for OPN1LW opsin 1.


Finding Sequences

Preview

In this section, you will learn how to obtain nucleic acid or protein sequence information, in a format called FASTA, that is easy to use as input into bioinformatics tools.

What is the nucleotide sequence of this gene?

Remember that you are looking at information about the gene for the red-sensitive opsin in human vision, and it is located near the bottom tip of the X chromosome. On the Entrez Gene page for OPN1LW opsin 1 scroll farther down (way down!) to NCBI Reference Sequences (RefSeq). In the first subsection, mRNA and Protein(s), all of the following are available:

the mRNA Sequence (sequence of nucleotide bases in the messenger RNA), here listed as NM_020061.3 (M for mRNA);

the protein sequence (sequence of this gene's protein product, the red opsin), here listed as NP_064445.1 (P for protein);

the source sequences (entire sequences of all of the overlapping genome fragments in which this sequence was found, from GenBank).

Note that the two links to mRNA sequence and protein sequence are given as NM_020061.3→NP_064445.1, the arrow implying that the sequence of the NM entry is translated (by protein synthesis) to give the sequence of the NP entry.
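To make the arrow concrete, here is a minimal Python sketch of translation using the standard genetic code. The function name translate and the short test sequence are made up for illustration and are not part of any NCBI tool. A real mRNA entry such as NM_020061.3 also contains untranslated regions, so you would have to start at the annotated coding region rather than at the first base; this sketch assumes you hand it the coding sequence only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_sequence):
    """Translate a coding sequence into one-letter amino acids, up to the first stop codon.

    Accepts T (database convention) or U (RNA convention).
    """
    seq = coding_sequence.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


# A made-up 12-base coding sequence: ATG GCC CAG TAA
print(translate("ATGGCCCAGTAA"))  # prints MAQ
```

The table is built from a compact 64-letter string rather than typed out codon by codon, which makes it easier to check against the genetic code table in your biochemistry text.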

Click the entry number for the mRNA sequence: NM_020061.3

This is a typical GenBank nucleotide file, and a lot of it is hard to read, but a few things are clear. First note, under references, citations to the publication of this sequence in the scientific literature. To see an abstract of the article in which this gene was described, click the PubMed link (a number) below the first reference and read it.

Scroll to the bottom of this long page. The last thing, labeled ORIGIN, is the sequence of this messenger RNA. You are seeing the actual list of As, Ts, Gs, and Cs that make up the message for synthesis of this opsin. But wait! You know that RNA contains no T. In most nucleotide databases, U from RNA is represented as T, to make for easy comparison of DNA and RNA sequences. This sequence information is not in the form that is most useful for searching in databases, say, searching for related genes. Let's display this entry in a form more useful for searching.

At the top of the page, beside the Display button, pull down the menu that says GenBank (the default display format for each entry), and select FASTA (note that several other display options are available). Now you see one descriptive or "comment" line that begins with ">", followed by the nucleotide sequence. This little bit of text is just what you need to search nucleotide databases for similar sequences.

Keep it for future use, as follows. Click and drag on the web page to select everything from the ">" through the last nucleotides (CCAA). Be careful not to select anything else. From your browser's Edit menu, select Copy to make a copy of this information on your clipboard, for pasting elsewhere. Now start a simple word processor (use TextEdit on Mac or Notepad on Windows, to avoid inadvertent changes in crucial formatting of sequence files), make a new document, and paste. The FASTA comment and sequence should appear. If necessary, select all of the text and change the font to Courier or Monaco -- these "typewriter" fonts make it easy to align letters into columns, because all letters are the same width. Save this file, choosing text or plain text as the file type. Call it mrnared.txt (for mRNA sequence of red opsin). Save it to a convenient location for this and other files you'll be making for later searches.
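If you later want to read such a file with a program instead of by eye, the FASTA layout is simple enough to parse in a few lines. The following Python sketch is one possible way to do it; the function name parse_fasta and the example record are hypothetical, and the sequence shown is invented, not the red opsin mRNA.

```python
def parse_fasta(text):
    """Return a list of (comment, sequence) pairs from FASTA-formatted text."""
    records = []
    comment, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if comment is not None:
                records.append((comment, "".join(chunks)))
            comment, chunks = line[1:], []
        elif comment is None:
            raise ValueError("FASTA text must begin with a '>' comment line")
        else:
            chunks.append(line)  # sequence may be wrapped over many lines
    if comment is not None:
        records.append((comment, "".join(chunks)))
    return records


# Invented example in the same layout as mrnared.txt.
example = """>example_1 a made-up record, not a real database entry
ATGGCCCAG
CAGTGG
"""

for comment, sequence in parse_fasta(example):
    print(comment, len(sequence))
```

The check for a leading ">" mirrors the advice above: if anything other than the comment line comes first, the file is not valid input for most tools.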

Click your browser's Back button until you return to the Entrez Gene page for this gene.

What is the amino-acid sequence of this gene?

Under NCBI Reference Sequences (RefSeq), click the entry number NP_064445.1 for the protein sequence.

Things look a lot like before, but this is a protein entry (the classical view is that gene products are proteins, but many are not), containing the amino-acid sequence in one-letter abbreviations. Just as with the mRNA entry, turn this into a FASTA display, and copy it into a new word-processor document. Save it in text format as protred.txt (for protein sequence of red opsin). Return to Entrez Gene.

What does the neighborhood of this gene look like?


(Get ready for a surprise. Hint: OPN1LW is a human gene, and humans are eucaryotes. When people began to sequence eucaryotic genes, what big surprise was in store for them?)

Now take a look at the chromosome region that contains the red opsin gene. Scroll back to near the top of the Entrez Gene page for OPN1LW, to the section called Genomic context. The diagram shows you that the red opsin gene lies on the X chromosome, within a segment of base pairs (bp) stretching from position 152,929,151 to position 153,114,725 (a distance of 185,574 bp). [Don't worry if these numbers are not exactly the ones you see; these resources are constantly being updated.] The location of OPN1LW, shown as a red arrow, is about 3/4 of the way down this segment.

Now look at the diagram in the preceding section, Genomic regions, transcripts, and products. This diagram gives a closer look at the OPN1LW segment, representing only positions 153,062,939 to 153,077,701 (14,762 bp). The lower line shows coding regions as red blocks, noncoding regions as red lines. Here is the surprise: You knew, but you might have forgotten, that eucaryotic genes are often interrupted by non-coding regions called intervening sequences or introns. The coding regions are called exons. From this diagram, you can see that the OPN1LW gene consists of 6 exons and 5 introns, and that the introns are far larger than the exons. Of the 14,762 bp in the "gene", only 1095 bp code for protein, which means that less than 8% of the base pairs contain the code. When this gene is expressed in cells in the human retina, an RNA copy of the entire gene is synthesized. Then the intron regions are cut out, and the exon regions joined together to produce the mature mRNA (a process called splicing), which will be translated by ribosomes as they make the red opsin protein. In this case, 92% of the initial RNA transcript is tossed out, leaving the pure protein code. Seems wasteful, but our understanding of how all this works, while impressive, is still pretty fragmentary.
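The arithmetic behind these percentages is easy to check for yourself. This short Python sketch uses only the numbers quoted above; the helper name span_length is made up, and it takes the simple difference between positions, as the text does (counting both end positions would add 1 bp).

```python
def span_length(start, end):
    """Distance in base pairs between two chromosome positions (simple difference)."""
    return end - start


# Numbers quoted in the text; the live pages may show slightly different values.
context_bp = span_length(152_929_151, 153_114_725)  # Genomic context segment
gene_bp = span_length(153_062_939, 153_077_701)     # OPN1LW segment
coding_bp = 1095                                    # base pairs that code for protein

coding_percent = 100 * coding_bp / gene_bp

print(context_bp)                 # 185574
print(gene_bp)                    # 14762
print(round(coding_percent, 1))   # 7.4
```

So a little over 7% of the gene codes for protein, consistent with "less than 8%", and the rest (roughly 92%) is removed during splicing.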

Tomorrow will tell us what eludes us today, but not what eludes us tomorrow.

At the ends of the lower line in the diagram, there are links to NM_020061.3 and NP_064445.1, the entries for the mRNA and protein sequences for this gene. You visited these pages in the two sections above. Click CCDS 14742.1 at the far right of the diagram to go to the Consensus Coding Sequence page for this gene. It shows nicely how the OPN1LW gene transcript is divided into exons. Under Chromosomal Locations for CCDS 14742.1 is a table listing start and end base-pair positions for each exon. Below that is the full nucleotide sequence of the mature mRNA, with alternating blue and black sections indicating exon boundaries. Farther below is the amino-acid sequence, again divided into exons by alternating blue and black, with red indicating amino-acid residues whose codons are partly in one exon and partly in the following exon. This makes it dramatically clear how the mRNA is pieced together from the exons.

You still have not seen any of the actual sequences of the introns. Return to the Entrez Gene page for OPN1LW. Under Genomic regions, transcripts, and products, click Go to reference sequence details. This takes you down the page to NCBI Reference Sequences. You were here before, to retrieve the mRNA and protein sequences. This time, click the sequence of four entry numbers (all one link) beside Source Sequence(s). This takes you to the Entrez Nucleotide page that contains information about all four of the genome fragments from the Human Genome Project that contain all or part of the red opsin gene, along with information about how each clone was produced. This entry thus shows the gene in the larger context of the cloned fragments in which the gene was found. These sequences allow you to explore flanking regions around the gene, which might be useful in designing PCR primers for making useful quantities of this region. From this page, you could also find neighboring sequences if you wanted to look farther afield. As before, you can display this entry in FASTA format. You will get a series of entries, each a different clone that was used to construct this region of the genome.


First BLAST Search

Preview

In this section, you will use a FASTA sequence as an input (query) to BLAST, a program that searches a genomic database for similar sequences (hits). You will also learn how to judge whether a hit arises by chance or by common ancestry.

What proteins in humans are similar to the red opsin?

Now return to the NCBI Map Viewer. You will search the human genome for sequences similar to that of the red opsin.

Click the BLAST symbol (circled B) next to Homo sapiens (human).

This is the NCBI's BLAST search tool. BLAST is a widely used program for finding sequences similar to a "query" sequence that you're interested in. Pick these options from the various menus:

Database: Build Protein for PREVIOUS build (look at bottom of the Database menu). This means that you will search the protein sequences in the previous build of the database. (Sometimes not all tools needed later are available in the latest build, which is currently under construction.)

Program: BLASTP (Use the version of BLAST that compares protein sequences, unlike BLASTN, which compares nucleotide sequences.)

Other Parameters: Make no changes.

Next, copy the FASTA data from your file protred.txt to your clipboard, and paste it into the BLAST search box, above which it says, "Enter an accession..." Check to be sure that the first character in the box is the ">" at the beginning of the FASTA data. Then click Begin Search.

The next page is for formatting your search results. Accept all default settings, and just click the View Report button. When your results are ready, the results of BLAST page appears. Look down the page to the Graphic Summary, a box containing lots of colored lines. Each line represents a hit from your blast search. If you pass your mouse cursor over a red line, the narrow box just above the box gives a brief description of the hit. You'll find that the first hit is your red opsin. That's encouraging, because the best match should be to the query sequence itself, and you got this sequence from that gene entry. The second hit is the green opsin -- remember that the PubMed entry reported that the red and green pigments are the most similar. The third and fourth hits are the blue opsin and the rod-cell pigment rhodopsin. Other hits have lower numbers of matching residues, and are color coded according to a score of matches. If you click on any of the colored lines, you'll skip down to more information about that hit, and you can see how much similarity each one has to the red opsin, your original query sequence. As you go down the list, each succeeding sequence has less in common with red opsin. Each sequence is shown in comparison with red opsin in what is called a pairwise sequence alignment. Later, you'll make multiple sequence alignments from which you can discern relationships among genes.

See what you can figure out about what the scores mean. Identities are residues that are identical in the hit and the query (red opsin), when the two are optimally aligned. Positives are residues that are very similar to each other (see residue number 1 in the blue opsin—it's threonine in red opsin, and the very similar serine in the blue). Gaps are sometimes introduced into a hit to improve its alignment with the query. The more identities and positives, and the fewer gaps, the higher the score. Note that blue opsin and rhodopsin are only about 45% identical to the red opsin. Other proteins, which are apparently not visual pigments, have even lower scores.
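If you want to see exactly how identities and gaps are counted, the following Python sketch tallies them for one pairwise alignment. The two aligned rows are invented for illustration (they are not taken from the opsin results), and the function name alignment_stats is hypothetical. The percentage is taken over the full alignment length, gap columns included, which is how the BLAST report expresses it.

```python
def alignment_stats(query_row, hit_row):
    """Count identities and gap columns in two equal-length aligned rows ('-' marks a gap)."""
    if len(query_row) != len(hit_row):
        raise ValueError("aligned rows must have the same length")
    identities = 0
    gaps = 0
    for q, h in zip(query_row, hit_row):
        if q == "-" or h == "-":
            gaps += 1
        elif q == h:
            identities += 1
    length = len(query_row)
    return {
        "identities": identities,
        "gaps": gaps,
        "length": length,
        "percent_identity": 100 * identities / length,
    }


# Invented rows: the first column pairs threonine (T) with serine (S),
# a similar but not identical residue, and one gap was inserted in the query.
stats = alignment_stats("TKLVA-GW", "SKLIAAGW")
print(stats["identities"], stats["gaps"], stats["percent_identity"])  # 5 1 62.5
```

Counting positives as well requires a substitution matrix, because "very similar" has to be defined residue pair by residue pair.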

Interlude: Expectation Values and Blast Scores

The displays contain two prominent measures of the significance of the hit, 1) the BLAST Score [labeled Score (bits)], and 2) the Expectation Value (labeled Expect or E).

The BLAST Score indicates the quality of the best alignment between the query sequence and the found sequence (hit). The higher the score, the better the alignment. Scores are reduced by mismatches and gaps in the best alignment. Calculation of the score is complex, involving a substitution matrix, which is a table that assigns a score to each pair of residues aligned. The most widely used matrix for protein alignment is known as BLOSUM62.
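To show how a substitution matrix turns an alignment into a number, here is a minimal Python sketch that adds up a raw score column by column. Everything in it is an assumption made for illustration: the tiny matrix covers only the residues in the invented example, its values were typed by hand with the intention of matching the corresponding BLOSUM62 entries (check them against the published matrix before relying on them), and the flat gap penalty of 12 per gap column is a simplification of the open-plus-extend gap costs that BLAST actually uses. The raw score computed here is also not the same thing as the bit score in the report, which is a rescaled version of it.

```python
# Hand-typed excerpt of substitution scores, for illustration only.
# Keys are residue pairs in alphabetical order.
TOY_MATRIX = {
    ("A", "A"): 4, ("G", "G"): 6, ("I", "I"): 4, ("K", "K"): 5,
    ("L", "L"): 4, ("S", "S"): 4, ("T", "T"): 5, ("V", "V"): 4,
    ("W", "W"): 11, ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
    ("S", "T"): 1,
}


def pair_score(a, b, matrix=TOY_MATRIX):
    """Look up the score for a residue pair, regardless of order."""
    return matrix[tuple(sorted((a, b)))]


def score_alignment(query_row, hit_row, gap_penalty=12):
    """Return (raw_score, positives) for two equal-length aligned rows.

    Positives are columns whose substitution score is greater than zero,
    so identical residues count as positives too.
    """
    if len(query_row) != len(hit_row):
        raise ValueError("aligned rows must have the same length")
    raw_score = 0
    positives = 0
    for q, h in zip(query_row, hit_row):
        if q == "-" or h == "-":
            raw_score -= gap_penalty  # simplified: same cost for every gap column
            continue
        s = pair_score(q, h)
        raw_score += s
        if s > 0:
            positives += 1
    return raw_score, positives


# Invented rows: T/S and V/I are similar pairs, and one gap is in the query.
print(score_alignment("TKLVA-GW", "SKLIAAGW"))  # (22, 7)
```

Notice how the rare residue tryptophan (W) contributes far more to the score than a common one such as alanine (A): matching a rare residue is stronger evidence of relatedness.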

The expectation value E of a hit tells whether the hit is likely to result from chance likeness between hit and query, or from common ancestry of hit and query. (If E is smaller than 10^-100, it is sometimes given as 0.0.) The expectation value is the number of hits you would expect to occur purely by chance if you searched for your sequence in a random genome the size of the human genome. E = 25 means that you could expect to find 25 matches in a genome of this size, purely by chance. So a hit with E = 25 is probably a chance match, and does not imply that the hit sequence shares common ancestry with your search sequence. Expectation values of around 0.1 may or may not be biologically significant (other tests would be needed to decide). But very small values of E mean that the hit is biologically significant; that is, the correspondence between your search sequence and this hit must arise from common ancestry of the sequences, because the odds are simply too low that the match could arise by chance. For example, E = 10^-18 for a hit in the human genome means that you would expect only one chance match in one billion billion different genomes the same size as the human genome.
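The link between the bit score and E can be sketched with the standard relation E = m x n x 2^(-S'), where S' is the bit score, m is the query length, and n is the total length of the database searched. The Python below is a rough illustration under that relation only: BLAST itself uses corrected "effective" lengths, so the numbers in a real report will differ somewhat, and the query and database sizes used here are arbitrary values chosen for the example.

```python
import math


def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def bits_needed(e_value, query_length, database_length):
    """Bit score required to reach a given E for the same search space."""
    return math.log2(query_length * database_length / e_value)


# Arbitrary illustrative sizes: a 364-residue query against 10 million residues.
m, n = 364, 10_000_000

for bits in (20, 40, 60, 100):
    print(bits, expected_chance_hits(bits, m, n))

# How many bits would a hit need for E = 10^-18 in this search space?
print(bits_needed(1e-18, m, n))
```

Two things follow directly from the formula: every additional bit of score halves E, and searching a database twice as large doubles E for the same alignment. That is why the same hit can look more or less significant depending on what you searched.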

The reason we believe that we all come from common ancestors is that massive sequence similarity in all organisms is simply too unlikely to be a chance occurrence. Any family of similar sequences across many organisms must have evolved from a common sequence in a remote ancestor.

One place to find out more about BLAST searches and statistics is The BLAST Sequence Analysis Tool in the NCBI Handbook.

Now you will see where all these hits are found on human chromosomes.

Where (in the human genome) are all the genes for these other proteins?

Just above the Graphic Summary, click Human Genome View.

You have come full circle. You are back at the human chromosome diagram, and you see all the hits of your search, in the colors that signify their BLAST scores as they were shown in the Graphic Summary. Notice that there are about 100 proteins that have 40% or more positives in alignment with red opsin. The opsins are members of the much larger family of G protein-coupled receptors, key players in signal transduction.


Family Relations

Preview

In this section, you will learn how to gather a group of related sequences in FASTA format, and then use them as inputs to the program ClustalW. The result is a multiple-sequence alignment (MSA), from which you can deduce much about how the sequences resemble and differ from each other. Then you will use the MSA as input to tree-printing programs, in order to produce a phylogenetic tree -- a visual summary of relationships among the genes.

How are the opsin genes related to each other?

Answering this question requires making a multiple sequence alignment and then using it to make a phylogenetic tree. For these tasks, you move to another database where it's a little easier to gather a bunch of sequences into a single FASTA file.

Point your browser to http://www.expasy.ch/.

You see the home page of ExPASy, the Expert Protein Analysis System. As stated in the Cast of Characters, ExPASy is a complete protein tool box. With ExPASy, you can do almost any imaginable analysis or comparison of protein sequences and structures. In my humble opinion, Swiss sequence database tools are among the easiest ones to use.

Click UniProt Knowledgebase (SwissProt and TrEMBL) under Databases.

Read the introduction to these databases. They are high quality protein (not nucleic acid) sequence databases with abundant annotation, minimal redundancy, and many connections to other databases.

Click New UniProt Website. The new (2008) home page of UniProt contains links to information about the resource. Click to learn more about the site, and then return to this page. Bookmark this page (UniProt Welcome) as a good starting point for future use of UniProt, Swiss-Prot, or TrEMBL.

At the top of the page is a deceptively simple but powerful search tool. A menu lets you choose among data sets to search. Take a look at the list on the menu, but return it to Protein Knowledgebase (UniProtKB).

In the Query box, type opsin. Click Search. The search produces over 4000 entries, all of which are protein entries that are opsins or include the word or fragment -opsin-. Obviously, you need to be more specific.

Limit the search to human opsins, as follows. Click Fields, beside the Query box. The Search area expands to include a logical operator menu (with default operator AND), a Field menu, and a Term box. Under Field, pick Organism. In the Term box, start typing human. As you type, the search tool helpfully shows you all allowed search terms that fit what you have typed so far. As soon as human [9606] appears, click it to enter it in the Term box, and click Add and Search.

Notice that the Query box now says "opsin AND organism: human [9606]". This shows that you have limited your search to opsin-related entries that are also (AND) human proteins. Notice also that the Fields link is available again, so that you could add additional terms to your search, with logical operators AND, OR, and NOT to specify how to use the additional terms. But the search is already specific enough to make our task easy: there are only 25 results for this search.

Before looking at the results, look at the other Fields you can search. UniProt entries are files that are divided into sections, called fields, each containing specific kinds of information. You can limit searches to terms that reside in specific fields, or can simply search for your query in entire entries.
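If the combination of a free-text query with a field restriction seems abstract, here is a toy sketch of the same logic. It is not how UniProt works internally; the records, the field names, and the function `search` are all made up for illustration.

```python
# Toy records standing in for database entries; field names are illustrative only.
ENTRIES = [
    {"entry_name": "OPSD_HUMAN", "protein": "Rhodopsin", "organism": "human"},
    {"entry_name": "OPSX_HUMAN", "protein": "Visual pigment-like receptor peropsin", "organism": "human"},
    {"entry_name": "OPSD_BOVIN", "protein": "Rhodopsin", "organism": "bovine"},
    {"entry_name": "HBB_HUMAN", "protein": "Hemoglobin subunit beta", "organism": "human"},
]

def search(entries, text, organism=None):
    """Return names of entries containing `text` in any field AND matching `organism`, if given."""
    text = text.lower()
    hits = []
    for entry in entries:
        in_any_field = any(text in value.lower() for value in entry.values())
        organism_ok = organism is None or entry["organism"] == organism
        if in_any_field and organism_ok:
            hits.append(entry["entry_name"])
    return hits

print(search(ENTRIES, "opsin"))                    # query over entire entries
print(search(ENTRIES, "opsin", organism="human"))  # same query AND organism field
```

The second call returns fewer hits than the first, which is the same narrowing you saw when you added the Organism field to your query.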

Now look over the results. On 2008/09/19, this search gave 25 hits, including the rod pigment rhodopsin (OPSD), along with the three cone pigments (OPSB, OPSG, OPSR). There is also a "visual pigment-like receptor peropsin", OPSX, which still, more than ten years after its discovery in the genome, is of unknown function. In the rest of this tutorial, you will include this mysterious protein in your inquiries into the visual pigments of the human retina.

Digression

Now you will digress briefly from the question of how these proteins are related evolutionarily, and find out more about peropsin. In the process, you will glimpse the wealth of information in, and linked to, a typical UniProt entry.

In the Accession column, click O14718, next to OPSX_HUMAN.

By the way, an accession number such as O14718 can be used as an input to almost any ExPASy tool for analysis of the corresponding sequence.

You see the UniProtKB View of entry O14718 [note: that first character is capital letter O, not zero (0)]. Peruse this entry and try to find out just what this rhodopsin-like protein is thought to do. Under General annotation (Comments), you'll learn that it is found in the retina (the RPE or retinal pigment epithelium), and that it may detect light, or perhaps monitor levels of retinoids, the general class of compounds that are the actual light absorbers in opsins. Also under Similarity in the same section, you see, as mentioned earlier, that this protein is a member of the large family of G protein-coupled receptors (GPCRs). If you click G-protein coupled receptor 1 family, you conduct a search for members of this family; the result is about 10,000 hits in UniProt. Limit this search to humans (about 1200 hits). Back on the O14718 page, click Opsin subfamily to find a list of all purported members of this subfamily in UniProt (about 220). Limit the search to humans (fewer than 20).

Once again, back up to the UniProtKB entry page for O14718.

Under References find the journal citation, "Peropsin, a novel visual pigment-like protein located in the apical microvilli of the retinal pigment epithelium.". Click the PubMed link with that reference to see an abstract of the paper. On the abstract page, click one of the Free Full Text Article links to obtain the full paper from either the journal (PNAS) or from PubMed Central, which distributes many articles. Like many journals, PNAS puts full articles online just 6 to 12 months after publication.

Return to O14718, and look around more on the entry page. You will find Cross-references to this protein or its gene in other databases, predicted structural features of the protein, and the sequence, which you can lift in FASTA format if you want to search for more of its relatives. Note also links to a number of ExPASy tools listed for further analysis of this sequence.

Try one of them: under Cross-references, find PROSITE, and click Graphical view.

You now have a form that allows you to search for signatures of function or functional sites in peropsin. Leave all settings as they are, and click scan next to the graphical image (green) of the protein. Here is another form, with the accession number O14718 already entered. Again, leave all other settings as they are (but notice that there are many ways to modify this search), and click START THE SCAN.

PROSITE finds three identifiable things about this sequence. One "hit by profile" identifies peropsin as a G-protein coupled receptor. Two "hits by pattern" are shown. One is a short sequence that also identifies peropsin as a GPCR, while the second hit identifies a binding site for retinal. So PROSITE indicates that, like its visual opsin relatives, peropsin also binds specifically to retinal, the visual pigment that we make from vitamin A. Note also that, by similarity to other related proteins, PROSITE predicts the presence of a disulfide bond, between residues 98 and 175.

(Later, you will find out more about the three-dimensional structure of peropsin by building a model of it. You will use a related protein of known structure as a template for making this model. This process is called homology modeling.)

End of Digression

Next you will answer the main question of this section: how are the visual pigments (and peropsin) related to each other? Apparently, they diverged from a common ancestral opsin, but you can get a much clearer picture of which of these opsins came first, and which are the most closely related. To answer this question, you will align all their sequences (called a multiple sequence alignment) and then produce a little family tree. UniProt provides easy access to ClustalW, which does multiple-sequence alignments in a snap, as well as the information needed to print a phylogenetic tree from the alignment information.

Return to the UniProt search results, with its 25 hits for entries from the human genome that include the description "opsins". Your next task is to compare the sequences of peropsin and four visual pigments. Start by clicking to put check marks in the left-hand column of the results table, beside the first four entries (rhodopsin and the blue-, red-, and green-sensitive opsins) and also in the row for peropsin, O14718. As you put in the first check mark, a green band appears at the bottom of the window, providing a tool bar with options for handling multiple sequences. After you have checked the entries as instructed, click the Align button in the green tool bar. This is a request to use ClustalW to make a multiple-sequence alignment using the selected entries.

The Clustalw results page appears. At the top, in the Sequences box, are FASTA-format listings of all the sequences compared. Take a moment to edit this listing to make subsequent alignments and trees easier to interpret. In the FASTA sequences listed in the Sequences box, make the following changes:

1. Change P03999 to Blue
2. Change P08100 to Rhodopsin
3. Change P04001 to Green
4. Change P04000 to Red
5. Change O14718 to Peropsin

After editing, click Align to redo the alignment with new headings.
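If you ever need to make the same substitutions on a larger set of sequences, editing by hand gets tedious. Here is a minimal sketch that does it for you. The accession-to-name map comes from the list above; the function `relabel_fasta`, the header layout in the example, and the truncated sequences are my own assumptions for illustration.

```python
# Accession-to-label map taken from the list of changes above.
LABELS = {
    "P03999": "Blue",
    "P08100": "Rhodopsin",
    "P04001": "Green",
    "P04000": "Red",
    "O14718": "Peropsin",
}

def relabel_fasta(fasta_text, labels):
    """Replace known accession numbers in FASTA header lines; leave sequence lines alone."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):  # header (comment) line
            for accession, label in labels.items():
                line = line.replace(accession, label)
        out.append(line)
    return "\n".join(out)

# Made-up, truncated sequences; the header layout is an assumption.
example = ">sp|P03999|OPSB_HUMAN\nMRKMSEEEF\n>sp|O14718|OPSX_HUMAN\nMLRNNLGNS\n"
print(relabel_fasta(example, LABELS))
```

Only lines that begin with > are touched, so a sequence can never be altered by accident.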

To save this alignment in a form needed for the next section, click the orange TEXT button to the right of Clustalw Results. Copy the text file that is displayed, paste it into a new text file, and name it OpsinMSAEdited.txt. Now back up to the Clustalw Results page.

Below the table that names each opsin with your new headings is the multiple sequence alignment. In blocks of 60 residues, Clustalw has aligned five sequences. Below each column of five residues, symbols indicate how closely the residues match across the five proteins. "*" means all 5 aligned proteins have the same amino-acid residue in this position (fully conserved residues, within this group); ":" means that all residues in this position are very similar in size, charge, and polarity (replacements are very conservative); "." means that they are sort of similar (somewhat conservative replacements); and no symbol means that the residues in that position vary greatly in properties (nonconserved residues). (What does each symbol suggest about the importance of that residue to the function of this protein family?)
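To see how such a symbol line could be produced, here is a rough sketch. The "strong" and "weak" residue groups are the ones I understand ClustalW to use, so treat both lists as an assumption and check them against the ClustalW documentation before relying on them. The function names and the toy alignment are made up.

```python
# Residue groups as I understand the ClustalW convention (an assumption, not verified here).
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
        "NDEQHK", "NEQHRK", "FVLIM", "HFY"]

def column_symbol(column):
    """Return '*', ':', '.', or ' ' for one alignment column given as a string."""
    residues = set(column)
    if "-" in residues:          # a gap in the column: no symbol
        return " "
    if len(residues) == 1:       # every sequence has the same residue
        return "*"
    if any(residues <= set(group) for group in STRONG):
        return ":"
    if any(residues <= set(group) for group in WEAK):
        return "."
    return " "

def consensus_line(aligned):
    """Build the symbol line for a list of equal-length aligned sequences."""
    return "".join(column_symbol("".join(col)) for col in zip(*aligned))

toy = ["MKSTWD", "MRATWL", "MKGAFG"]  # made-up three-sequence alignment
for seq in toy:
    print(seq)
print(consensus_line(toy))
```

The point is simply that each symbol is computed one column at a time, from nothing more than the set of residues found in that column.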

At the bottom of the results page are several tool bars. Play with the first two to see what they do. You will find that they modify the display of the multiple-sequence alignment to highlight residue types or signatures of protein function. Using these tools, you can get a general picture of similarities and differences among the proteins. But the comparison can be made much more explicit by using it to make a phylogenetic tree for this group of proteins. The last tool bar should provide a ClustalW tree, but as of 2008/09/20, clicking show on this toolbar opened up a blank space, and the tree never appeared (if it's fixed now, send me email to let me know).

As you can see at the bottom, this page provides the information needed to print a tree, and a tool at Indiana University can use that information. Unfortunately, this tree is not a true phylogenetic tree (I still don't know about the one that is not displayed yet); it is a simple tree that shows the order in which ClustalW carried out pairwise alignments as it built the multiple-sequence alignment. It will show the pairs that are most closely related to each other, but you must use a more powerful tree-generating program to obtain a more rigorous tree.

NOTE: This type of ClustalW working tree file always has a .dnd suffix. For really good phylogenetic trees, do not use .dnd files.

Anyway, we can use this tree just to learn how to print trees once you have a good one from any source (next section). This procedure will work if you have tree data in Newick format, which is true for the tree file provided on this page. Get the file you need to make a tree by going to the top of the page and clicking the orange TREE button. Your browser displays a very small text file, littered with parentheses. Copy and save this file as ClustalwTreeData.txt. This is tree data in Newick format, a widely used format for tree-printing programs. You will use the data in this file to print your first tree.
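Those parentheses are less mysterious than they look: each pair encloses a group of sequences that share a fork. The sketch below reads a Newick string into nested lists so you can see the grouping. It is a bare-bones reader written for this tutorial (the names `parse_newick` and `leaves` are mine), it ignores branch lengths, and the example string uses made-up lengths rather than the contents of your own file.

```python
def parse_newick(text):
    """Parse a Newick string into nested lists of leaf names, ignoring branch lengths."""
    text = "".join(text.split()).rstrip(";")  # drop whitespace and the final ';'
    pos = 0

    def skip_label():
        nonlocal pos
        # Advance past a name and/or ':length' up to the next ',', '(' or ')'.
        while pos < len(text) and text[pos] not in ",()":
            pos += 1

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # consume '('
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # consume ','
                children.append(node())
            pos += 1                      # consume ')'
            skip_label()                  # discard ':length' after the group
            return children
        start = pos
        skip_label()
        return text[start:pos].split(":")[0]  # keep the name, drop ':length'

    return node()

def leaves(tree):
    """List the tip names of a parsed tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]

# Made-up branch lengths, for illustration only.
example = "((Green:0.02,Red:0.02):0.2,(Blue:0.3,Rhodopsin:0.28):0.05,Peropsin:0.4);"
tree = parse_newick(example)
print(tree)
print(leaves(tree))
```

Each inner list is one fork of the tree, and the names inside it are the tips that descend from that fork.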

A convenient tree printer, Phylodendron, is located at http://iubio.bio.indiana.edu/treeapp/treeprint-form.html. When you point your browser to this URL, you find the input form for the phylogenetic tree printer.

Paste the contents of your ClustalwTreeData.txt into the Tree data box near the top of the form. Type a title into the Title box, something like "Opsin Family Tree". To get a tree that looks like mine (below), pick Phenogram from the Tree styles at the top. Then under Extra Options, select Format: GIF image; width and height: 400 pixels; Font: Helvetica; Style: plain; Size: 12. Leave all other settings as you found them, and click Submit.

Your tree should appear in your browser. Save it as OpsinTree.gif. Be sure to remove ".cgi" from the default name, so that your file will be recognizable as a normal GIF file. You can paste these files into documents for reports and publications. Play around with other options at Phylodendron, and see how they affect the tree image.

With the settings given above, my tree looks like this:


In a true phylogenetic tree (this is not), the horizontal dimension is time. The vertical dimension is extent of sequence change. Each tip represents a sequence at the present time. Each fork represents an ancestral sequence, and an event of divergence between two current sequences. The horizontal distance between a fork and the tips of the fork represents the time since divergence, and the vertical distance between tips represents the amount of sequence difference between the tips.

Like this tree, most trees produced by bioinformatics tools are unrooted trees; that is, the tree shows distances, based on sequence differences, between the tips, but it does not attempt to show the tips and branches in order of their appearance in time. Sequence-comparison programs cannot figure out the order or direction of evolution. They can only assess the magnitude of sequence differences. If you know which sequence is the progenitor of all the others (we don't, in this case), you can root the tree with that sequence. The result will be that the first branch will separate that sequence from the others. The tree above happens to be rooted with peropsin, so it shows the first branch as the divergence of peropsin from the progenitor of all the other opsins. More advanced tree-building programs allow you to choose the root sequence for a tree, but remember that sequence information alone will not tell you the root.


Beware!

The conclusions of the previous paragraph are based on examining this printed tree. We will see later that this tree is very similar to a tree made by a more rigorous method. This simply means that this particular tree is an easy one to determine. Most trees are not so easy, and more rigorous methods will give results that are substantially different from ClustalW's little working .dnd file.

Remember also that the truth of any conclusions drawn from a tree depends on the accuracy of the multiple sequence alignment and on the alignment scores. In this tutorial, you are using default settings on many hidden parameters in the processes of comparing and aligning sequences. If you want to draw conclusions about phylogenetic relationships that will hold up to scientific scrutiny, you need to learn much more about the inner workings of alignment tools like Clustalw.

In the next section, you will make this tree two more times, using more rigorous tools for calculating phylogenetic distances.


Bioinformatics Tutorial

Improved Relations

Preview

In this section, you will learn how to use a few tools from the phylogeny-analysis program Phylip to make a phylogenetic tree by a more rigorous method, called neighbor joining.

Neighbor-Joining Trees From ClustalW Multiple-Sequence Alignments

Point your browser to http://bioweb2.pasteur.fr/intro-en.html and click Phylogeny.

This is one home of the program Phylip, one of the most rigorous tools for constructing phylogenetic trees from aligned sequences.

Under Computation of Distance, Phylip, click protdist.

You are about to run protdist, a program that computes the "distance", or the quantitative amount of difference, of protein sequences from each other. These so-called distance matrices will be used by Phylip to construct your tree. The input to protdist is the multiple-sequence alignment you made using Clustalw (file: OpsinMSAEdited.txt).

Enter your email into the top box.

In the alignment file box, paste your edited multiple sequence alignment from ClustalW (OpsinMSAEdited.txt).

Under Bootstrap Options, make these settings:

Check the box for "Perform a bootstrap before analysis"
Enter any odd number for a seed
Enter 100 replicates

Leave other settings as you found them, and click Run.

protdist constructs distance matrices by a process called "bootstrapping". Bootstrapping is a bias-reducing procedure in which protdist builds an alignment of pseudosequences by picking residue positions at random (with replacement, so a position can be picked more than once) and stringing the residues at those positions together until each pseudosequence is the same length as the original ClustalW alignment. From this pseudosequence alignment, protdist determines the relative number of sequence differences among the five proteins, as determined from a random sampling of their sequences. The result of the process is called a distance matrix, and you will see it soon. This process is repeated, 100 times in our case, to make 100 distance matrices. The tree we will ultimately produce represents a consensus of the 100 matrices.
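The resampling step is easier to grasp in a few lines of code than in words. This is a sketch of the idea only, not Phylip's own code; the function name `bootstrap_alignment`, the seed value, and the toy alignment are assumptions made for illustration.

```python
import random

def bootstrap_alignment(aligned, rng):
    """Resample alignment columns with replacement to build one pseudo-alignment."""
    length = len(aligned[0])
    # Pick `length` column positions at random; repeats are allowed.
    picks = [rng.randrange(length) for _ in range(length)]
    # Use the SAME picks for every sequence so that columns stay intact.
    return ["".join(seq[i] for i in picks) for seq in aligned]

toy = ["MKSTW", "MRATW", "MKGAF"]   # made-up three-sequence alignment
rng = random.Random(7)              # fixed seed, so a rerun gives the same replicates
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]

for seq in replicates[0]:
    print(seq)
```

Notice that whole columns are sampled, never single residues, so every column in a pseudo-alignment is a real column from the original alignment. The seed plays the same role as the odd number you typed into the form: it starts the random number generator.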

On the results page, look in the outfile window to see the 100 matrices containing numbers that represent the relative number of differences among the five sequences. Each matrix has the sequence names in the first column, and you should imagine that these sequence names are also the headings for the remaining columns. The number at the intersection of the row Blue and the column with the imaginary heading Peropsin gives the relative magnitude of the sequence differences between the blue opsin and peropsin. The matrices have zeros on the diagonal because each pseudosequence is identical to itself. Click the Save button to save the entire file of 100 matrices. The file is automatically downloaded with the name protdist.outfile.txt. Transfer the file to a convenient place.
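Here is a sketch of how a matrix like these can be computed from an alignment. Be warned that it uses the crudest possible measure, the plain fraction of aligned positions that differ, whereas protdist applies a model of amino-acid substitution, so the numbers will not match protdist's. The function names and toy sequences are made up.

```python
def p_distance(a, b):
    """Fraction of aligned, gap-free positions at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(names, aligned):
    """Return a dict of dicts: matrix[row_name][column_name] -> distance."""
    return {
        n1: {n2: p_distance(s1, s2) for n2, s2 in zip(names, aligned)}
        for n1, s1 in zip(names, aligned)
    }

names = ["Blue", "Green", "Red"]
toy = ["MKSTWD", "MRATWL", "MRATWG"]  # made-up aligned sequences
matrix = distance_matrix(names, toy)

# Print in the same layout as the outfile: names in the first column.
for name in names:
    print(name.ljust(10), " ".join(f"{matrix[name][other]:.3f}" for other in names))
```

As in the protdist output, the diagonal is zero and the matrix is symmetric: the distance from Blue to Green equals the distance from Green to Blue.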

Clicking the Back button of your browser from a results page takes you back to the Phylogeny page. Under Distance Matrix Method programs, Phylip, click neighbor. Read the lists carefully: don't pick "weighbor".

Into the Distance matrix File window, paste the contents of the file protdist.outfile.txt. Under Bootstrap options make these settings:

Check "Analyze multiple data sets (M)"
Enter 100 data sets (using all of the replicates from protdist)
Enter an odd number for a seed
Check "Compute a consensus tree"

Scroll down to Other options.

This entry area gives you the option of designating an outgroup for the root of your tree. An outgroup is the sequence you think is most distant from the others, possibly the common ancestor of all. We don't know that in this case, so leave the default of 1.

At the top of the page, click Run.

On the results page, the Newick file you need to make the tree is neighbor.outtree. Copy and save it as PhylipTreeData.txt.

By scrolling down in the consense.outfile window, you can see the consensus tree, printed in a simple text format. This tree is listed as "unrooted", meaning that we do not know the ancestor of all these sequences. We learn from this tree which sequences are most alike and which are most different. We also learn how often the connections of this tree were made the same way in the 100 trees made from those 100 distance matrices. The numbers on the branches indicate the number of times that the partition of the species into the two sets separated by that branch occurred among the 100 trees. For example, the separation of Red and Green from the other three, indicating that Red and Green are more similar to each other than to the other three, occurred in all 100 trees. The separation of Blue and Peropsin from the other three occurred in only 53 of the 100 trees. In the other 47 trees, Rhodopsin and Peropsin were separated from the other three. (Can you extract this information from this file?) In the tree branching shown, the majority rules, and the results of 47 of the trees are discarded.
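Counting how often a partition occurs can be sketched as follows. This is an illustration of the bookkeeping, not the consense program itself. The trees are written by hand as nested tuples, and the list of 100 trees is contrived so that the counts echo the 53 and 47 in the paragraph above; all function names are mine.

```python
from collections import Counter

def leaf_set(tree):
    """Collect the names at the tips of a tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Return each internal branch as an unordered pair of leaf sets (a partition)."""
    everyone = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        # Skip the whole tree and branches that cut off a single tip.
        if 1 < len(side) < len(everyone) - 1:
            found.add(frozenset([side, everyone - side]))
        for child in node:
            visit(child)

    visit(tree)
    return found

# Two contrived topologies, repeated to mimic 100 bootstrap trees.
tree_a = (("Red", "Green"), ("Blue", "Peropsin"), "Rhodopsin")
tree_b = (("Red", "Green"), ("Rhodopsin", "Peropsin"), "Blue")
trees = [tree_a] * 53 + [tree_b] * 47

counts = Counter(split for tree in trees for split in splits(tree))

def support(group):
    """Number of trees in which `group` is separated from all the other tips."""
    everyone = leaf_set(trees[0])
    group = frozenset(group)
    return counts[frozenset([group, everyone - group])]

print(support({"Red", "Green"}))
print(support({"Blue", "Peropsin"}))
print(support({"Rhodopsin", "Peropsin"}))
```

A partition is stored as a pair of sets with no preferred side, which is what "unrooted" means in practice: separating Red and Green from the other three is the same event as separating the other three from Red and Green.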

Note: Your results may be slightly different from mine. Because of the random choices made in constructing the tree, the percentages in the paragraph above may vary. I have gotten as high as 82% consensus on the separation of Blue and Peropsin from the other three.

Using what you learned in the previous section, go to http://iubio.bio.indiana.edu/treeapp/treeprint-form.html and produce a tree from data in your file PhylipTreeData.txt.

Here is my tree:


Interpreting a tree is not as simple as interpreting the types of trees that you see in textbooks. The Phylip tree appears to say that the divergence of Blue from Rhodopsin came before the divergence of Rhodopsin from Peropsin. But remember that this tree is unrooted; we did not specify which protein we think is the progenitor of the others. The tree-printing program automatically puts a little root on the tree, but that line is not necessarily the beginning or "bottom" of the tree. We can start from any branch and read the tree as if that were the first branching event on the tree. What the tree does tell us is which sequences are the most similar. Clearly, Red and Green are the most similar pair, and Blue is more similar to rhodopsin than it is to peropsin.

Playing With Tree Roots

Next, you will use some of the latest tools to make a multiple-sequence alignment (Tcoffee) and a tree (PhyML). These programs are even more powerful, but with power comes somewhat less transparency, and a cost in speed. The experts say that the results are better, but we pretty much just have to take their word for it. PhyML also uses a bootstrapping approach, but with greater redundancy than Phylip. The really neat thing about PhyML is that it lets you play with the tree in many ways, including changing roots interactively.

For making a multiple-sequence alignment with Tcoffee, you need raw FASTA files. To get them, return to UniProt, and repeat your search for human opsins. Select the four visual opsins plus peropsin, and click Retrieve at the bottom of the page. Click Open under FASTA on the UniProt Jobs page. Select the text that appears. You might want to save it into a text file, but you can just paste it into Tcoffee directly. The file you have here is simply the five opsin sequences, one after another, in FASTA format, which is just what Tcoffee needs.

Point your browser to http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi?stage1=1&daction=TCOFFEE::Regular

Paste your FASTA data into the space provided. Enter your email address. Click Submit. That's all there is to it. After what is usually a short delay, a results page appears. It provides links to your multiple-sequence alignment in several formats. (You might find it interesting to compare the alignment from Tcoffee to the one you got from ClustalW. This is easiest to do with the Tcoffee file clustalw_aln.) The file you want for producing a tree is labeled phylip, which provides the alignment in Phylip format, which is needed for PhyML. Click phylip to see this file, select all the text displayed, and copy it. Paste it into a text file (5Opsins4PhyML.txt).
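In case you wonder what makes a file "Phylip format", here is a minimal sketch that writes one. It assumes the simple sequential layout: a first line giving the number of sequences and the alignment length, then one line per sequence with the name padded to ten characters. The file Tcoffee gives you may be laid out differently in detail (for example, in interleaved blocks), and the function name `to_phylip` and the toy sequences are mine.

```python
def to_phylip(names, aligned):
    """Write aligned sequences in a simple sequential Phylip layout."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must be aligned to the same length")
    # Header line: number of sequences, then number of alignment columns.
    lines = [f" {len(aligned)} {length}"]
    for name, seq in zip(names, aligned):
        # Names occupy exactly ten characters: truncate long ones, pad short ones.
        lines.append(name[:10].ljust(10) + seq)
    return "\n".join(lines) + "\n"

# Made-up aligned fragments; '-' marks a gap.
print(to_phylip(["Blue", "Rhodopsin"], ["MRK-SE", "MNGTEG"]))
```

The ten-character name field is why the short names you chose earlier (Blue, Red, and so on) are convenient: longer names would be cut off.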

Point your browser to http://atgc.lirmm.fr/phyml/

PhyML uses maximum-likelihood methods, which are based on very powerful (but obscure) statistics, to find the tree under which the aligned sequences you observe are most probable, given a model of how sequences change. Maximum-likelihood methods are among the most highly respected means of making decisions when you must navigate a minefield of probability-based choices to arrive at either a single best decision, or a small group of similar good ones (X-ray crystallographers use them to decide which data to use, and which to exclude, when trying to build a model of a protein from diffraction data). As the availability of such methods has grown, so has the number of people for whom they are completely black boxes. When you use a black-box method, you must be careful to compare the results with everything else you know about the subject. A surprising result might be a genuine discovery, or it might be just wrong. It is a result to test further, not to accept blindly.

Now put this black box to work.

In the PhyML form, make these settings:

Sequences: File; then click Choose File, and choose the phylip file you saved from the Tcoffee output.

Data Type: Amino Acids

Sequence File: Interleaved


Number of data sets: 1, also click Perform bootstrap

Number of bootstrap data sets: 100 (do not click Print bootstrap info.)

Bioinformatics Tutorial

Seeking Structure

Preview

In this section, you will learn how to use a FASTA sequence as a search input (query) to the Protein Data Bank, the repository of almost all protein models that have been deduced by X-ray crystallography or NMR. Your search will tell you whether anyone has produced an experimental model of your query protein, or whether models are available for any protein of similar sequence. You will also visualize the model using an online graphics tool. Finally, you will learn how to turn a long list of hits into an interactive Custom Report that makes details of each hit easy to find.

What is the structure of an opsin?

By now, perhaps you are curious about the structure of peropsin, but it's not likely that the structure of a protein of unknown function has been determined. It is likely that all opsins are similar in structure, so you can try to find a model of a similar sequence in the database for macromolecular structures, the Protein Data Bank (PDB). It will give you an idea of what kind of protein molecule an opsin is.

In fact, the PDB does not contain molecular structures at all. It is better to say that it contains models of macromolecules. These models are interpretations of data from one of the two main methods of macromolecular structure determination: X-ray crystallography and NMR spectroscopy. When researchers make a model, or as they commonly say, "determine the structure" of a macromolecule, they deposit a file containing the three-dimensional coordinates of all the atoms in the model. This coordinate file, along with an online molecular graphics tool (like the PDB's Jmol Viewer) or a computer graphics program like DeepView, is all that you need to see and study the model on your computer. Next you will retrieve a model from the PDB and view it with an online graphics tool. You will also visit the home of a topnotch computer graphics program that you can download FREE and use on your home computer.

Point your browser to http://www.rcsb.org/pdb/.

The PDB home page contains a simple search box at the top. You can search for models using simple keywords or PDB ID codes. A PDB code has four characters, like 1CYO. How would you ever know a model by its code? When a new structure is published, the authors usually give the PDB code in the last reference of the bibliography. With that code, you can go straight to the model you want to see. But more often, your question, like ours, is more general. For such cases, PDB also provides forms for more sophisticated searches. For now, let's just see if any opsin models are available. Type "opsin" into the search box, make sure the PDB ID or keyword is selected, and click Site Search.

On 2008/09/22, this search returned only one model, which is quite puzzling, because a search for "rhodopsin" returns 48 models. So it appears that the quicky (quirky?) search tool at the PDB still needs some work. But this shortcoming is a gift for now. You have bagged an experimental model of an opsin; the PDB contains only models derived experimentally—either by x-ray crystallography or NMR spectroscopy. Now take a look at this one.

Click the PDB file code 3CAP above the tiny image of the model.

You have come to the Structure Summary page for this model, which is its home page at the PDB. This page is connected to just about everything you could possibly do with this model. At the PDB, your first goal is always to get to the Structure Summary page for the model you are seeking.

NOTE: Structure Summary does not exactly jump out at you on this page. It's the tab selected over the main part of the entry, and it is a sub-tab of the Structure tab above the left column. Those tabs should be more prominent—they are what distinguishes each of the important pages in the PDB. If you want to know where you are in the PDB, look at the two sets of tabs at the top of the page. The set on the left are main tabs, and the set on the right are sub tabs of the main tabs. Main tabs take you to PDB's major sets of tools, and sub tabs subdivide them. Sub tabs under the Structure tab open LOTS of additional information about the currently chosen model.

In the left column of all PDB pages, you find a set of nested menus (they might vary on different PDB pages). Click Display Molecule to open the PDB display options. If you already own or use one of the listed viewers, like the free program DeepView, you are in business. Click your viewer to download the model and view it in a familiar environment. But first behave as if you are new to all this (perhaps you are), and use a handy viewer that works in your browser.

Click Jmol Viewer. Assuming that your computer has up-to-date Java software, your browser will load the viewer, and it will load the file 3CAP. You should see models of two rhodopsin molecules (with backbones shown as ribbon-like cartoons, one green, one blue) and several ball-and-stick models of smaller molecules. Is rhodopsin a dimer? No, but the crystals of rhodopsin from which this model was derived contained two rhodopsin molecules per asymmetric unit (the smallest portion from which the entire unit cell of the crystal can be constructed). PDB files usually show the full contents of the asymmetric unit. If more than one molecule is present, they are referred to as chains in the model.
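You can see the chains for yourself in the coordinate file, because each atom line carries a one-letter chain identifier. The sketch below reads the fixed-width ATOM lines of a PDB-format file and groups the atoms by chain. The three sample lines are invented (they are not taken from 3CAP), and the function name `atoms_by_chain` is mine; the column positions are those of the standard PDB ATOM record as I understand it.

```python
# Three invented ATOM lines in PDB column layout; not real data from 3CAP.
SAMPLE = """\
ATOM      1  N   MET A   1      27.340  24.430   2.614
ATOM      2  CA  MET A   1      26.266  25.413   2.842
ATOM   2601  N   MET B   1      10.000  12.500  -3.250
"""

def atoms_by_chain(pdb_text):
    """Group ATOM records by chain identifier, reading fixed-width columns."""
    chains = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skip headers, HETATM lines, and everything else
        atom = {
            "name": line[12:16].strip(),      # atom name, e.g. CA
            "residue": line[17:20],           # three-letter residue name
            "number": int(line[22:26]),       # residue number within the chain
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        }
        chains.setdefault(line[21], []).append(atom)  # column 22 holds the chain ID
    return chains

chains = atoms_by_chain(SAMPLE)
for chain_id, atoms in sorted(chains.items()):
    print(chain_id, len(atoms), "atoms")
```

Run on a real file with two molecules in the asymmetric unit, a loop like this would report two chains, typically labeled A and B.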


NOTE ON VIEWERS: The viewer embedded in the viewing frame of this page is the widely used Jmol, which you will find in use as a molecular viewer at many web sites. If you take time to get to know this viewer fairly well, you will get more out of the many sites that use it. Like most of the other viewers listed at PDB, Jmol is quite limited in its capacity for analysis of protein structure.

In my humble opinion, the most powerful protein-analysis tool listed at PDB is DeepView. DeepView may be the only protein-structure viewing and analysis tool you will ever need. You will learn about it if you continue into the homology modeling section, later.

Here are some other things you can do to get to know models in a Jmol frame (to get back to the original rendition, reload the page):

Click/drag (left button if you have more than one) on the image to rotate the structure. You should be able to tell that it has a lot of alpha helix.

Hold down option (for Macintosh; alt for Windows) and click/drag to zoom in (drag towards you) or out (drag away) or to rotate the model in the plane of the screen (drag left or right).

Hold down ctrl and click (or right-click) the image: up pops a set of menus, and if you browse around on them, you'll see that there is much more to Jmol. Try just a couple of things to get some general ideas, as follows.

Using the pop-up menus, Select:Protein:All. EXPLANATION: This means to slide to Select on the main pop-up menu, then on its submenu, slide to Protein, then on its submenu, slide to All. On my Macintosh computer, if I right-click a menu or submenu item, its submenu locks on display, and it's easier to navigate. Nothing appears to happen. You have selected part of the model (the protein part, but not the small molecules). Subsequent commands will change only this aspect of the display.

Color:Structure:Cartoon:By Scheme:Secondary Structure. The cartoons become red (well, bright pink) for alpha helix, and yellow for beta sheet. You probably had not noticed the beta sheet in the models before. Look one of the chains over carefully to get a feeling for its structure. How many helices are present? How many strands of beta sheet? Are the strands parallel or antiparallel?

Do you know how to view stereo pairs? (If not, click HERE to learn how.) Then Style:Stereographic:(choose your favorite mode, cross-eyed or wall-eyed viewing). NOTE: As of 2008/09/22, you get the opposite of what you pick. Despite my attempts to inform the programmers about this, cross-eyed viewing gives you wall-eyed, and vice versa. But anyhow, now you can see the model as a solid object with convincing depth. If you are ever going to do anything serious with protein structure, you'll need to find a way to view models in 3D.

Work in stereo or not, as you prefer. Clear the display: Select:None; then Select:Display Selected Only. The display goes blank; nothing is selected and you are displaying only the selection (very logical!).

Select:Protein:All (means select both backbond and sidechains). Then Style:Scheme:CPK Spacefilling. The protein portion is now show as a

Page 27: tics and Homology Modeling

spacefilling model. In this rendition, you get a good idea of the overall shape of the protein. Unfortunately, the Jmol menu does not allow you to color the two chains separately or get rid of one of them.

Style:Scheme:Wireframe. Now you see all of the protein parts of this model in wireframe. This is not as impressive as some other schemes, but is actually the most useful when you start exploring models in detail, because the wires do not hide each other like ball and sticks or spacefilling models.

To learn more about Jmol, consult the help links at PDB below the display. You can also find extensive help for all viewers listed there. But if you plan serious protein structure work, especially judging model quality and comparing models by superimposing them, get to know DeepView.

Finding Opsin Homologs in the PDB

Next, you will try to find other models in the PDB that are homologous to the human opsins. You will ask the PDB, in effect, to "list all models whose sequences can be aligned with that of human red opsin, in order of sequence similarity." In PDB terminology, the red opsin sequence is the query, and similar models found (hits) are called subjects.

First, open your query file protred.txt (FASTA sequence of red human opsin), and copy the sequence portion only to the clipboard; omit all of the comment line that begins with >.

At the top right of any PDB page, click Search. From the list of search types, click Sequence. On the resulting page, click the button next to use Sequence, and paste your red opsin sequence into the box just below. Not that the search tool is your new friend Blast, and that a E cut-off value of 10 is given as a default. From what you learned earlier, you know that this is not a very restrictive search criterion, so your search should pick up anything remotely similar in sequence to the red human opsin. Click the search button. The search tool is now looking for PDB models whose sequences are similar to the human red opsin sequence. Hits in UniProt are just other proteins, most of whose structures are not known. Hits in the PDB are models, so hits tell you that there are experimental models for one or more proteins that are similar in sequence to your query.

On 2008/09/22, I got 26 subjects, or 26 PDB models whose sequences are homologous to the search sequence. Each is listed with an E-value, which is the probability that the sequence similarity between query and subject is a coincidence. The first result or subject is PDB model 1F88, a model of bovine rhodopsin. The E-value is 6.2 x 10 -74 . In other words, while the probability that a coin flip and your call will agree just by chance is 0.5, the probability that the similarity between human red opsin and bovine rhodopsin is just a chance occurence is

0.000000000000000000000000000000000000000000000000000000000000000000000000062,

Page 28: tics and Homology Modeling

which means, to any sane biologist, that these two molecules descended from a common ancestor. There is no chance that, in the history of the universe, two proteins could arrive at sequences this similar by chance. This also means that the structure of the bovine rhodopsin is a sure bet to be very similar to that of the human red opsin, whose structure is unknown (if if were known, this search would have found it).

Now look down the list of the models you found. Most are models of the same substance: bovine rhodopsin (lumirhodopsin, bathorhodopsin, and some others are altered forms that represent rhodopsin in different stages of the visual cycle, but notice that all of these come from Bos taurus, from which the good old barnyard cow got the name Bossy. A few hits are the recently published beta-2-adrenergic receptor, the first G protein coupled receptor model besides rhodopsin. Perhaps by the time you take this tutorial, there will be more.

Use the results page to answer these questions about the comparison between human red opsin and the bovine rhodopsin in PDB 1F88:

1. How many corresponding residues, and what percent of the residues, do the two proteins have in common (exact matches)?

2. How many and what percent of corresponding residues are similar in chemical properties?

3. How many gaps did the alignment program introduce, and how many residues in each gap, to get best alignment between human red opsin and 1F88?

4. Find the longest string of exact matches between the two proteins. How many matches does it contain, and what are the beginning and ending residue numbers?

Reports: Simplifying a Search Through Many Hits

Results pages are difficult to deal with if you want to look around on a long (anything more than 10) list of subjects (hits). To make a display that is easier to navigate, in the left column, click Tabulate, and then Custom Report. You can use this Custom Tabular Report form to generate a list of your subject that includes any features of interest. For now, you will generate a very simple list, but you will quickly see its power.

On the form, click to put checkmarks in these boxes: Descriptor (under Structure Summary), and Source (under Biological Details). Then click Create Report at the bottom of the form.

The custom report appears, with three columns, PDB ID code, model descriptor, and biological source of the protein. The form contains many clickable items. Clicking an ID code takes you to the Structure Summary page for that model. Clicking a column heading sorts the list on that heading. Try this by clicking Source above the third column. Then look down the Source column. This makes it easy to find the non-Bos taurus entries, which include that adrenergic receptor. Anything else?

Page 29: tics and Homology Modeling

Now you know how to search the PDB for models whose sequences are similar to a target or query sequence. Structural biologists use such searches when they have a new protein sequence and want to know its structure. If the structure is known, this search would find it, so if you are interested in the structure of a particular gene product, search PDB with its sequence to see if the structure is already known. If not, any hits with high sequence similarity can tell you the overall fold of the protein. You also got a glimpse of the Custom Report tool, which can make it easy for you to organize and peruse a large number of hits from any search.

Next, how to obtain a model if no experimental model is known.

NEXT

Bioinformatics Tutorial

Seeking Structure

Preview

In this section, you will learn how to use a FASTA sequence as a search input (query) to the Protein Data Bank, the repository of almost all protein models that have been deduced by X-ray crystallography or NMR. Your search will tell you whether anyone has produced an experimental model of your query protein, or whether models are available for any protein of similar sequence. You will also visualize the model using an online graphics tool. Finally, you will learn how to turn a long list of hits into an interactive Custom Report that makes details of each hit easy to find.

What is the structure of an opsin?

By now, perhaps you are curious about the structure of peropsin, but it's not likely that the structure of a protein of unknown function has been determined. It is likely that all opsins are similar in structure, so you can try to find a model of a similar sequence in the database for macromolecular structures, the Protein Data Bank (PDB). It will give you an idea of what kind of protein molecule an opsin is.

In fact, the PDB does not contain molecular structures at all. It is better to say that it contains models of macromolecules. These models are interpretations of data from one of the two main methods of macromolecular structure determination: X-ray crystallography and NMR spectroscopy. When researchers make a model, or as they commonly say, "determine the structure" of a macromolecule, they deposit a file containing the three-dimensional coordinates of all the atoms in the model. This coordinate file, along with an online molecular graphics tool (like the PDB's Jmol Viewer) or a computer graphics program like DeepView, is all that you need to see and study the model on your computer. Next you will retrieve a model from the PDB and view it with an online graphics tool. You will also visit the home of a topnotch computer graphics program that you can download FREE and use on your home computer.
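To make "a file containing the three-dimensional coordinates of all the atoms" concrete, here is a minimal Python sketch that reads atom records in the fixed-column layout of a classic PDB-format file. The three sample lines are made up for illustration (they are not coordinates from any real entry), and the function name parse_atom is my own invention.

```python
from collections import Counter

# Three made-up ATOM records in the fixed-column PDB layout.
# Illustrative only: these are NOT real coordinates from any PDB entry.
SAMPLE_LINES = [
    "ATOM      1  N   MET A   1      10.000  20.000  30.000",
    "ATOM      2  CA  MET A   1      11.500  20.500  30.250",
    "ATOM      3  N   MET B   1      40.000  50.000  60.000",
]


def parse_atom(line):
    """Pull the main fields out of one ATOM/HETATM record by column position."""
    return {
        "atom": line[12:16].strip(),     # atom name, columns 13-16
        "residue": line[17:20].strip(),  # residue name, columns 18-20
        "chain": line[21],               # chain identifier, column 22
        "resnum": int(line[22:26]),      # residue number, columns 23-26
        "x": float(line[30:38]),         # coordinates, in angstroms
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


# Keep only coordinate records, then count atoms in each chain.
atoms = [parse_atom(l) for l in SAMPLE_LINES if l.startswith(("ATOM", "HETATM"))]
atoms_per_chain = Counter(a["chain"] for a in atoms)
print(dict(atoms_per_chain))
```

A graphics program does essentially this for every atom in the file, and then draws the result. The chain column is what lets a viewer tell apart the separate molecules in a model.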

Point your browser to http://www.rcsb.org/pdb/.

The PDB home page contains a simple search box at the top. You can search for models using simple keywords or PDB ID codes. A PDB code has four characters, like 1CYO. How would you ever know a model by its code? When a new structure is published, the authors usually give the PDB code in the last reference of the bibliography. With that code, you can go straight to the model you want to see. But more often, your question, like ours, is more general. For such cases, PDB also provides forms for more sophisticated searches. For now, let's just see if any opsin models are available. Type "opsin" into the search box, make sure that PDB ID or keyword is selected, and click Site Search.
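If you ever need to decide whether a piece of text is a PDB code or an ordinary keyword, the four-character rule is easy to check. The sketch below assumes the classic ID form, a digit from 1 to 9 followed by three letters or digits; the function name looks_like_pdb_id is hypothetical.

```python
import re

# Classic PDB IDs: one digit (1-9) followed by three letters or digits.
PDB_ID_PATTERN = re.compile(r"[1-9][A-Za-z0-9]{3}")


def looks_like_pdb_id(text):
    """Return True if text has the shape of a classic four-character PDB ID."""
    return PDB_ID_PATTERN.fullmatch(text.strip()) is not None


for candidate in ("1CYO", "3CAP", "opsin"):
    print(candidate, looks_like_pdb_id(candidate))
```

A match only means the text has the right shape, not that an entry with that code exists.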

On 2008/09/22, this search returned only one model, which is quite puzzling, because a search for "rhodopsin" returns 48 models. So it appears that the quicky (quirky?) search tool at the PDB still needs some work. But this shortcoming is a gift for now. You have bagged an experimental model of an opsin; the PDB contains only models derived experimentally, either by X-ray crystallography or NMR spectroscopy. Now take a look at this one.

Click the PDB file code 3CAP above the tiny image of the model.

You have come to the Structure Summary page for this model, which is its home page at the PDB. This page is connected to just about everything you could possibly do with this model. At the PDB, your first goal is always to get to the Structure Summary page for the model you are seeking.

NOTE: Structure Summary does not exactly jump out at you on this page. It's the tab selected over the main part of the entry, and it is a sub-tab of the Structure tab above the left column. Those tabs should be more prominent—they are what distinguishes each of the important pages in the PDB. If you want to know where you are in the PDB, look at the two sets of tabs at the top of the page. The set on the left are main tabs, and the set on the right are sub tabs of the main tabs. Main tabs take you to PDB's major sets of tools, and sub tabs subdivide them. Sub tabs under the Structure tab open LOTS of additional information about the currently chosen model.

In the left column of all PDB pages, you find a set of nested menus (they might vary on different PDB pages). Click Display Molecule to open the PDB display options. If you already own or use one of the listed viewers, like the free program DeepView, you are in business. Click your viewer to download the model and view it in a familiar environment. But first behave as if you are new to all this (perhaps you are), and use a handy viewer that works in your browser.

Click Jmol Viewer. Assuming that your computer has up-to-date Java software, your browser will load the viewer, and it will load the file 3CAP. You should see models of two rhodopsin molecules (with backbones shown as ribbon-like cartoons, one green, one blue) and several ball-and-stick models of smaller molecules. Is rhodopsin a dimer? No, but the crystals of rhodopsin from which this model was derived contained two rhodopsin molecules per asymmetric unit (the smallest portion from which the entire unit cell of the crystal can be constructed). PDB files usually show the full contents of the asymmetric unit. If more than one molecule is present, they are referred to as chains in the model.

NOTE ON VIEWERS: The viewer embedded in the viewing frame of this page is the widely used Jmol, which you will find in use as a molecular viewer at many web sites. If you take time to get to know this viewer fairly well, you will get more out of the many sites that use it. Like most of the other viewers listed at PDB, Jmol is quite limited in its capacity for analysis of protein structure.

In my humble opinion, the most powerful protein-analysis tool listed at PDB is DeepView. DeepView may be the only protein-structure viewing and analysis tool you will ever need. You will learn about it if you continue into the homology modeling section, later.

Here are some other things you can do to get to know models in a Jmol frame (to get back to the original rendition, reload the page):

Click/drag (left button if you have more than one) on the image to rotate the structure. You should be able to tell that it has a lot of alpha helix.

Hold down option (for Macintosh; alt for Windows) and click/drag to zoom in (drag towards you) or out (drag away), or to rotate the model in the plane of the screen (drag left or right).

Hold down ctrl and click (or right-click) the image: up pops a set of menus, and if you browse around on them, you'll see that there is much more to Jmol. Try just a couple of things to get some general ideas, as follows.

Using the pop-up menus, Select:Protein:All. EXPLANATION: This means to slide to Select on the main pop-up menu, then on its submenu, slide to Protein, then on its submenu, slide to All. On my Macintosh computer, if I right-click a menu or submenu item, its submenu locks on display, and it's easier to navigate. Nothing appears to happen, but you have selected part of the model (the protein part, but not the small molecules). Subsequent commands will change only the selected part of the display.

Color:Structure:Cartoon:By Scheme:Secondary Structure. The cartoons become red (well, bright pink) for alpha helix, and yellow for beta sheet. You probably had not noticed the beta sheet in the models before. Look over one of the chains carefully to get a feeling for its structure. How many helices are present? How many strands of beta sheet? Are the strands parallel or antiparallel?

Do you know how to view stereo pairs? (If not, click HERE to learn how.) Then Style:Stereographic:(choose your favorite mode, cross-eyed or wall-eyed viewing). NOTE: As of 2008/09/22, you get the opposite of what you pick. Despite my attempts to inform the programmers about this, cross-eyed viewing gives you wall-eyed, and vice versa. But anyhow, now you can see the model as a solid object with convincing depth. If you are ever going to do anything serious with protein structure, you'll need to find a way to view models in 3D.

Work in stereo or not, as you prefer. Clear the display: Select:None; then Select:Display Selected Only. The display goes blank; nothing is selected and you are displaying only the selection (very logical!).

Select:Protein:All (means select both backbone and sidechains). Then Style:Scheme:CPK Spacefilling. The protein portion is now shown as a spacefilling model. In this rendition, you get a good idea of the overall shape of the protein. Unfortunately, the Jmol menu does not allow you to color the two chains separately or get rid of one of them.

Style:Scheme:Wireframe. Now you see all of the protein parts of this model in wireframe. This is not as impressive as some other schemes, but is actually the most useful when you start exploring models in detail, because the wires do not hide each other the way the atoms in ball-and-stick or spacefilling models do.

To learn more about Jmol, consult the help links at PDB below the display. You can also find extensive help for all viewers listed there. But if you plan serious protein structure work, especially judging model quality and comparing models by superimposing them, get to know DeepView.

Finding Opsin Homologs in the PDB

Next, you will try to find other models in the PDB that are homologous to the human opsins. You will ask the PDB, in effect, to "list all models whose sequences can be aligned with that of human red opsin, in order of sequence similarity." In PDB terminology, the red opsin sequence is the query, and similar models found (hits) are called subjects.

First, open your query file protred.txt (FASTA sequence of red human opsin), and copy the sequence portion only to the clipboard; omit all of the comment line that begins with >.
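If you would rather let the computer do the trimming, here is a minimal Python sketch that drops the comment line and joins the remaining lines into one sequence string. The function name sequence_only is my own, and the record in the example is a made-up placeholder (just the 20 amino-acid letters), not the red opsin sequence.

```python
def sequence_only(fasta_text):
    """Return the residues of the first FASTA record, without the '>' line."""
    residues = []
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seen_header or residues:
                break  # a second record starts here; stop reading
            seen_header = True
            continue  # omit the comment line itself
        residues.append(line)
    return "".join(residues)


# Made-up placeholder record (the 20 amino-acid letters), not a real protein.
example = ">placeholder_record hypothetical comment line\nACDEFGHIKL\nMNPQRSTVWY\n"
print(sequence_only(example))
```

To use it on your own file, read the text first, for example with open("protred.txt").read(), and pass that text to the function.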

At the top right of any PDB page, click Search. From the list of search types, click Sequence. On the resulting page, click the button next to use Sequence, and paste your red opsin sequence into the box just below. Note that the search tool is your new friend BLAST, and that an E cut-off value of 10 is given as a default. From what you learned earlier, you know that this is not a very restrictive search criterion, so your search should pick up anything remotely similar in sequence to the red human opsin. Click the search button. The search tool is now looking for PDB models whose sequences are similar to the human red opsin sequence. Hits in UniProt are just other proteins, most of whose structures are not known. Hits in the PDB are models, so hits tell you that there are experimental models for one or more proteins that are similar in sequence to your query.

On 2008/09/22, I got 26 subjects, or 26 PDB models whose sequences are homologous to the search sequence. Each is listed with an E-value, which is the number of hits with a score this good that you would expect to find purely by chance in a database of this size. For very small E-values, this number is practically equal to the probability that the sequence similarity between query and subject is a coincidence. The first result or subject is PDB model 1F88, a model of bovine rhodopsin. The E-value is 6.2 x 10^-74. In other words, while the probability that a coin flip and your call will agree just by chance is 0.5, the probability that the similarity between human red opsin and bovine rhodopsin is just a chance occurrence is

0.000000000000000000000000000000000000000000000000000000000000000000000000062,

which means, to any sane biologist, that these two molecules descended from a common ancestor. There is essentially no chance that, in the history of the universe, two proteins could arrive at sequences this similar by chance. This also means that the structure of the bovine rhodopsin is a sure bet to be very similar to that of the human red opsin, whose structure is unknown (if it were known, this search would have found it).
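Strictly speaking, an E-value is an expected count of chance hits, not a probability. If you assume that chance hits follow a Poisson distribution (the usual assumption behind BLAST statistics), the probability of at least one chance hit is P = 1 - exp(-E). The short Python sketch below, with a function name of my own choosing, shows that P and E are practically identical when E is tiny and part ways as E approaches the default cut-off of 10.

```python
import math


def chance_hit_probability(e_value):
    """P(at least one chance hit) = 1 - exp(-E), assuming Poisson-distributed hits."""
    # expm1 keeps full precision when E is extremely small.
    return -math.expm1(-e_value)


for e in (6.2e-74, 0.01, 1.0, 10.0):
    print(f"E = {e:g}  ->  P = {chance_hit_probability(e):.3g}")
```

This is why an E cut-off of 10 is so permissive: at that level, at least one hit that good is almost certain to turn up by chance alone.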

Now look down the list of the models you found. Most are models of the same substance, bovine rhodopsin. (Lumirhodopsin, bathorhodopsin, and some others are altered forms that represent rhodopsin in different stages of the visual cycle.) Notice that all of these come from Bos taurus, from which the good old barnyard cow got the name Bossy. A few hits are the recently published beta-2-adrenergic receptor, the first G protein-coupled receptor model besides rhodopsin. Perhaps by the time you take this tutorial, there will be more.

Use the results page to answer these questions about the comparison between human red opsin and the bovine rhodopsin in PDB 1F88:

1. How many corresponding residues, and what percent of the residues, do the two proteins have in common (exact matches)?

2. How many and what percent of corresponding residues are similar in chemical properties?

3. How many gaps did the alignment program introduce, and how many residues in each gap, to get best alignment between human red opsin and 1F88?

4. Find the longest string of exact matches between the two proteins. How many matches does it contain, and what are the beginning and ending residue numbers?
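Once you have answered these questions by eye, you can check your counting with a short Python sketch like the one below. It assumes you have copied the two aligned lines (query and subject, with "-" marking gaps) out of the results page as plain strings. The function name and the toy alignment are made up for illustration. Question 2 is left out, because counting chemically similar residues requires a substitution matrix.

```python
def alignment_stats(query, subject):
    """Count identities, gap runs, and the longest identical run in an alignment.

    Both strings must be the same (nonzero) length, with '-' marking gaps.
    """
    if not query or len(query) != len(subject):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identities = 0
    gap_lengths = []      # one entry per run of consecutive gap columns
    longest = current = 0
    in_gap = False
    for q, s in zip(query, subject):
        is_gap = q == "-" or s == "-"
        if is_gap:
            if in_gap:
                gap_lengths[-1] += 1   # extend the current gap run
            else:
                gap_lengths.append(1)  # start a new gap run
        in_gap = is_gap
        if not is_gap and q == s:
            identities += 1
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a mismatch or gap ends the run of exact matches
    return {
        "identities": identities,
        "percent_identity": 100.0 * identities / len(query),
        "gap_lengths": gap_lengths,
        "longest_identical_run": longest,
    }


# Toy alignment, made up for illustration (not opsin sequences).
stats = alignment_stats("MKT-AYIAKQ", "MKTWAYLAKQ")
print(stats)
```

The percentage here is taken over the full alignment length, gap columns included.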

Reports: Simplifying a Search Through Many Hits

Results pages are difficult to deal with if you want to look around on a long (anything more than 10) list of subjects (hits). To make a display that is easier to navigate, in the left column, click Tabulate, and then Custom Report. You can use this Custom Tabular Report form to generate a list of your subjects that includes any features of interest. For now, you will generate a very simple list, but you will quickly see its power.

On the form, click to put checkmarks in these boxes: Descriptor (under Structure Summary), and Source (under Biological Details). Then click Create Report at the bottom of the form.

The custom report appears, with three columns: PDB ID code, model descriptor, and biological source of the protein. The report contains many clickable items. Clicking an ID code takes you to the Structure Summary page for that model. Clicking a column heading sorts the list on that heading. Try this by clicking Source above the third column. Then look down the Source column. This makes it easy to find the non-Bos taurus entries, which include that adrenergic receptor. Anything else?
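If you save a report like this and want to repeat the sort-and-scan step on your own computer, the idea takes only a few lines of Python. In this sketch, the only row taken from this tutorial is 1F88; the rows with IDs HYP1 and HYP2 are hypothetical placeholders, and the column names are my own.

```python
# A tiny stand-in for a custom report. Only 1F88 comes from the tutorial;
# HYP1 and HYP2 are hypothetical placeholder rows, not real PDB entries.
report = [
    {"pdb_id": "1F88", "descriptor": "rhodopsin", "source": "Bos taurus"},
    {"pdb_id": "HYP1", "descriptor": "placeholder receptor", "source": "Homo sapiens"},
    {"pdb_id": "HYP2", "descriptor": "placeholder rhodopsin", "source": "Bos taurus"},
]

# Sort on the Source column (ties broken by ID), like clicking the heading.
by_source = sorted(report, key=lambda row: (row["source"], row["pdb_id"]))

# Pick out every entry that does not come from Bos taurus.
non_bovine = [row["pdb_id"] for row in report if row["source"] != "Bos taurus"]

for row in by_source:
    print(row["pdb_id"], row["source"], row["descriptor"], sep="\t")
print("Not from Bos taurus:", non_bovine)
```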

Now you know how to search the PDB for models whose sequences are similar to a target or query sequence. Structural biologists use such searches when they have a new protein sequence and want to know its structure. If the structure is known, this search would find it, so if you are interested in the structure of a particular gene product, search PDB with its sequence to see if the structure is already known. If not, any hits with high sequence similarity can tell you the overall fold of the protein. You also got a glimpse of the Custom Report tool, which can make it easy for you to organize and peruse a large number of hits from any search.

Next, how to obtain a model if no experimental model is known.

Bioinformatics Tutorial

Summary

You have used these categories of tools in this tutorial:

1. Databases like GenBank, UniProt, and PDB store sequence and structural data in the form of entries (each with a unique code) that correspond to a single gene or its protein product. The databases provide extensive information about each entry, ranging from brief pop-up information, to links that submit the entry to various search and analysis tools (below), to encyclopedias of information about the entry, or to the results of automated searches in PubMed for publications related to the entry. Databases also provide sequences in formats (like FASTA) that serve as search queries in the same or other databases.

2. Search Tools can be integral parts of databases, or stand-alone programs. Integral search tools allow you to search with keywords, with FASTA sequences, or with entry numbers from other databases. Stand-alone search tools like BLAST allow you to find sequences (hits) similar to sequences of interest to you (queries).

3. Analysis Tools (example: PROSITE) use single sequences to determine properties or identify functions of genes and their products. Sequence comparison tools like ClustalW and T-Coffee perform multiple sequence alignments and produce phylogenetic trees, showing vividly how genes are related to each other. Consensus tree-building tools like Phylip and PhyML build trees based on many iterations of random sampling and alignment of the sequences being compared, thus reducing the possibility of bias from a single sequence alignment. Phylodendron lets you print trees to your liking, using tree data in Newick format from any tree-building tool.

4. Modeling Tools like SWISS-MODEL provide, or assist you in building, homology models of proteins of unknown structure. The modeling program DeepView (also known as Swiss-PdbViewer) helps you to build homology models, as well as to study and judge the quality of all types of models (homology, X-ray, NMR). DeepView and SWISS-MODEL are integrated, so you can move back and forth between them at any point in a modeling project.

More Than Meets The Eye

All of the tools you have used here are much more complex and powerful, and require more judgement to use properly, than you might think from your use of them so far. You have only scratched their surfaces. For example, programs like BLAST and ClustalW have many settings that allow the user to control many aspects of the analysis. When you click a link to ClustalW and get a multiple-sequence alignment with no fuss, you have used default settings that might not be the best for your task. For serious scientific work, you need to visit sites that provide full implementations of search, alignment, and analysis tools, giving you full control of the task, but also requiring deeper understanding of the kind of analysis you are doing. This kind of knowledge is crucial to judging the quality of your results (an aspect in which this tutorial is very weak).

To learn more about specific tools, go directly to any network service, such as ExPASy or NCBI, that provides the tool you want to use. First, you will find links to extensive user manuals that tell how the analysis tools work. You might also find lists of frequently asked questions (FAQs) about the tool. Finally, you will find a direct link to a form for running the tool, in which you can make all settings, put in a query, and run the tool. Only trouble is, as a beginner, you often do not know what settings to put in.

In my opinion, the best services for beginners are those that provide settings in pull-down menus that show you all of the allowed settings. As an example, go to EMBL-EBI, another great online service, and click Sequence Similarity and Analysis. In the left-hand column, under Sequence Analysis, click ClustalW2. The resulting form shows all of the ClustalW settings in the form of pull-down menus, so you don't have to know the possible settings and type them in. All allowed settings are displayed in the menus, so you can't go wrong. The settings shown when you arrive (called the defaults) are probably the same settings applied to your analysis when you clicked the quick link from your table of opsin entries at UniProt to get your ClustalW multiple-sequence analysis. In fact, if you go back to that page, you will see that the box at the top contains all FASTA files in sequence. If you want to see how other settings affect the analysis, you can paste this set of files, as one block of text, into the EMBL-EBI ClustalW form, play with settings, and get multiple-sequence analyses to your heart's delight. This is a great way to learn more about a tool that you want to use wisely. EMBL-EBI provides most of the common bioinformatics tools in this beginner-friendly kind of environment.

Where Do You Go From Here?

Now you have had a very basic introduction to bioinformatics. With the tools you've tried out, you can explore the vast stores of genetic and structural information available on the Internet. Every page you have visited has many more links to other tools. You can figure out a lot just by visiting them and playing around, and there is usually plenty of built-in help. I hope this tutorial spurs you to learn more about how to use bioinformatics in your work.

For a more rigorous and systematic, yet readable and clear, survey of the full range of bioinformatics, get the latest edition of Bioinformatics for Dummies, by Claverie and Notredame, Wiley Publishing, Inc. It will help you learn to use the tools wisely, and to judge the reliability of your results. I recently bought the 2007 edition, and I have learned a lot of cool new stuff. The new edition helped immensely in updating this tutorial. It's the best thing I know of to take you further.

Bioinformatics Tutorial

Test Your New Skills

Here is a problem you should be able to solve using what you learned in this tutorial.

Humans cannot synthesize vitamin C (ascorbate), and so we must obtain it from our diet. Many mammals, including mice, can make ascorbate. In the time since our line diverged from that of rodents, we have lost one enzyme, gulonolactone oxidase, the final enzyme in the pathway of ascorbate synthesis (if interested, read more).

This means that humans have an evolutionary ancestor that possessed a functional gulonolactone oxidase gene. It stands to reason that humans should possess a nonfunctional remnant of that gene (called a pseudogene).

Can you find a remnant of the gulonolactone oxidase gene in the human genome?

Happy hunting!