using blast, fasta and hybridization theory to select c. elegans...



Page 1: Using blast, fasta and hybridization theory to select C. elegans ...summit.sfu.ca/system/files/iritems1/7368/b18736518.pdf · using bl- fasta and hybridization theory to select c

USING blast, fasta AND HYBRIDIZATION THEORY TO SELECT C.

elegans GENOMIC DNA SEQUENCE FROM DATABASES THAT

WOULD HYBRIDIZE WITH OPSIN cDNA PROBES

Ping Feng

B. Sc., Fudan University (China), 1983

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in the Department

of

Biological Sciences

© Ping Feng 1997

SIMON FRASER UNIVERSITY

August, 1997

All rights reserved. This work may not be reproduced in whole or in part, by photocopy

or other means, without permission of the author.


National Library of Canada (Bibliothèque nationale du Canada)

Acquisitions and Bibliographic Services

395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.


APPROVAL

Name: Ping Feng

Degree: MASTER OF SCIENCE

Title of Thesis: USING BLAST, FASTA AND HYBRIDIZATION THEORY TO SELECT C. ELEGANS GENOMIC DNA SEQUENCES FROM DATABASES THAT WOULD HYBRIDIZE WITH OPSIN cDNA PROBES.

Examining Committee:

Chair: Dr. M. Moore, Associate Professor

Dr. A. H. J. Burr, Associate Professor, Senior Supervisor, Department of Biological Sciences, S.F.U.

Dr. D. L. Baillie, Professor, Department of Biological Sciences, S.F.U.

Dr. C. M. Boone, Assistant Professor, Department of Biological Sciences, S.F.U.

Dr. B. P. Brandhorst, Professor, Department of Biological Sciences, S.F.U., Public Examiner

Date Approved:


ABSTRACT

To search for opsin homologues, five DNA fragments had been identified by Southern hybridization of a C. elegans genomic library with Drosophila Rh1 opsin cDNA probes. However, the nearly completed C. elegans sequence database and the availability of computer sequence similarity search tools now provide an opportunity to identify such genomic sequences using a computer. The fasta options were modified so that its selection criteria and scoring more closely resemble the hybridization process: the gap penalties were maximized to disallow gaps, and the matching score for G and C was increased to represent the higher stability of G-C base pairing compared with A-T.

Theory predicts that hybridization will be triggered by several short regions with > 70 % identity between two DNA strands, and that for hybrid stabilization under low stringency a sufficiently long region (set as > 100 bp) with > 45 % identity is necessary. Cosmid sequences were analyzed which either lay in the vicinity of the physical map location of one of the five hybridizing DNA inserts, or had the highest similarity scores in a blastn search of the C. elegans genome. Using a modified lfasta, these 14 cosmids were queried with the two probes: the 1.5 kb whole cDNA and its 0.8 kb 3' segment. Then the hybridization criteria were applied to predict which alignments would be likely to hybridize.

This analysis predicted correctly that sequences within five of these 14 cosmids should hybridize with both probes, and that six others should not. For reasons that remain unclear, sequences within the remaining three cosmids, predicted to hybridize with the 1.5 kb probe, were not among the inserts selected by Southern hybridization. None of the proteins encoded by the eight sequences predicted to hybridize were similar to opsin. For comparison, the 14 proteins most similar to Rh1 opsin were selected from a fasta search of the C. elegans protein database. They were all putative G-protein coupled receptors, but not opsin homologues. To date, no opsin homologues have been identified among all predicted C. elegans proteins.
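The two hybridization criteria stated in this abstract can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the modified fasta/lfasta procedure itself. The 70 %, 45 % and 100 bp thresholds come from the abstract. The length of a "short region" (20 bp), the reading of "several" as at least two, the numeric match scores, and all function names are assumptions introduced here. The two sequences are assumed to be already aligned without gaps, of equal length, and written in upper-case A, C, G, T.

```python
# Thresholds from the abstract.
NUCLEATION_IDENTITY = 70.0   # percent identity needed in short regions
STABLE_IDENTITY = 45.0       # percent identity needed in the long region
STABLE_LENGTH = 101          # bp, shortest region longer than 100 bp

# ASSUMPTIONS made only for this sketch.
NUCLEATION_LENGTH = 20       # bp, assumed length of a "short region"
MIN_NUCLEATION_SITES = 2     # assumed reading of "several"
MATCH_SCORE = {"A": 2, "T": 2, "G": 3, "C": 3}  # hypothetical GC-weighted scores
MISMATCH_SCORE = -1                              # hypothetical


def percent_identity(a, b):
    """Percent of identical positions in two equal-length, ungapped strings."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def ungapped_score(a, b):
    """GC-weighted score of an ungapped alignment; gaps are never introduced."""
    return sum(MATCH_SCORE[x] if x == y else MISMATCH_SCORE for x, y in zip(a, b))


def count_windows(a, b, length, min_identity):
    """Count non-overlapping windows whose identity exceeds min_identity."""
    count, start = 0, 0
    while start + length <= len(a):
        window_a = a[start:start + length]
        window_b = b[start:start + length]
        if percent_identity(window_a, window_b) > min_identity:
            count += 1
            start += length   # jump past this window so windows never overlap
        else:
            start += 1
    return count


def predicted_to_hybridize(a, b):
    """Apply both criteria: several nucleation sites plus one long stable region."""
    nucleation = count_windows(a, b, NUCLEATION_LENGTH, NUCLEATION_IDENTITY)
    stable = count_windows(a, b, STABLE_LENGTH, STABLE_IDENTITY)
    return nucleation >= MIN_NUCLEATION_SITES and stable >= 1
```

Disallowing gaps keeps each window a direct position-by-position comparison, which is why a plain sliding window is enough here; a real search would first have to locate the best ungapped placement of the probe along each cosmid.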


ACKNOWLEDGMENTS

I wish to send my special thanks to my senior supervisor, Dr. A. H. Jay Burr, for his advice and support for my work for this thesis. His suggestions were very valuable from the original idea to the final polish.

I really appreciate the help from Dr. D. L. Baillie and Dr. C. M. Boone, my supervisory committee members. Their advice and suggestions were very important for the work of this thesis. I have learnt a great deal about the computer tools and databases used in this thesis from Dr. Baillie and his graduate student, J. Bryer.

I also wish to thank Dr. B. P. Brandhorst. As a public examiner, Dr. Brandhorst suggested some important changes to my thesis.


Table of Contents

Approval
Abstract
Acknowledgments
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
1.1 The nematode Caenorhabditis elegans and its light sensitive behaviors
1.2 Rhodopsin and G-protein coupled receptors
1.3 C. elegans genome contains some sequences which hybridize with Drosophila melanogaster rhodopsin cDNA
1.4 Some criteria of nucleic acid hybridization
1.5 The possibility of using computer tools to search for and analyze opsin-related (or opsin-like) genes in the C. elegans genome

Chapter 2: Confirmation that C. elegans genomic DNA fragments hybridize with D. melanogaster Rh1 cDNA
2.1 Choice of the probe
2.2 Southern hybridization
2.3 Result and discussion

Chapter 3: Searching currently sequenced regions of the C. elegans genome for sequences similar to D. melanogaster Rh1 probes
3.1 The basic features of computer searching systems
3.2 The blast system
3.3 The fasta system
3.4 Results of blastn searches of ncbi and ACeDB databases
3.5 Result of fasta searches of the EMBL invertebrate genome database
3.6 Cosmid sequences selection for further local similarity comparisons

Chapter 4: Local similarity between selected C. elegans cosmid sequences and the D. melanogaster opsin Rh1 cDNA
4.1 Identification of sequenced cosmids that may contain the five cloned inserts
4.2 Results of the fasta scans and the lfasta comparisons
4.3 Analysis of the selected lfasta alignments
4.4 Conclusions

Chapter 5: Analysis of protein sequences and structure of opsin-related genes of C. elegans
5.1 Analysis of protein sequences encoded by the cosmids selected for similarity to the Rh1 probe nucleotide sequence
5.2 Searching for the most opsin-like protein sequences of C. elegans
5.3 Conclusions

Chapter 6: General Discussion

References


List of Tables

Table 3-1: The comparison of standard and modified fasta DNA score matrixes
Table 3-2: Characterization of sequence similarity by fasta (Four steps)
Table 3-3: Selected C. elegans cosmid sequences with high similarity to Rh1 cDNA
Table 4-1: The C. elegans cosmids selected for the local similarity comparison
Table 5-1: Proteins coded by the selected C. elegans cosmid sequences
Table 5-2: Proteins of C. elegans most similar in amino acid sequence to the Rh1 probe


List of Figures

Figure 1-1: Structural and functional domains of G-protein coupled receptors
Figure 1-2: Southern blots of C. elegans genomic clones (S401 - S405) hybridized with Drosophila opsin cDNA probes
Figure 1-3: Tm and the two-step processes of hybridization and hybrid dissociation
Figure 2-1: D. melanogaster Rh1 opsin cDNA sequence and its deduced amino acid sequence (rhodopsin ninaE) compared with bovine rhodopsin amino acid sequence
Figure 2-2: Hybridization of nematode (C. elegans and Mermis nigrescens) genomic DNA digested by EcoR I with the 1.5 kb 32P-labeled D. melanogaster Rh1 opsin cDNA probe
Figure 2-3: Hybridization of phage inserts, S401 - S405, digested by EcoR I / Hind III with the 1.5 kb 32P-labeled D. melanogaster Rh1 opsin cDNA probe
Figure 3-1: The blastn searches of the ncbi DNA sequence database
Figure 3-2: The blastn searches of the ACeDB DNA sequence database
Figure 3-3: The fasta searches of the EMBL invertebrate genome database
Figure 4-1: The physical map location of the five phage inserts (BC#S401 - BC#S405) which hybridize with Rh1 probes, showing the cosmids near these locations
Figure 4-2: Alignments of fasta scans of the cosmids of Group 1 (C37C3, F58A4 and T21B6)
Figure 4-3: Alignments of fasta scans of the cosmids of Group 2 (C37A2, ZK742 and C15H7)
Figure 4-4: Alignments of fasta scans of the cosmids of Group 2 (C48D5, C54C6 and K09C8)
Figure 4-5: Alignments of fasta scans of the cosmids of Group 3 (F10E7, F44B9 and T05A10)
Figure 4-6: Alignments of fasta scans of the cosmids of Group 3 (C07G1 and R10E4)
Figure 4-7: The lfasta alignment which most favors hybridization of the Group 1 cosmids (C37C3, F58A4 and T21B6)
Figure 4-8: The lfasta alignment which most favors hybridization of the Group 2 cosmids (C37A2, ZK742 and C15H7)
Figure 4-9: The lfasta alignment which most favors hybridization of the Group 2 cosmids (C48D5, C54C6 and K09C8)
Figure 4-10: The lfasta alignment which most favors hybridization of the Group 3 cosmids (T05A10, R10E4 and C07G1)
Figure 4-11: The lfasta alignment which most favors hybridization of the Group 3 cosmids (F10E7 and F44B9)
Figure 5-1: Results of lfasta comparison between C37C3.2, F58A4.1 and T21B6.3 proteins and Rh1 (ninaE) probe protein sequence
Figure 5-2: Results of lfasta comparison between C37A2.1, C54C6.2, F10E7.2 and F44B9.7 proteins and Rh1 (ninaE) probe protein sequence
Figure 5-3: Most similar proteins to C37C3.2, F58A4.1 and T21B6.3 from the blastp search
Figure 5-4: Most similar proteins to C37A2.1, C54C6.2, F10E7.2 and F44B9.7 from the blastp search
Figure 5-5: Results of fasta search in C. elegans sequenced protein library with Rh1 amino acid sequence as the query
Figure 5-6: Results of fasta or lfasta comparison between C52B11.3, F47D12.2, C25G6.5 and T27D1.3 proteins and Rh1 (ninaE) probe protein sequence
Figure 5-7: Results of fasta or lfasta comparison between C39E6.6, F01E11.5, ZK455.3 and C38C10.1 proteins and Rh1 (ninaE) probe protein sequence
Figure 5-8: Results of fasta or lfasta comparison between T07D4.1, F35G8.1, T05A1.1 and C50F7.1 proteins and Rh1 (ninaE) probe protein sequence
Figure 5-9: Results of fasta or lfasta comparison between F56B6.5 and C48C5.1 proteins and Rh1 (ninaE) probe protein sequence
Figure 5-10: Most similar proteins to C52B11.3, F47D12.2, C25G6.5 and T27D1.3 from the blastp search
Figure 5-11: Most similar proteins to C39E6.6, F01E11.5, ZK455.3 and C38C10.1 from the blastp search
Figure 5-12: Most similar proteins to T07D4.1, F35G8.1, T05A1.1 and C50F7.1 from the blastp search
Figure 5-13: Most similar proteins to F56B6.5 and C48C5.1 from the blastp search


Chapter 1

Introduction


1.1 The nematode Caenorhabditis elegans and its light sensitive behaviors.

Caenorhabditis elegans is a free-living soil nematode which has been found in many parts of the world. C. elegans feeds mainly on bacteria and reproduces with a life cycle of about three days under optimal conditions. Adults of C. elegans are about 1 mm in length, and have distinct shapes according to the two sexes, hermaphrodite and male. Hermaphrodites can produce both oocytes and sperm and reproduce by self-fertilization, but cannot cross-fertilize each other. Males, which occur at very low frequency, can fertilize hermaphrodites (Wood, 1988).

C. elegans offers great potential for most areas of biological research. It was chosen 30 years ago as a model organism to study animal development and behavior because of its short life cycle, small size, ease of laboratory cultivation and transparent body. Importantly, since C. elegans can easily produce hundreds of progeny from a single hermaphrodite with no genetic change, or from a male-mated hermaphrodite with genetic recombination, it offers a convenience for genetic analysis which previously existed only for some plants and microorganisms. Other advantages include the small and simple genome, and easy access to and high frequency of mutations (Riddle, et al., 1997).

C. elegans hermaphrodites have only 302 neurons, with structure and function similar to those of higher invertebrates and vertebrates. Corresponding to this simple nervous system, the sensory responses are also simple. C. elegans can respond to a variety of stimuli, including gradients of temperature and of many different chemical attractants and repellents, by migrating either up or down the gradient (taxis). Other stimuli, such as touch, vibration and light, cause a reversal of locomotion (Wood, 1988). Volatile odorants are detected by the AWA and AWC neurons, which have flattened, branched cilia near the amphid pore, and water soluble odorants by neurons with modified cilia extending out through the amphid pore (Bargmann, et al., 1993). Two classes of seven transmembrane-domain receptor molecules are candidates for chemoreception. The sr class is expressed in the neurons which detect water soluble attractants, but not in AWA, AWB or AWC (Troemel, et al., 1995). Another class includes odr-10, which is expressed only in AWA and whose mutation eliminates the response to diacetyl (Sengupta, et al., 1996). It is possible that other members of the odr-10 class are receptors for other volatile odorants (Bargmann and Mori, ...).

Although C. elegans lacks an eye, it has a reversal response to light (Burr, 1985). This suggests that C. elegans has a photosensitive neuron with a photosensitive receptor molecule. The worm responds significantly to monochromatic light at wavelengths from 520 to 600 nm, a range suggestive of rhodopsin, the only known visual photopigment of animals (Burr, 1985). Because C. elegans is known to contain other G-protein coupled receptors, the presence of a rhodopsin appeared likely.

1.2 Rhodopsin and G-protein coupled receptors.

Bovine rhodopsin was the first member of the G-protein coupled receptor superfamily whose amino acid sequence was determined. Because of the huge quantities of concentrated rhodopsin in the outer segment of the rod cell of the vertebrate retina, rhodopsin (or its peptide, opsin) is easy to obtain. This, and the tremendous interest in visual pigments, are the reasons why rhodopsin has been studied so thoroughly in both structure and function. Such studies have demonstrated its essential role in initiating phototransduction, as well as its value as a model for understanding the structure and function of the G-protein coupled receptor family (Hargrave and McDowell, 1992).

G-protein coupled receptors are now known to make up a huge family of integral membrane proteins with varied sizes and sequences. All members of this family have a distinct structure of seven transmembrane hydrophobic helixes (Watson and Arkinstall, 1994). Other regions of the polypeptide chain include the two termini (extracellular N-terminus and intracellular C-terminus) and seven loops. Three loops are on the extracellular surface (named e1, e2 and e3), three are between helixes on the intracellular surface (named i1, i2 and i3), and a seventh, i4, is formed between helix VII and palmitoylated cysteine(s) near the C-terminus. Specific structural and functional domains of this receptor family are shown and described in Fig. 1-1.

G-protein coupled receptors have several conserved regions in their primary and secondary structure which are related to their basic function, activation of a G-protein coupled signal transduction pathway. The general mechanism of a G-protein coupled signal transduction pathway (with the corresponding step of the phototransduction pathway given in parentheses) proceeds as follows. The G-protein coupled receptor (rhodopsin) interacts with a ligand (light activated 11-cis retinal) to form an activated receptor-ligand complex (activated rhodopsin). This activates the G-protein (transducin) subunit Gα (Tα) to form Gα-GTP (Tα-GTP). The Gα-GTP influences the activity of a target enzyme (Tα-GTP activates cGMP phosphodiesterase, PDE), whose substrate or product influences a function in its particular differentiated cell (PDE hydrolyzes cGMP to decrease cGMP concentration, which causes cation channels to close) (Hargrave and McDowell, 1992). Because most G-proteins are similar to each other but the


Fig. 1-1: Structural and functional domains of G-protein coupled receptors (Reprinted from Iismaa, et al., 1995). Cylinders with numbers are putative transmembrane domains. Letters represent the structural domains with some general, but not necessarily universal, functional features.

a) The N-terminal extracellular domain contains potential sites for N-linked glycosylation (Y) in most receptors, and is the usual binding site for glycoprotein hormones, metabotropic glutamate, and other peptides.

b) The first and second extracellular loops contain Cys residues which are involved in disulfide bridge formation to maintain the structural integrity of adrenergic and muscarinic acetylcholine receptors and rhodopsin.

c) The transmembrane domains contain residues critical for ligand binding in the three kinds of G-protein coupled receptors mentioned in b).

d) The intracellular loops and the proximal part of the C-terminal domain are involved in coupling to G-proteins, with particular importance ascribed to the part of the third intracellular loop between transmembrane domains 5 and 6.

e) The C-terminal domain contains one (or two) Cys residues that are sites for palmitoylation; the attached fatty acid is inserted into the lipid bilayer to form the fourth intracellular loop. Both the C-terminal domain and the third intracellular loop contain Ser and Thr residues for phosphorylation during receptor desensitization.


ligands vary, the structures and sequences of those domains involved with G-protein binding and activation are usually conserved, and those involved with specific binding to ligands vary.
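The generic G-protein coupled pathway and its phototransduction counterpart, as described in this section, can be laid side by side in a small lookup table. The Python sketch below is only an illustration: the class name, field names and function are hypothetical, and the step wording paraphrases the text rather than adding any new biology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CascadeStep:
    """One step of the cascade: the generic form and its visual counterpart."""
    generic: str
    phototransduction: str


# Steps in the order in which the text describes them.
CASCADE = (
    CascadeStep("G-protein coupled receptor", "rhodopsin"),
    CascadeStep("ligand", "light-activated 11-cis retinal"),
    CascadeStep("activated receptor-ligand complex", "activated rhodopsin"),
    CascadeStep("G-alpha-GTP", "T-alpha-GTP (transducin)"),
    CascadeStep("target enzyme", "cGMP phosphodiesterase (PDE)"),
    CascadeStep("change in a cell function", "cGMP falls and cation channels close"),
)


def describe(cascade=CASCADE):
    """Return one numbered 'generic -> phototransduction' line per step."""
    return [
        f"{number}. {step.generic} -> {step.phototransduction}"
        for number, step in enumerate(cascade, start=1)
    ]


if __name__ == "__main__":
    print("\n".join(describe()))
```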

Rhodopsin has all the structural and functional features of a typical G-protein coupled receptor. But it has its own specific features related to its distinct function, phototransduction, and to its light-sensitive ligand, 11-cis retinal. Unlike the ligands of other G-protein coupled receptors, this ligand is covalently bound to a lysine (K) in helix VII by a Schiff base linkage, and this lysine is present in all opsins. For the binding of retinal and activation of the rhodopsin, a special binding pocket is formed within the transmembrane region. Of the residues belonging to this pocket, a residue in helix III involved in binding 11-cis retinal is completely conserved within each group: E in all vertebrate opsins, and Y or F in invertebrate opsins (Sakmar, 1994). Certain residues on different transmembrane helixes, which also belong to this pocket, govern the absorption spectrum of rhodopsin. Some of these vary (as does the absorption maximum), some are highly conserved, and W and Y are found in helix VI of all vertebrate and invertebrate opsin sequences (Archer, et al., 1992; Smith, et al., 1993). These conserved residues, especially the K in helix VII, can be used to distinguish rhodopsin-like visual pigments from all other G-protein coupled receptors (Hargrave and McDowell, 1992; Nathans, 1992).

The most common and efficient method to isolate and identify opsin genes is cross-species nucleic acid hybridization of cDNA prepared from RNA extracts of eyes, with an opsin gene nucleotide sequence as the probe. The methods using hybridization include Southern and Northern hybridization and the polymerase chain reaction (PCR). Because of their highly conserved sequence in the important functional regions, most known opsin genes have been found successfully in this way. The rhodopsin of Drosophila, a highly evolved invertebrate, can be isolated by hybridization with a poorly matched opsin probe from the cow, a highly evolved vertebrate, even though those two branches separated 500 million years ago (O'Tousa, et al., 1985).

1.3 C. elegans genome contains some sequences which hybridize with Drosophila melanogaster rhodopsin cDNA.

1.3.1 Early indication of opsin homologous sequences in C. elegans.

Encouraged by the successes with other organisms, our laboratory initiated a search for an opsin in C. elegans in collaboration with John Boom and Michael Smith (Boom, J., Lobo, K., Smith, M. J. and Burr, A. H., unpublished). C. elegans genomic DNA was digested by restriction endonuclease EcoR I and blotted. This genomic blot was probed with Drosophila rhodopsin cDNA, specifically a 32P-labeled 1.5 kb whole cDNA from Rh1, one of five rhodopsin genes expressed in Drosophila. The optimum conditions of hybridization and wash were at a moderate stringency level: 62 °C in 5 x SSPE (see section 2.2.1), 5 x Denhardt's (Denhardt, 1966) and 0.3 % SDS, four hours prehybridization followed by overnight hybridization; three final washes of one hour each at 62 °C in 1 x SSPE.

As the result of the hybridization, four strongly and several weakly hybridizing fragments, 1.5 - 7.7 kb in length, were identified. Because the hybridization occurred under moderate stringency, higher than the usually low stringency conditions required for most cross-species hybridization of opsin, these genomic DNA fragments appeared to have relatively high sequence similarity to the rhodopsin cDNA probe. This encouraged us to continue.


1.3.2 Identification of five clones from a C. elegans genomic library.

Under the above hybridization conditions, a C. elegans genomic library in Charon 4A

phage vector was screened by hybridization with the same probe. A total of 13 clones

representing five non-overlapping genomic fragments were isolated. To investigate the homology

between those five inserts and specific regions of rhodopsin, the 1.5 kb probe was digested by

enzyme Pst I into three restriction fragments, 0.6, 0.8 and 0.2 kb in length (see Chapter 2).

The five phage clones, named BC#S401 - BC#S405, were digested by EcoR I / Hind

III, then hybridized with the 0.8 kb probe (Fig. 1-2, A) and the 0.6 kb probe (Fig. 1-2, B). The

0.8 kb probe, which was thought to be more opsin-specific in structure, hybridized strongly with

at least one restriction fragment in each clone, while the 0.6 kb probe, which might contain more

conserved sequences of the G-protein coupled receptor family, hybridized weakly with only four of

the five clones (no hybridization with S401). Therefore these five inserts appear to be more

similar to the 3'-portion than the 5'-region of opsin cDNA, that is, more similar to the opsin-specific sequence than the common G-protein coupled receptor-specific sequence.

1.3.3 Localization of the five phage clone inserts on the physical map of the C. elegans genome.

The five independent phage clones were localized (by John Coulson, Cambridge

University) to five widely separated positions on the physical map of the C. elegans genome:

- BC#S401: near the dpy-14 gene on chromosome I.

- BC#S402: near the stP23 site on chromosome V.

- BC#S403: near the lin-12 gene on chromosome III.

Fig. 1-2: Southern blots of C. elegans genomic clones (S401 - S405) hybridized

with Drosophila opsin cDNA probes. The lambda phage clones with

genomic DNA inserts were digested with EcoR I and Hind III; 2.5 µg

DNA per lane was electrophoresed in 0.8 % agarose gels. Southern

transfer was probed at moderate stringency (see the text) with: A. The 0.8 kb 3'-segment of Drosophila opsin ninaE cDNA (eight hours

exposure with intensifying screen); B. The 0.6 kb 5'-segment of the

same cDNA with sequence coding for the G-protein binding region

(five days exposure with intensifying screen). The specific activity of labeled probe was 10^8 cpm / µg.

- BC#S404: near the unc-93 area on chromosome III.

- BC#S405: near the lin-14 gene on chromosome X.

Thus, five inserts could belong to five different genes.

The results of Southern hybridization and the screening of genomic library suggested that

C. elegans may have opsin genes and this should be confirmed by examining their sequences.

This is one of the aims of the research work of this thesis.

1.4 Some criteria of nucleic acid hybridization.

Nucleic acid hybridization is the only efficient method to detect a predicted target

sequence in a complex nucleic acid mixture (Wetmur, 1991). It is the basis of Southern and

Northern hybridization and PCR techniques. The principle of nucleic acid hybridization is based

on the formation of a helix from two complementary or partially complementary polynucleotide

strands. Under specific conditions, using a labeled poly- or oligo-nucleotide as the probe, a

nucleic acid strand containing a sequence complementary to the probe can be identified in a

nucleic acid mixture by the formation of the probe-target duplex, or "hybrid" (these terms usually

refer to the joined two strands without regard to structure). The strategy of hybridization lies in

setting optimal conditions such as time, nucleic acid concentrations (probe and target),

temperature, ionic strength and the degree of complementarity between the probe and the target

sequences (Britten and Davidson, 1985; Wetmur, 1991).

When the probe and target sequences have only isolated regions of similarity, the duplex

will consist of base-paired regions (hybridizing regions) bordered by loops and free ends. The

degree and extent of complementarity between probe and target, or in other words, the sequence

similarity and length of the hybridizing regions, determine the effectiveness of nucleic acid

hybridization in detecting a target sequence. They govern both the rate of initial duplex formation

and the stability of the duplex, once formed (Southern, 1985).

The process of duplex formation during hybridization is the same as that during

renaturation of double-stranded nucleic acid. It consists of two steps:

1) nucleation: the joining together of two fully separated complementary strands by strict base

pairing (A - T and C - G);

2) zippering: the rapid formation of successive base pairs along the molecule, not necessarily by

strict base pairing.

The reaction rate is driven by the first (rate-determining) step (Fig. 1-3, A). If the probe and the

target sequences share several long enough regions with high homology (> 70 %), nucleation will

occur, then zippering will occur rapidly to form a hybridizing region (Britten and Davidson, 1985;

Wetmur, 1991).

The stability of a hybridizing region once formed depends on a two-step dissociation

reaction (Fig. 1-3, B).

1) When temperature T is close to but not higher than the melting temperature, Tm, of the

hybridizing region, denaturation is reversible. This means the rates of renaturation and

denaturation are the same, and the hybridizing region will be partly dissociated. As T rises

towards the Tm of this region, the degree of dissociation will increase, mismatching regions

first, then the A-T rich regions.

Fig. 1-3: Tm and the two-step processes of hybridization and hybrid dissociation.

A) Two-step process of hybridization. B) Two-step process of hybrid

dissociation. (Adapted from Wetmur, 1991)

2) When T > Tm, irreversible denaturation will occur, and the two strands will completely

separate.

Thus, the melting temperature, Tm, of the region is a measure of its stability.

The equation relating melting temperature of a region to salt concentration, and the

region's % G+C, length D, and % mismatch P, is:

$$T_m\ (^{\circ}\mathrm{C}) = 81.5 + 16.6 \log_{10}\left(\frac{[\mathrm{Na}^+]}{1.0 + 0.7\,[\mathrm{Na}^+]}\right) + 0.41\,(\%\,\mathrm{G+C}) - \frac{500}{D} - P$$

A 1 % increase in mismatch between the hybridizing sequences can decrease Tm by 1°C

(Wetmur, 1991). The equation applies when both probe and target are free in solution.

However, when a filter-bound hybrid duplex is washed at low stringency, the dissociated strand is

removed relatively more slowly. The resulting high local concentration shifts the equilibrium

towards higher duplex stability and Tm (Britten and Davidson, 1985).
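
The equation above can be evaluated directly. The following Python sketch is only an illustration: the function name melting_temperature and the example values (0.8 M Na+, 50 % G+C, a 300 bp region with 20 % mismatch) are assumptions chosen for demonstration and are not taken from the experiments described here.

```python
import math

def melting_temperature(na_molar, percent_gc, length_bp, percent_mismatch):
    """Tm (deg C) of a hybridizing region, following the equation in the text.

    na_molar          -- [Na+] in mol/l
    percent_gc        -- % G+C of the region
    length_bp         -- length D of the region in base pairs
    percent_mismatch  -- % mismatch P within the region
    """
    if na_molar <= 0 or length_bp <= 0:
        raise ValueError("salt concentration and length must be positive")
    salt_term = 16.6 * math.log10(na_molar / (1.0 + 0.7 * na_molar))
    return (81.5 + salt_term + 0.41 * percent_gc
            - 500.0 / length_bp - percent_mismatch)

# Hypothetical example values, for illustration only.
tm = melting_temperature(0.8, 50.0, 300, 20.0)
print(f"Tm = {tm:.1f} deg C")
```

Because P enters the equation linearly, each additional 1 % of mismatch lowers the computed Tm by exactly 1°C, in agreement with the statement above.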

If mismatches are distributed evenly over the entire length of the shorter strand of the

duplex, the upper limit of mismatches is 30 % (70 % similarity). Above this, the optimal

temperature for hybridization (Tm - 25°C) will be lower than temperatures at which

hybridization normally occurs (Britten and Davidson, 1985; Wetmur, 1991). But there are many

cases of stable hybridization in which the over-all similarity between the two sequences is lower

than 70 %, some even lower than 40 % (O'Tousa, et al., 1985). In these cases, mismatch is non-

uniform and there are local regions of high similarity.

This can be understood by considering the dissociation reaction. Complete separation of

the duplex strands requires irreversible dissociation of all locally similar regions, and irreversible

dissociation occurs at T > Tm; therefore the temperature at which a duplex will separate

depends primarily on the Tm of the most stable local region.

My objective is to examine locally similar regions of the probe-target sequences identified

by lfasta, select the most stable region and determine if the duplex would form a stable hybrid.

The lfasta program provides the length of each similar region (D) and its % identity (100 - P).

The question is what the minimum D and maximum P are under the hybridization conditions. The

literature provides very few guidelines. There appears to be a lower limit on D for a stable

hybridizing region. In electron microscopic observations, a minimum length of 30 ± 10 bp was

observed for a 100 % identical region of the conalbumin gene (Oudet and Schatz, 1985). The

minimum stable region with mismatches would be significantly larger. Note in the equation for

Tm that the ratio 500 / D becomes very sensitive to D for D < 100, that is, small changes in D would

produce large changes in Tm. Therefore I chose the minimum size for a stable hybridizing

region to be 100 bp, since above D = 100 the Tm is changed by a maximum of 5°C.

Once the limit on D is chosen, the equation can be used to estimate the maximum P at

the given hybridization temperature and salt concentration (stringency level). This will be done in

Chapter 4.
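
The reasoning behind the 100 bp limit, and the way a maximum P follows from it, can be sketched numerically. The Python sketch below is hedged: it assumes the simplest possible criterion (the region is regarded as stable while its Tm is at least the wash temperature), which may differ from the criterion actually applied in Chapter 4, and the function names and the example conditions (55°C, 0.8 M Na+, 50 % G+C) are hypothetical.

```python
import math

def length_term(length_bp):
    """The 500 / D term of the Tm equation, in deg C."""
    return 500.0 / length_bp

def max_mismatch(temp_c, na_molar, percent_gc, length_bp):
    """Largest % mismatch P for which Tm stays >= temp_c.

    Obtained by solving the Tm equation for P; assumes a region is
    stable as long as its Tm is not below the given temperature.
    """
    salt_term = 16.6 * math.log10(na_molar / (1.0 + 0.7 * na_molar))
    tm_perfect = 81.5 + salt_term + 0.41 * percent_gc - length_term(length_bp)
    return max(0.0, tm_perfect - temp_c)

# The length term shrinks quickly up to D = 100 and changes little beyond it.
for d in (30, 50, 100, 200, 500):
    print(f"D = {d:4d} bp  ->  500/D = {length_term(d):5.1f} deg C")

# Hypothetical conditions, for illustration only.
print(f"max P = {max_mismatch(55.0, 0.8, 50.0, 100):.1f} %")
```

At D = 100 the length term equals 5°C and can only decrease for longer regions, which is the basis of the 5°C bound quoted above.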

Thus, if two DNA fragments share some short regions with > 70 % sequence similarity,

nucleation will occur, and if the neighboring region is longer than 100 bp with a mismatch less

than the maximum P, the local hybridizing region will be stable and the probe-target duplex will

remain hybridized. This makes it possible to use a heterologous probe containing a short, highly

conserved region of a gene family to detect an unknown gene belonging to this family using low

stringency conditions, even though many mismatches exist between them.

1.5 The possibility of using computer tools to search for and analyze opsin-related (or

opsin-like) genes in the C. elegans genome.

1.5.1 Progress of the C. elegans genome sequencing program.

Because of the importance of C. elegans as a model animal, its small genome (100 x 10^6

base pairs) and the advanced state of the physical map (more than 99 % of the genes are cloned in

cosmids and mapped), sequencing the whole genome became both necessary and possible. This

program has already revolutionized C. elegans biology. Together with genetics, development and

anatomical data, it also has provided a powerful resource for research in other systems. To

date, > 70 % of the genome, with more than 12,500 genes, has been completely or nearly

completely sequenced. The total program will be completed by the end of 1998 (Riddle, et al.,

1997; Waterston et al., 1997; Coulson and the C. elegans genome consortium, 1997). This makes

it possible for an individual gene or its product in C. elegans to be identified in the sequence data

base by its similarity to another gene sequence using computer tools for sequence-similarity search

and comparison. This is analogous to screening a genomic library by hybridization followed by gene sequencing, which could be replaced by this more efficient procedure.

As all five regions where our phage inserts are located have been sequenced, and the

inserts were selected by hybridization in screening a genomic library, it would be interesting to

test the feasibility of the computer similarity search procedure to identify these sequences. For

this purpose, I have chosen nine sequenced cosmids which are near to and may overlap the five

insert locations and have their sequences. The details will be described in Chapter

1.5.2 Computer tools for sequence-similarity search and comparison.

In the past, it was difficult to identify a new gene or protein through computer database

search and sequence-similarity comparison. Because inefficient algorithms were used, programs

usually spent hours on a short sequence comparison (< 200 bp) (Lipman and Pearson, 1985). In

1985, a new computer program named fastp was offered for rapid and sensitive protein similarity

searches. fastp used a new algorithm with a modified form of the diagonal method and a

matching score system to detect the identity between two sequences. It speeds the search for

similarities at least 100-fold and can be operated on microcomputers (Lipman and Pearson, 1985).

Then, fastp was modified to fastn (for DNA comparison) and improved to form a package of

computer tools - the fasta system. One tool of the package, lfasta, is designed for local similarity

comparison which can display all of the regions of local similarity between two chosen sequences

with scores higher than a set threshold (Pearson, 1990). Meanwhile, another program for rapid

sequence comparison, blast (basic local alignment search tool), was designed to directly

approximate alignments that optimize another measure of local similarity, the MSP (Maximal

Segment Pair) score. It uses the same heuristic algorithms as fasta, but does not allow gaps

which are used for improving matching in fasta alignments. As a result, blast has less sensitivity

in similarity searching than fasta, but its results have higher significance (less chance to be

randomly matched) (Altschul, et al., 1990). The blast system is generally used for searching large

databases for similarity to a query sequence whereas fasta is more commonly used for comparing

two selected sequences.

(Note: a very recent version of blast, not available for this thesis work, now does allow gaps.)
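
The idea behind the diagonal method mentioned above can be illustrated with a short Python sketch. This is a simplified illustration and not the fastp or fasta implementation: it only counts identical words (k-tuples) shared by two sequences and groups them by diagonal, that is, by the offset between their positions. The function name diagonal_hits, the word length of 4 and the two short sequences are assumptions made for the example.

```python
from collections import Counter

def diagonal_hits(query, target, ktup=4):
    """Count identical k-tuples shared by two sequences, per diagonal.

    The diagonal of a hit is (position in target) - (position in query).
    A diagonal with many hits marks a candidate region of local similarity.
    """
    # Lookup table: every k-tuple of the query and the positions where it occurs.
    positions = {}
    for i in range(len(query) - ktup + 1):
        positions.setdefault(query[i:i + ktup], []).append(i)

    # Scan the target once and record the diagonal of every shared k-tuple.
    hits = Counter()
    for j in range(len(target) - ktup + 1):
        for i in positions.get(target[j:j + ktup], ()):
            hits[j - i] += 1
    return hits

# Hypothetical sequences: the query is embedded in the target at offset 2.
hits = diagonal_hits("ACGTACGGA", "TTACGTACGGATT")
best_diagonal, count = hits.most_common(1)[0]
print(best_diagonal, count)
```

Because the table lookup replaces a full comparison of every position pair, only the few diagonals rich in hits need to be rescored, which is the source of the speed gain described above.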

The fasta and blast programs all have some output parameters which can be set by the

user for individual output requirements. fasta also provides the option of changing its score

matrix for specific purposes, while the score matrixes of the blast system are set to satisfy the

requirements of searching a particular database and are unchangeable.

1.5.3 The possibility of using the computer tools to mimic Southern hybridization.

If gaps are disallowed, fasta locates regions of greatest similarity between two sequences

based on identities, allowing mismatches but not insertions or deletions. Southern hybridization

locates sequences of similarity in the same way. In principle, therefore, fasta can be used to

identify regions of highest complementarity of two sequences involved in hybridization. Also, the

similarity score of fasta alignments (init1) could be used to estimate the degree of mismatching in

a hybrid. Since the local similarity in the hybrid is important in Southern hybridization, lfasta,

designed for local comparisons, may be best for predicting the results of hybridization between a

pair of sequences. I will be testing this possibility in this thesis.

To be similar to the process of Southern hybridization, there are two important changes

that must be made to the scoring of lfasta alignments:

1. In hybridization, all bases are paired successively without gaps, therefore in lfasta

comparisons, gaps should be disallowed. I therefore modified the scoring matrix by

increasing the penalty scores for gaps from -12 (first gap) and -4 (subsequent gaps) to -9999

and -999, so that the similarity score of any alignment requiring a gap will drop below the

threshold and be ignored.

2. In hybridization, G-C base pairs are more stable than A-T pairs because G-C pairs have three

hydrogen bonds, while A-T pairs have only two. I increased the matching score of

nucleotides G and C from +5 to +7 so that every G-C pair will receive a 1.4-fold higher score

than the A-T pair, and the G-C rich regions or sequences will show higher scores.

This modified DNA score matrix (Table 3-1 in Chapter 3) can be used in both fasta

searches and lfasta comparisons. Scores and alignments from such a modified lfasta comparison

will be much closer to the real requirements of complementarity for hybridization. Alignments

selected by the scores can then be analyzed further according to additional theoretical

requirements for hybridization. In principle, this process can predict alignments and relative

strength of Southern hybridization.
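
The effect of the two modifications can be illustrated with a minimal Python sketch of gap-free local scoring. It is not the lfasta program: it simply scans every diagonal of two sequences and keeps the highest-scoring ungapped segment. The match scores of +5 (A or T) and +7 (G or C) are the values given above; the mismatch score of -4 and all function names are assumptions, since the full matrix is given only in Table 3-1.

```python
MATCH_AT = 5    # identity of A or T (value stated in the text)
MATCH_GC = 7    # identity of G or C after the modification (stated in the text)
MISMATCH = -4   # assumed mismatch score; not specified in this section

def pair_score(a, b):
    """Score one aligned pair of nucleotides."""
    if a != b:
        return MISMATCH
    return MATCH_GC if a in "GC" else MATCH_AT

def best_ungapped_segment(query, target):
    """Return (score, query_start, target_start, length) of the best
    gap-free local alignment; gaps are never opened, mimicking the
    effectively infinite gap penalties described above."""
    best = (0, 0, 0, 0)
    for offset in range(-(len(query) - 1), len(target)):
        i = max(0, -offset)          # position in query
        j = i + offset               # position in target on this diagonal
        running, start_i = 0, i
        while i < len(query) and j < len(target):
            if running <= 0:         # restart the segment after a bad stretch
                running, start_i = 0, i
            running += pair_score(query[i], target[j])
            if running > best[0]:
                best = (running, start_i, start_i + offset, i - start_i + 1)
            i += 1
            j += 1
    return best

# Hypothetical sequences: an 8-base query found intact inside the target.
print(best_ungapped_segment("ACGGCCTA", "TTACGGCCTAGG"))
```

With these scores a run of four G-C identities scores 28 against 20 for four A-T identities, the 1.4-fold ratio described in point 2.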

The MSPs of blast do not allow gaps, therefore in theory its local similarity scores

more closely reflect the hybridization condition than those of the unmodified fasta. But its scoring matrix

cannot be modified to reflect base pair strength. Also it does not allow comparisons: it only sends

out one alignment with the highest score for each query. Thus current versions of blast are not as

useful to mimic hybridization, although a very high blastn score may suggest a high possibility of

hybridization. The value of MSP scores will be compared with lfasta comparisons for predicting

hybridization.

Chapter 2

Confirmation that C. elegans Genomic DNA Fragments

Hybridize with D. melanogaster Rh1 cDNA

My original goal for this thesis was to identify the genes of C. elegans which code for

opsin. The usual method to reach it is by nucleic acid hybridization. The previous attempts

identified several regions of the C. elegans genome which hybridize with D. melanogaster Rh1

opsin cDNA, and five phage clones with non-overlapping inserts had been selected from a C.

elegans genomic library by hybridization with the 1.5 kb Rh1 cDNA probe (Boom, J., Lobo, K.,

Smith, M. J. and Burr, A. H., unpublished). Initially, I decided to repeat this work to confirm that

there are regions in the C. elegans genome with high enough similarity to the probe to trigger the

hybridization.

2.1 Choice of the probe.

The probe used in my hybridization work is D. melanogaster rhodopsin Rh1 gene cDNA.

This gene was isolated by a cross-species Northern hybridization with a bovine rhodopsin RNA

probe transcribed from its cDNA (Zuker, et al., 1985). In the other paper published simultaneously

(O'Tousa, et al., 1985), it was isolated by Southern hybridization with a cDNA probe of the same

bovine rhodopsin under low stringency. The Rh1 cDNA has 1556 nucleotides. The coding region is

from nucleotide 172 to 1290 and encodes 373 amino acid residues with 22 % identity to bovine

rhodopsin (Fig. 2-1). The cDNA probe we used, cloned in the pUC18 plasmid in E. coli JM83,

was donated directly by Dr. Zuker.

In addition to the need to repeat previous work in our laboratory with the same probe,

there are other reasons why I chose D. melanogaster Rh1 cDNA as the probe:

All known invertebrate opsins have identical sequences in several highly conserved regions.

At the amino acid sequence level, the ratio of identical or conservatively substituted residues

Fig. 2-1: D. melanogaster Rh1 opsin cDNA sequence and its deduced

amino acid sequence (rhodopsin ninaE) compared with bovine

rhodopsin amino acid sequence. The "ATTAAA" sequence that

probably signals poly(A) addition is underlined. The underlined

sequences with the restriction enzyme name (Pst I) indicate the

restriction cutting sites of Pst I. The boxed sequence is the

codon of the lysine which binds the 11-cis-retinal in opsin.

(Reprinted from Zuker et al., 1985)

in these short regions is higher than 80 % (Smith, et al., 1993). All known vertebrate opsins

are even more similar to each other. At the whole amino acid sequence level, there are five

opsins from different species (human, bovine, mouse, chicken and fish) sharing at least 75 %

identity (Archer, et al., 1992). But there are obvious differences between invertebrate and

vertebrate opsin sequences, either in residue composition or in conserved region positions.

For example, one pair of amino acid residues at the border of helix III and cytoplasmic loop-2,

a G-protein binding site, is E-R in all vertebrate opsins but D-R in all invertebrate opsins

(Fahmy and Sakmar, 1993). Thus invertebrate opsin is an appropriate choice as the probe for

searching opsin homologous gene sequences from invertebrate genomes such as C. elegans.

Compared with vertebrates, only a few invertebrate opsins

melanogaster rhodopsins are the only ones which have been researched in detail and used as

the probes for isolating other invertebrate opsins. For example, Limulus, an arthropod with a

long evolutionary distance from the fly, has two opsins with sequences more similar to D.

melanogaster Rh1 sequence than all other known invertebrate opsins (Smith, et al., 1993).

Therefore, D. melanogaster rhodopsin sequences could be a first choice as a probe for

searching lower invertebrate opsin even though it is a higher invertebrate.

Because all G-protein coupled receptors have the same seven-transmembrane helical

structure and certain conserved sequences, and opsins are a subfamily of G-protein coupled

receptors, a probe with more specific structure than the entire Rh1 cDNA was chosen for

searching opsin-like genes. Restriction endonuclease Pst I has two restriction sites (Fig. 2-1) in

the Rh1 cDNA sequence, which cut the Rh1 cDNA (~1.5 kb) into three pieces:

The 577 nucleotide 5' half.

The 821 nucleotide 3' half without the 3' end.

The 158 nucleotide 3' end region.

The ~0.2 kb sequence is noncoding and includes the poly(A), thus it cannot be used as a probe.

The ~0.6 kb (5' half) piece has a similar structure and the conserved regions of all G-protein

coupled receptors, but the ~0.8 kb piece has more opsin-specific conserved regions including the 11-

cis-retinal binding site, the lysine residue in helix VII (Fig. 2-1), and would be a more specific

probe for opsin-searching. I used two probes, the 1.5 kb whole sequence and the 0.8 kb piece, in

my hybridization work.
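
The lengths quoted in this section can be cross-checked with simple arithmetic. The short Python sketch below is only a consistency check on the numbers stated in the text (fragment sizes, cDNA length, coding region coordinates); the variable names are arbitrary.

```python
# Values as stated in the text for the Rh1 cDNA.
CDNA_LENGTH = 1556                       # total nucleotides
PST_FRAGMENTS = {"5' half": 577, "3' half": 821, "3' end": 158}
CODING_START, CODING_END = 172, 1290     # inclusive nucleotide positions

# The three Pst I fragments should account for the whole cDNA.
total = sum(PST_FRAGMENTS.values())

# Length of the coding region and the number of codons it contains.
coding_nt = CODING_END - CODING_START + 1
codons, remainder = divmod(coding_nt, 3)

print(total, coding_nt, codons, remainder)   # prints: 1556 1119 373 0
```

The three fragments sum to the full 1556 nucleotides, and the 1119-nucleotide coding region corresponds exactly to the 373 amino acid residues stated above.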

2.2 Southern hybridization.

2.2.1 Media and buffers:

NGM agarose plate (Sulston and Hodgkin, 1988).

3 g NaCl + 8 g agarose + 2.5 g peptone + 1 ml cholesterol (5 mg / ml in ethanol) + 975 ml

H2O, autoclave; then add, while hot, using sterile technique:

1 ml 1M CaCl2 + 1 ml 1M MgSO4 + 25 ml 1M KH2PO4 (pH 6).

LB medium (Lech and Brant, 1988), per liter.

10 g Tryptone + 5 g yeast extract + 5 g NaCl + 1 ml 1 N NaOH

Lambda broth (Lech and Brant, 1988), per liter.

10 g Tryptone + 2.5 g NaCl

SSC buffer (pH 7.0)      SSPE buffer (pH 7.4)     TAE buffer (pH 7.2)

0.15M NaCl               0.15M NaCl               0.04M Tris base

0.015M Na3 citrate       0.01M NaH2PO4            0.02M Na acetate

                         0.001M EDTA-Na2          0.001M EDTA-Na2

The above buffer compositions were obtained from the booklet: GeneScreen™ &

GeneScreen PLUS® Hybridization transfer membrane. Transfer and detection protocols. E.

I. du Pont de Nemours & Co. (Inc.), NEN Products, Boston, MA, USA.

TBE buffer (pH 8.0) (Moore, 1996)

0.089M Tris base

0.089M Boric acid

0.002M EDTA
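
The sodium ion concentration required by the melting temperature equation of section 1.4 can be estimated from the SSC and SSPE compositions listed above. The Python sketch below makes two simplifying assumptions that should be kept in mind: every sodium salt is treated as fully dissociated, and any sodium added when adjusting the pH is ignored. The function and variable names are hypothetical.

```python
# Each entry: component -> (molarity at 1 x strength, Na+ ions per formula unit).
SSC_1X = {"NaCl": (0.15, 1), "Na3 citrate": (0.015, 3)}
SSPE_1X = {"NaCl": (0.15, 1), "NaH2PO4": (0.01, 1), "EDTA-Na2": (0.001, 2)}

def sodium_molarity(recipe_1x, strength=1.0):
    """Approximate [Na+] in mol/l of a buffer used at the given strength
    (e.g. strength=5 for 5 x SSPE), assuming complete dissociation."""
    return strength * sum(conc * n_na for conc, n_na in recipe_1x.values())

for label, recipe, strength in (("1 x SSPE", SSPE_1X, 1),
                                ("5 x SSPE", SSPE_1X, 5),
                                ("10 x SSC", SSC_1X, 10)):
    print(f"{label}: [Na+] = {sodium_molarity(recipe, strength):.3f} M")
```

Under these assumptions 5 x SSPE corresponds to roughly 0.8 M Na+ and 1 x SSPE to roughly 0.16 M Na+.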

2.2.2 Preparation of materials

D. melanogaster Rh1 cDNA was cloned in pUC18 / JM83. The five Charon 4A phage

clones with C. elegans genomic DNA inserts (S401 - S405) which hybridize with the Rh1 probe

were obtained from the stock in our laboratory. The plasmid / bacterial clone was grown at 37°C

in selective LB medium with 0.5 mg / ml ampicillin. The lambda phage clones were grown at

37°C in lambda broth plus LB media.

Phage DNA with the C. elegans genomic inserts was extracted by using the mini

preparation method (0.25 % SDS lysis at 70°C for 5 minutes, centrifugation to remove protein,

addition of 1/9 volume 5M K acetate and 2 volumes 100 % ethanol to precipitate the DNA)

(Shuang-Young, 1986). All five inserts were separated from their vectors by EcoR I / Hind III

digestion, using 10 units enzyme for 0.15 µg phage DNA in 20 µl One-Phor-All PLUS buffer

(Pharmacia Biotech, Uppsala, Sweden) at 37°C for 1 hour. This was followed by electrophoresis

on a 0.8 % agarose gel (0.5 x TBE + 0.5 µg / ml ethidium bromide, 100V, 90 minutes) for

subsequent Southern blotting.

Plasmid DNA with probe was extracted using the Magic™ Minipreps DNA Purification

System (Promega Corporation, Madison, WI, USA, Cat. # A7100). It was digested by EcoR I

(EcoR I / Pst I to obtain the 0.8 kb probe), using 10 units enzyme per 1 µg plasmid DNA in One-

Phor-All PLUS buffer at 37°C for 1.5 hours. After electrophoresis on a 1.0 % NuSieve low

melting temperature agarose gel (1 x TAE + 0.5 µg / ml ethidium bromide, 80V, 105 minutes),

the Rh1 probe band (1.5 or 0.8 kb) was cut out, extracted by adding 1 volume phenol, and

precipitated by adding 1/2 volume 3M Na acetate and 2 volumes 100 % ethanol.

C. elegans genomic DNA was prepared by:

Cultivating worms (C. elegans N2 strain) on bacterial lawns growing on NGM agarose plates.

Washing off and sedimenting worms, then digesting with 1 mg / ml proteinase K at 65°C until

complete.

Extraction of DNA in 1 volume phenol followed by addition of 1/9 volume 4M NH4 acetate,

then 1 volume 100 % isopropanol was added to precipitate the DNA. DNA was rinsed using

70 % ethanol.

EcoR I digestion: 10 units enzyme per 1 µg genomic DNA in One-Phor-All PLUS buffer at

37°C for 1.5 hours. The digest was separated by electrophoresis on a 0.8 % agarose gel (0.5

x TBE + 0.5 µg / ml ethidium bromide, 80V, 120 minutes) for subsequent Southern blotting.

Genomic DNA from another nematode with light sensitive behavior, Mermis nigrescens, was also

extracted for hybridization work following the same procedures.

2.2.3 Southern blotting and hybridization:

The gel was acidified in 0.25N HCl, neutralized in a solution of 0.4N NaOH in 0.6M

NaCl, then in another solution of 0.5M Tris-HCl (pH 7.5) in 1.5M NaCl. The gel was then

capillary blotted (filter paper wick) overnight with 10 x SSC as the transfer solution (salt transfer)

onto GeneScreen™ membrane [E. I. du Pont de Nemours & Co. (Inc.), NEN Products, Boston,

MA, USA]. The blot membrane was washed in 0.4N NaOH, neutralized in a solution of 0.2M

Tris-HCl in 1 x SSC. The DNA was *UV autocross-linked (254 nm, 1200 µW / cm2), and kept

wet until hybridization.

*The method of UV autocross-linking was obtained from the booklet: GeneScreen™ &

GeneScreen PLUS®. Hybridization transfer membrane. Transfer and detection protocols. E. I.

du Pont de Nemours & Co. (Inc.), NEN Products, Boston, MA, USA.

The blot was prehybridized at 55°C in 5 x SSPE, 5 x Denhardt's (Denhardt, 1966), 0.3 %

SDS and 0.1 mg / ml salmon sperm DNA for more than 4 hours. The Rh1 probe (1.5 kb or 0.8

kb) was labeled with 32P-dCTP (specific activity of ~10^9 cpm / µg) using Oligolabelling Kit

(Pharmacia Biotech., Uppsala, Sweden) according to the instructions, denatured, added to the

hybridization buffer (~3 ng / ml) and hybridized at 55°C overnight (> 18 hours). The blot was

washed at 55°C in 1 x SSPE for one hour, repeated three times. Kodak X-omat X-ray film was

exposed for at least 96 hours with an intensifying screen.

Fig. 2-2: Hybridization of nematode (C. elegans and Mermis nigrescens)

genomic DNA digested by EcoR I with the 1.5 kb 32P-labeled

D. melanogaster Rh1 opsin cDNA probe.

Left figure: the gel with 2.0 µg / lane DNA samples before Southern blotting.

0.35 µg lambda phage DNA digested by BstE II was used as the

reference ladder (UV light, ethidium bromide stain).

Right figure: the X-ray film after 96 hours exposure to the blot with an intensifying

screen. The arrows indicate the hybridization bands.

2.3 Results and discussion.

After many trials, I established an effective, lower-moderate stringency condition for the

Southern hybridization. For EcoR I - digested C. elegans and Mermis nigrescens genomic DNA

hybridized with the 1.5 kb probe, each blot shows two faint hybridization bands at the same sizes

both > 8.5 kb (Fig. 2-2, Right). A weaker hybridization band can be observed at 4.3 kb for C.

elegans but at 4.8 kb in Mermis.

Compared with the hybridization results found previously by John Boom et al. (see

section 1.3), the genomic digest had fewer hybridizing DNA fragments and they were longer in

size. Also a lower stringency was necessary. Thus I could not repeat John Boom's genomic blot

with the different techniques and conditions I used. The smaller quantities of sample (2 µg / lane,

only 1/5 of Boom's sample), different labeling techniques and different specific activity of labeled

probes may be responsible for the weaker hybridization signal such that some hybridization bands

might be too weak to be observed. Since my hybridization needed low stringency conditions

(which caused the high background), and long exposure time to produce few and weak

hybridization signals, it appears that using cross-species hybridization with the Rh1 opsin cDNA

probe to identify similar sequences in C. elegans genomic blots is very difficult.

However, there is no doubt that the cloned C. elegans genomic DNA fragments can hybridize with the Rh1 probe on Southern blots. For EcoR I / Hind III digested DNA from five phage inserts (S401 - S405) hybridized with the 1.5 kb probe (Fig. 2-3, Right), there is one relatively strong hybridization band in each of the five sample lanes:

S401: 1.5 kb.


Fig. 2-3: Hybridization of phage inserts, S401 - S405, digested by EcoR I / Hind III with the 1.5 kb 32P-labeled D. melanogaster Rh1 opsin cDNA probe.

Left figure: the gel with 35 ng / lane DNA samples before Southern blotting. 0.175 µg lambda phage DNA digested by BstE II was used as the reference ladder (UV light, ethidium bromide stain).

Right figure: the X-ray film after 96 hours exposure to the blot with an intensifying screen.


S402: 2.8 kb.

S403: 3.2 kb.

S404: 1.1 kb (weaker).

S405: 3.6 kb.

The results of hybridization between the phage clone inserts and the 1.5 kb Rh1 probe were as expected. All five inserts hybridized with the Rh1 probe, and the hybridizing fragments were approximately the same sizes as observed previously by Boom et al. (most within 0.1 kb; compare Fig. 2-3 with Fig. 1-2). However, at the lower stringency and much longer exposure time, weaker signals and higher background were observed. This may be explained by the much smaller sample quantities I used (35 ng phage DNA per sample lane, 1/70 of Boom's sample), different labeling techniques and different specific activity of the labeled probes.


Chapter 3

Searching Currently Sequenced Regions of the C. elegans

Genome for Sequences Similar to D. melanogaster Rh1 Probes


The results of Southern hybridization suggest that some regions of the C. elegans genome are similar to part of the D. melanogaster opsin Rh1 cDNA sequence. Since most of the C. elegans genome has been sequenced, and the rest will be sequenced soon (Coulson and the C. elegans genome consortium, 1997), it is feasible to use the computer search tools, fasta and blast, with the Rh1 probe sequences as the query sequences to identify regions with similar sequence in the C. elegans genome.

3.1 The basic features of computer searching systems.

Both blast and fasta use a measure of similarity between two sequences to distinguish biologically significant identities from random chance similarities. Instead of the commonly used dynamic programming algorithm, which needs a supercomputer or other special purpose hardware, blast and fasta employ heuristic search algorithms that allow large databases to be searched on commonly available computers (Altschul et al., 1990). fasta was developed earlier than blast and was the first to use a similarity score matrix to find locally similar regions between two sequences (Lipman and Pearson, 1985; Pearson and Lipman, 1988; Altschul et al., 1990).

The search algorithm in blast uses a measure called the MSP score (Maximal Segment Pair score), which is based on a score matrix defined for DNA or protein sequences. An MSP is defined to be the highest scoring pair of identical length segments chosen from the two sequences. The boundaries of an MSP are chosen to maximize its score, so an MSP can be of any length. The MSP score provides a measure of local similarity for any pair of sequences. blast can seek out all locally maximal segment pairs (usually called HSPs, High-scoring Segment Pairs) from one pair of sequences with scores above a fixed cutoff score (Altschul et al., 1990). Because blast and fasta use the same scoring principles, a high-scoring segment pair from a blast result is usually the same as a fasta alignment. The only difference is that gaps are not allowed in selecting the HSP by blast (Lipman and Pearson, 1985; Pearson and Lipman, 1988; Altschul et al., 1990).
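The MSP definition above can be illustrated with a small brute-force sketch. It is not the heuristic that blast actually runs; it simply scans every diagonal of the comparison and keeps the best ungapped segment. The match and mismatch values (+5 / -4) and the function name are assumptions chosen for illustration only.

```python
def maximal_segment_pair(a, b, match=5, mismatch=-4):
    """Return (score, start_a, start_b, length) of the best ungapped segment pair.

    Brute-force illustration: every diagonal is scanned once, and the running
    score is reset whenever it falls to zero or below.
    """
    best = (0, 0, 0, 0)
    # A diagonal is fixed by the offset between positions in a and in b.
    for offset in range(-(len(b) - 1), len(a)):
        i, j = max(offset, 0), max(-offset, 0)
        run_score, run_start, k = 0, 0, 0
        while i + k < len(a) and j + k < len(b):
            if run_score == 0:
                run_start = k  # a new candidate segment starts here
            run_score += match if a[i + k] == b[j + k] else mismatch
            if run_score <= 0:
                run_score = 0
            elif run_score > best[0]:
                best = (run_score, i + run_start, j + run_start, k - run_start + 1)
            k += 1
    return best


if __name__ == "__main__":
    # Two short made-up sequences sharing the segment ACGTAC.
    print(maximal_segment_pair("GGACGTACGG", "TTACGTACTT"))
```

Because both segments must have the same length and no gaps are introduced, the search reduces to finding the best-scoring run along a single diagonal.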

blast and fasta both use a set of score matrixes for their measurements. All of the important parameters and options of fasta are included in the matrix, which is stored as Smatrix and can be opened and edited by using the -s option. As described in section 1.5, I changed the DNA score matrix to better mimic nucleic acid hybridization in my own fasta and lfasta comparisons (the original and the modified DNA score matrixes are shown in Table 3-1). The matrixes of blast are not alterable, especially in databases which have their own blast services. The only way to make the search satisfy special requirements is by resetting the values of parameters, and this is more limited.

3.2 The blast system.

The blast system includes five programs which perform the following tasks (Altschul et al., 1990):

- blastp: compares an amino acid query sequence against a protein sequence database.

- blastn: compares a nucleotide query sequence against a nucleotide sequence database.

- blastx: compares the six possible reading frame translation products of a nucleotide query sequence (both strands) against a protein sequence database.

- tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).

- tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

Table 3-1. Comparison of the standard and modified fasta DNA score matrixes. (The matrix values and the annotations describing the matrix file format are not legible in this copy.)

The blast programs used in this thesis are blastn (this chapter) and blastp (Chapter 5).

The sensitivity and speed of a blast search can be adjusted by setting the parameters w (word size), T (neighborhood word score threshold) and X (word hit extension drop-off score). Identification of an HSP begins with finding a segment of length w in the query sequence that has a score equal to or higher than T when aligned with a segment of the same length in a database sequence. This pair of segments is then extended in both directions along each sequence. The extension is stopped when the cumulative HSP score drops by the quantity X from its maximum achieved value. The HSP with the maximum achieved score is reported (Altschul et al., 1990).
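The extension step described above can be sketched as follows. This is a simplified illustration, not the blast source code: the word hit is assumed to be already found at the given positions, the match and mismatch values (+5 / -4) are assumed, and the names `extend_hit`, `w` and `x_drop` are hypothetical.

```python
def extend_hit(query, subject, q_pos, s_pos, w, x_drop, match=5, mismatch=-4):
    """Extend a word hit of length w without gaps in both directions.

    Extension in one direction stops when the running score has dropped by
    x_drop or more below the best score reached in that direction.
    Returns (score, query_start, subject_start, length) of the extended pair.
    """
    def pair_score(q, s):
        return match if q == s else mismatch

    def extend(step, q, s):
        best = total = best_len = n = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            total += pair_score(query[q], subject[s])
            n += 1
            if total > best:
                best, best_len = total, n
            elif best - total >= x_drop:
                break
            q += step
            s += step
        return best, best_len

    seed = sum(pair_score(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    right, right_len = extend(+1, q_pos + w, s_pos + w)
    left, left_len = extend(-1, q_pos - 1, s_pos - 1)
    return (seed + left + right, q_pos - left_len, s_pos - left_len,
            w + left_len + right_len)


if __name__ == "__main__":
    # Made-up sequences; the word hit CGTA starts at position 3 in both.
    print(extend_hit("GGACGTACGG", "TTACGTACTT", q_pos=3, s_pos=3, w=4, x_drop=8))
```

A larger x_drop lets the extension cross longer poorly matching stretches at the cost of more computation, which is the sensitivity and speed trade-off the parameter X controls.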

The statistical significance of the reported HSP is estimated by the parameter E (statistical significance). E is defined to be the upper bound of the expected frequency of random chance occurrence of an HSP within the context of the entire database search. Under the random sequence model described by Karlin and Altschul (1990), E can be interpreted as the expected number of matches happening only by chance during the search. It is related to the HSP score S by the Karlin - Altschul formula:

$$E = K N e^{-LS}$$

in which N is the product of the query and database sequence lengths (the size of the search space), and K and L are Karlin - Altschul parameters (Karlin and Altschul, 1990; 1993). E or S can be set by the user to limit the number of reported HSPs. blast will only report the HSPs with a score above the preset S and a chance of random matching lower than the preset E (Altschul et al., 1990). If E and S are both preset by the user, the stricter one is used by blast.

As an estimate of the significance of an HSP, blast calculates $P = 1 - e^{-E}$, the probability of an HSP being formed only by random chance (probability of random matching). The smaller P is, the more significant the HSP is. E and P approach equality at values ≤ 0.04.
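The relation between S, E and P can be checked numerically with a short sketch. The values of K and L used below are illustrative assumptions, not the parameters reported by the blast services; the query length and database size are those shown for the ncbi search in Fig. 3-1.

```python
import math


def expected_chance_hits(score, query_len, database_len, k, lam):
    """E = K * N * exp(-L * S), with N = query length * database length."""
    search_space = query_len * database_len
    return k * search_space * math.exp(-lam * score)


def chance_probability(e_value):
    """P = 1 - exp(-E); expm1 keeps precision when E is very small."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    K, LAM = 0.176, 0.192  # assumed, illustrative Karlin - Altschul parameters
    for s in (100, 150, 193):
        e = expected_chance_hits(s, 1556, 497_918_314, K, LAM)
        print(f"S={s}  E={e:.3g}  P={chance_probability(e):.3g}")
    # P and E converge as E becomes small.
    for e in (1000, 10, 1, 0.04, 0.001):
        print(f"E={e}  P={chance_probability(e):.6f}")
```

For large E the probability saturates at 1, so P only discriminates between HSPs when E is well below 1.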

3.3 The fasta system.

The fasta system contains a total of 12 programs (Pearson, 1995; Introduction of version 2.0 of fasta program):

- Sequence search programs: fasta, tfasta, ssearch and align.

- Local similarity programs: lfasta, plfasta, perfasla (Unix only), lalign and plalign.

- Statistical significance programs: prdf, relate and prss.

The fasta programs used in this thesis are fasta (Chapters 3 and 4) and lfasta (Chapter 4):

- fasta: universal sequence comparison between two protein or DNA sequences. It compares a query sequence to a sequence or library of sequences provided by the user, and reports only the single best alignment between the query sequence and each of the library sequences. Usually it is used to search for the best alignment with a query sequence.

- lfasta: local similarity searches showing local alignments. It compares two sequences looking for local similarity and reports all the alignments between the two sequences with scores higher than a cutoff value. Usually it is used to select, between the two chosen sequences, matching alignments which satisfy the user's special requirements.

(Pearson and Lipman, 1988; Pearson, 1990).


All fasta searches match two sequences through four steps, and three scores, init1, initn and opt, are calculated and reported (Table 3-2). fasta uses the "diagonal" method to find all regions of similarity between the two sequences, counts matches and penalizes for the intervening mismatches, then identifies regions of a diagonal that have the highest density of matches (Lipman and Pearson, 1985). The initial scoring of fasta alignments uses exactly the same principles as those used for the blast MSP, but fasta then uses the Smith - Waterman algorithm to optimize the initial scores with introduced gaps (Smith and Waterman, 1981).

Table 3-2. Characterization of sequence similarity by fasta (four steps)

Step 1: Identify regions shared by the two sequences with the highest density of identities (ktup = 1) or pairs of identities (ktup = 2).

Step 2: Rescan the ten regions with the highest density of identities using the PAM250 matrix (or DNA matrix). Trim the ends of the region to include only those residues contributing to the highest score. Each region is a partial alignment without gaps.

Step 3: (FASTA only) If there are several initial regions with scores greater than the CUTOFF value, check to see whether the trimmed initial regions can be joined to form an approximate alignment with gaps. Calculate a similarity score that is the sum of the joined initial regions minus a penalty (usually 20) for each gap. This initial similarity score (initn) is used to rank the library sequences. The score of the single best initial region found in Step 2 is reported (init1); it is the same as the initial similarity score calculated by FASTP.

Step 4: Construct a NWS optimal alignment of the query sequence and the library sequence, considering only those residues that lie in a band 32 residues wide centered on the best initial region found in Step 2. FASTA and FASTP both report this score as the optimized (opt) score.
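The arithmetic of Step 3 can be made concrete with a minimal sketch. It assumes the initial regions passed in have already been chosen as compatible for joining (fasta itself selects which regions to join), and the region scores in the example are made up. The function names are hypothetical.

```python
def init1_score(region_scores):
    """Score of the single best initial region (no gaps)."""
    return max(region_scores, default=0)


def initn_score(region_scores, gap_penalty=20):
    """Sum of the joined initial regions minus a penalty for each joining gap.

    Assumes every region in region_scores is part of the joined alignment,
    so n regions are connected by n - 1 gaps.
    """
    if not region_scores:
        return 0
    return sum(region_scores) - gap_penalty * (len(region_scores) - 1)


if __name__ == "__main__":
    regions = [60, 45, 30]  # hypothetical trimmed region scores
    print("init1 =", init1_score(regions))
    print("initn =", initn_score(regions))
```

This shows why initn can exceed init1 considerably when several moderately scoring regions lie close enough together to be joined.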


The sensitivity and speed of a fasta search can be set through the parameter ktup. ktup determines how many consecutive identities are required in a match. Its default value is two for protein or six for DNA. When ktup = 1 (for protein) or 3 (for DNA), the fasta search is most sensitive but slowest (Lipman and Pearson, 1985; Pearson, 1990).
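The effect of ktup can be seen in a small sketch of the word lookup that underlies the "diagonal" method. This is a simplified illustration under assumed names (`diagonal_hits` is hypothetical) and it only counts word matches per diagonal; the rescoring and trimming of Table 3-2 are not included.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, library, ktup=6):
    """Count exact ktup-long word matches on each diagonal (query pos - library pos)."""
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)
    hits = Counter()
    for j in range(len(library) - ktup + 1):
        for i in positions.get(library[j:j + ktup], ()):
            hits[i - j] += 1
    return hits


if __name__ == "__main__":
    q, lib = "GGACGTACGG", "TTACGTACTT"  # made-up sequences
    for k in (3, 6, 7):
        print(k, dict(diagonal_hits(q, lib, ktup=k)))
```

With a small ktup more diagonals collect hits, so more candidate regions are examined; with a large ktup short similarities are missed entirely, which makes the search faster but less sensitive.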

The statistical significance of a fasta similarity score can be estimated by using Karlin - Altschul statistics (E and S) as in blast (Karlin and Altschul, 1990; 1993). But in the fasta system, routines developed from a program, rdf2, can be used to estimate the significance directly by using the Z score:

$$Z = \frac{\text{similarity score} - \text{mean of random scores}}{\text{standard deviation of random scores}}$$

The mean of random scores is obtained by comparing the query sequence with randomly shuffled versions of the target sequence. If Z > 10, the score is significant; if 10 > Z > 3, it is possibly or probably significant; if Z < 3, it is insignificant (Pearson and Lipman, 1988).
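The shuffling procedure behind the Z score can be sketched as below. This is an illustration of the principle only, not the rdf2 program: the similarity score is a simple best ungapped segment score with assumed +5 / -4 values, and the number of shuffles, the seed and the example sequence are arbitrary choices.

```python
import random
import statistics


def best_ungapped_score(a, b, match=5, mismatch=-4):
    """Best-scoring ungapped segment over all diagonals (stand-in similarity score)."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        i, j, run = max(offset, 0), max(-offset, 0), 0
        while i < len(a) and j < len(b):
            run = max(0, run + (match if a[i] == b[j] else mismatch))
            best = max(best, run)
            i += 1
            j += 1
    return best


def z_score(query, target, score_fn, n_shuffles=100, seed=0):
    """Z = (observed score - mean of shuffled scores) / sd of shuffled scores."""
    rng = random.Random(seed)
    observed = score_fn(query, target)
    letters = list(target)
    random_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps base composition, destroys the order
        random_scores.append(score_fn(query, "".join(letters)))
    sd = statistics.stdev(random_scores)
    if sd == 0:
        raise ValueError("shuffled scores do not vary; Z is undefined")
    return (observed - statistics.mean(random_scores)) / sd


if __name__ == "__main__":
    seq = "ACGTTGCAAGCTTAGGCTAACGTCCGATAGCTTGACCATGGTACGATCAGTCCAGTAGGCT"  # arbitrary
    print(round(z_score(seq, seq, best_ungapped_score), 1))
```

Shuffling preserves the composition of the target, so the Z score measures how much of the observed score depends on the order of the bases rather than on their frequencies.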

3.4 Results of blastn searches of ncbi and ACeDB databases.

The query DNA sequences used with the blastn searches are those of the two probes used in the Southern hybridization: the 1.5 kb whole cDNA sequence of the Rh1 opsin gene, and the 0.8 kb Pst I fragment (the 3'-half without most of the noncoding end). Since the databases which I searched require use of their own blast services, I sent the query sequences to each service and reset all output parameters according to my requirements. Two databases were searched by using blastn: ncbi and ACeDB.


Because ncbi is a huge database, it includes a lot of genomic or cDNA sequences of other species with high similarity to the two probes. Therefore I had to expand the search to obtain enough C. elegans sequences by setting E = 1000 (the upper limit). I also lowered the cutoff score S to 100, and set the maximum number of output sequences V to 3000. Then, because the resulting document was too long to be transferred from ncbi to me, I set parameter B to 0 to cancel the showing of alignments. Thus, out of the ~ 3000 output sequences, I obtained 85 C. elegans cosmid sequences with the 1.5 kb probe (lowest score was 109), and 143 with the 0.8 kb probe (lowest score was 101). The top fourteen of the 85 sequences and the top eight of the 143 had random matching probability P < 0.05. Those 22 sequences are shown in Fig. 3-1 with their scores and P values.

The ACeDB database includes only C. elegans cosmid sequences. The ACeDB blast service has a fixed E = 10, and would not send out any output sequence with P = 1 or more. Even though I set V = 500 and B = 500 (the upper limits), it still applied the above limitations. Nevertheless, I obtained enough output sequence data: 229 sequences with the 1.5 kb probe (lowest score was 103) and 71 sequences with the 0.8 kb probe (lowest score was 101). Figure 3-2 lists those with P < 0.05, including 35 sequences with the 1.5 kb probe (including ten with sequence unfinished or locations unknown) and 17 sequences with the 0.8 kb probe (including two with sequence unfinished or locations unknown).

3.5 Result of fasta searches of the EMBL invertebrate genome database.

The EMBL database has a fasta service, and allows several offsets which restrict the search area to invertebrate genomic sequences. But only part of the C. elegans database is included.


Fig. 3-1: The blastn searches of the ncbi DNA sequence database. The C. elegans cosmids with P < 0.05 from searches with A) 1.5 kb query sequence and B) 0.8 kb query sequence are shown.


A) NCBI BLAST Search Results

Query = ninaE gene cDNA sequence (1556 bp) (1556 letters)

Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences; 329,017 sequences; 497,918,314 total letters.

Sequences producing High-scoring Segment Pairs (the High Score, Smallest Sum Probability P(N) and N columns are not legible in this copy):

gb|U64857|CELC37C3    Caenorhabditis elegans cosmid C37C3.
gb|U41264|CELF10E7    Caenorhabditis elegans cosmid F10E7.
emb|Z50874|CER10E4    Caenorhabditis elegans cosmid R10E4
emb|Z68011|CET21B6    Caenorhabditis elegans cosmid T21B6
emb|Z22179|CEF58A4    Caenorhabditis elegans cosmid F58A4
emb|Z68108|CET05A10   Caenorhabditis elegans cosmid T05A10
emb|Z67756|CER07A4    Caenorhabditis elegans cosmid R07A4
emb|Z83237|CER06B9    Caenorhabditis elegans cosmid R06B9
gb|L23648|CELF44B9    C. elegans cosmid F44B9.
gb|U39653|CELT13H2    Caenorhabditis elegans cosmid T13H2
emb|Z82270|CEF53H2    Caenorhabditis elegans cosmid F53H2
gb|U58751|CELC07G1    Caenorhabditis elegans cosmid C07G1.
emb|Z81494|CEF02E9    Caenorhabditis elegans cosmid F02E9
emb|Z82087|CEZK254    Caenorhabditis elegans cosmid ZK254

B) NCBI BLAST Search Results

Query = nina E cDNA PstI cut 0.8 kb probe (821 nt) (821 letters)

Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences; 329,017 sequences; 497,918,314 total letters.

Sequences producing High-scoring Segment Pairs (the High Score, Smallest Sum Probability P(N) and N columns are not legible in this copy):

gb|U64857|CELC37C3    Caenorhabditis elegans cosmid C37C3.
emb|Z50874|CER10E4    Caenorhabditis elegans cosmid R10E4
gb|U41264|CELF10E7    Caenorhabditis elegans cosmid F10E7.
gb|L23648|CELF44B9    C. elegans cosmid F44B9.
gb|U58751|CELC07G1    Caenorhabditis elegans cosmid C07G1.
emb|Z68011|CET21B6    Caenorhabditis elegans cosmid T21B6
emb|Z81494|CEF02E9    Caenorhabditis elegans cosmid F02E9
emb|Z82087|CEZK254    Caenorhabditis elegans cosmid ZK254


Fig. 3-2: The blastn searches of the ACeDB DNA sequence database. Cosmids with P < 0.05 from searches with A) 1.5 kb query sequence and B) 0.8 kb query sequence are shown.


A) BLASTN

Query = ninaE gene cDNA sequence (1556 bp) (1556 letters)

Database: /nfs/disk100/wormpub/analysis/Sequence_Databases/allcmid; 6587 sequences; 101,658,307 total letters.

Sequences producing High-scoring Segment Pairs (the High Score and Smallest Sum Probability P(N) columns are not legible in this copy):

Cosmid=Y64G10; Contig ID=01441; Length=1097; Order=Unkno...
C37C3
R10E4
T21B6
F10E7
F58A4
F13H6
T05A10
R07A4
R06B9
F44B9
T13H2
CC4
F53A3
F53H2
a01b12.00928 STLOUIS UNFINISHED DATA FROM CHROMOSOME 1
Cosmid=Y7A9; Contig ID=00336; Length=203925; Order=Unkno...
C07G1
F44A6
B0041
F02E9
C32A3
Cosmid=C11F10; Contig ID=00412; Length=12683; Order=Unkn...
Y7A9C
C48C7
F08B6.Contig20 STLOUIS UNFINISHED DATA FROM CHROMOSOME I
F39B2
Cosmid=Y6~3; Contig ID=00388; Length=10577; Order=Unknow...
F53B7
Cosmid=Y39B6; Contig ID=00318; Length=1641; Order=Unknow...
Cosmid=30413; Contig ID=00699; Length=40567; Order=Unkno...
F15A4
Cosmid=Y54E2; Contig ID=01907; Length=162212; Order=Unkn...
Cosmid=Y54E5; Contig ID=00725; Length=89018; Order=Unkno...
F13H8

B) BLASTN

Query = nina E cDNA PstI cut 0.8 kb probe (821 nt) (821 letters)

Database: /nfs/disk100/wormpub/analysis/Sequence_Databases/allcmid; 6587 sequences; 101,658,307 total letters.

Sequences producing High-scoring Segment Pairs (the High Score and Smallest Sum Probability P(N) columns are not legible in this copy):

Cosmid=Y64G10; Contig ID=01441; Length=1087; Order=Unkno...
R10E4
C37C3
F10E7
F44B9
C07G1
F13H6
T21B6
F08B6.Contig20 STLOUIS UNFINISHED DATA FROM CHROMOSOME I
F02E9
F53A3
R12B2
C14B9
ZK783
C07A12
F48C11
C10C5


Most of the parameters, including the score matrix and gap penalty, have been fixed by the EMBL fasta service. Only the output number of alignments can be set by the user, with an upper limit of 100. Therefore, few C. elegans sequences were obtained.

Among the ~ 100 output sequences for each probe, 14 C. elegans cosmid sequences were obtained with the 1.5 kb probe search (init1 scores from 81 to 127) and 19 sequences were obtained with the 0.8 kb probe search (init1 scores from 94 to 125). These cosmids are listed with their three scores in Fig. 3-3.

3.6 Cosmid sequence selection for further local similarity comparisons.

Because all resulting C. elegans cosmid sequences of the blastn searches obtained with the Rh1 opsin query sequences have similar HSP scores (101 to 193), it is better to use the P value of the alignments as the criterion, selecting the cosmid sequences whose high similarity scores are least likely to arise by random chance. I chose the six cosmid sequences with the lowest P values from each blastn search of ncbi (1.5 kb query and 0.8 kb query). A total of eight cosmid sequences were selected. Four of them appear on both lists in similar order. These are listed in Table 3-3, and will be used in local similarity comparisons in Chapter 4.
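The selection rule can be restated as a short sketch. The two lists below are the C. elegans cosmids of Fig. 3-1 in the order in which they appear there; treating that order as the ranking by P value is an assumption, and the function name is hypothetical.

```python
# C. elegans cosmids from the ncbi blastn searches, in Fig. 3-1 order
# (assumed here to be the ranking from lowest to highest P value).
NCBI_15KB = ["C37C3", "F10E7", "R10E4", "T21B6", "F58A4", "T05A10", "R07A4",
             "R06B9", "F44B9", "T13H2", "F53H2", "C07G1", "F02E9", "ZK254"]
NCBI_08KB = ["C37C3", "R10E4", "F10E7", "F44B9", "C07G1", "T21B6", "F02E9", "ZK254"]


def select_cosmids(ranked_lists, n=6):
    """Take the first n cosmids of each ranked list.

    Returns (selected, shared): the union of the top-n sets and the cosmids
    present in every top-n set, both sorted by name.
    """
    tops = [set(ranked[:n]) for ranked in ranked_lists]
    selected = sorted(set.union(*tops))
    shared = sorted(set.intersection(*tops))
    return selected, shared


if __name__ == "__main__":
    selected, shared = select_cosmids([NCBI_15KB, NCBI_08KB])
    print(len(selected), selected)
    print(len(shared), shared)
```

Under that ordering assumption, taking six cosmids from each search gives eight distinct cosmids, four of which are common to both queries, in agreement with the counts stated above.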

The results of the blastn searches of ACeDB are almost the same as the results for the ncbi database (compare Fig. 3-1 with Fig. 3-2). If Y64G10 and F13H6 are ignored, exactly the same eight cosmid sequences occur on both lists with almost the same order of HSP scores and P values. These are listed in Table 3-3. Y64G10 and F13H6 are two newly sequenced cosmids. Y64G10 has not been sequenced completely and its HSP score may be artifactually high. F13H6 is


Fig. 3-3: The fasta searches of the EMBL invertebrate genome database.

Highest scoring C. elegans cosmids from searches with A) 1.5 kb

query sequence and B) 0.8 kb query sequence are shown.


A) (Nucleotide) FASTA of: 20a05307.Seq from: 1 to: 1,556

TO: EINV:* Sequences: 18,744 Symbols: 67,079,274 Word Size: 3

The best scores are:                                                     init1 initn  opt

Eminv:Cec07a9        Z29094  Caenorhabditis elegans cosmid C07A9...        97   226   136
Eminv:Cec34c6        Z66494  Caenorhabditis elegans cosmid C34C6...        90   222   102
Eminv:Cem01f1 /rev   Z56381  Caenorhabditis elegans cosmid ...             91   212    99
Eminv:Cer11g1 /rev   U41016  Caenorhabditis elegans cosmid ...             86   211    91
Eminv:Cef07c6 /rev   Z69659  Caenorhabditis elegans cosmid ...            103   203   103
Eminv:Cezk84         U23181  Caenorhabditis elegans cosmid ZK84....        81   200    81
Eminv:Cer10e4 /rev   Z50874  Caenorhabditis elegans cosmid ...            125   197   149
Eminv:Cer07a4        Z67756  Caenorhabditis elegans cosmid R07A4...       127   192   183
Eminv:Cet05a10       Z68108  Caenorhabditis elegans cosmid T05A...        127   192   183
Eminv:Cef44a6        Z50858  Caenorhabditis elegans cosmid F44A6...       116   191   123
Eminv:Cer10h10       Z70686  Caenorhabditis elegans cosmid R10H...        111   183   160
Eminv:Ceb0416        U23516  Caenorhabditis elegans cosmid B0416...       107   181   124
Eminv:Cef07c3        U50308  Caenorhabditis elegans cosmid F07C3...       114   180   147
Eminv:Cec54h2        U58728  Caenorhabditis elegans cosmid C54H2...       111   176   117

B) (Nucleotide) FASTA of: 20a024b:.Seq from: 1 to: 821

TO: EINV:* Sequences: 18,744 Symbols: 67,079,274 Word Size: 3

The best scores are:                                                     init1 initn  opt

Eminv:Cer10e4 /rev   Z50874  Caenorhabditis elegans cosmid ...            113   197   185
Eminv:Cet21b6 /rev   Z68011  Caenorhabditis elegans cosmid ...            125   193   126
Eminv:Cec07g1 /rev   U58751  Caenorhabditis elegans cosmid ...            119   178   127
Eminv:Cec14b9 /rev   L15188  C. elegans cosmid C14B9. 10/94               117   172   117
Eminv:Cec32a3 /rev   Z48241  Caenorhabditis elegans cosmid ...            114   172   133
Eminv:Cer07a4        Z67756  Caenorhabditis elegans cosmid R07A4...       111   170   142
Eminv:Cef55d12 /rev  Z75542  Caenorhabditis elegans cosmid ...            101   166   110
Eminv:Cezk783 /rev   U13646  Caenorhabditis elegans cosmid ...            103   163   103
Eminv:Cet05a10       Z68108  Caenorhabditis elegans cosmid T05A...        111   160   142
Eminv:Cef56h9 /rev   Z74471  Caenorhabditis elegans cosmid ...             96   159   169
Eminv:Cec49h3        U52436  Caenorhabditis elegans cosmid C49H3...       111   159   111
Eminv:Cec18h9        U23147  Caenorhabditis elegans cosmid C18H9...       106   159   121
Eminv:Cef57b9 /rev   U13876  Caenorhabditis elegans cosmid ...             99   158   139
Eminv:Cef2~c11 /rev  Z47072  Caenorhabditis elegans cosmid ...             99   157   107
Eminv:Cet18h9        U41746  Caenorhabditis elegans cosmid T18H9...       106   155   143
Eminv:Cef07c3        U50308  Caenorhabditis elegans cosmid F07C3...       106   155   143
Eminv:Cer04d3 /rev   Z70212  Caenorhabditis elegans cosmid ...             96   154    98
Eminv:Cet17h7 /rev   U42841  Caenorhabditis elegans cosmid ...            104   154   106
Eminv:Cef13b9        U39853  Caenorhabditis elegans cosmid F13B9...        94   151   104


now sequenced but the complete sequence was not reported in ACeDB. Neither cosmid has been

reported to ncbi as of July 15, 1997. Therefore I had to ignore them. They should be

investigated once complete information is available.

The fasta search results from EMBL were examined for the above eight selected cosmid sequences. The sequences found were R10E4, T21B6, C07G1 and T05A10 with the 0.8 kb query, and T05A10 and R10E4 with the 1.5 kb query. These had among the highest init1 scores (the first calculated score, for the alignment without gaps) of all the C. elegans cosmids reported. There are two reasons why the others of the eight selected cosmids were not on these two lists. First, since the fasta service of EMBL arranges its output in the order of initn scores, which are calculated with gaps allowed, some sequences with lower init1 scores would be listed higher, displacing the other sequences. Second, only the first 100 invertebrate sequences were obtainable for each search, and the other C. elegans cosmids could have been displaced by other invertebrate sequences having higher similarity to the queries. More recent searches listed no C. elegans cosmid with the 1.5 kb query and only five with the 0.8 kb query. Apparently many newly reported invertebrate sequences are more similar to the queries than the C. elegans cosmids are. Thus, the limited results of the fasta searches of the EMBL invertebrate database partly support the eight cosmids selected by the blastn searches.

Because the highest HSP score of the blastn searches is only 193, and the highest initn

score (with gaps to improve matching) of fasta searches is only 226, it is unlikely that a long

region in the C. elegans genome is significantly identical to the D. melanogaster opsin cDNA.

But these genome searches indicate that some short identical regions between the C. elegans


Table 3-3: Selected C. elegans cosmid sequences with high similarity to Rh1 cDNA

             ncbi                                  ACeDB

1.5 kb query        0.8 kb query        1.5 kb query        0.8 kb query

HSP score  P value  HSP score  P value  HSP score  P value  HSP score  P value

The cosmids were selected from the highest scoring cosmids in the ncbi and ACeDB searches.

# - These cosmids had a P value higher than the threshold (P < 0.05) in the blastn search with the 0.8 kb probe in ncbi.

Underlined P values are higher than the lowest six (in ncbi) or the lowest eight (in ACeDB).


genome and the opsin cDNA exist, and the eight selected cosmids should contain these regions.

Can these regions hybridize with the cDNA probes under medium or low stringency? This

question will be answered in the next chapter by using modified local similarity comparisons and

hybridization theory.


Chapter 4

Local Similarity Between Selected C. elegans Cosmid

Sequences and the D. melanogaster opsin Rh1 cDNA


In Chapter 3, eight cosmids were selected from the sequenced cosmids of the C. elegans genome as having the highest sequence similarities to the D. melanogaster Rh1 opsin cDNA. Do any of these neighbor the map locations of the five cloned inserts which hybridize with the Rh1 probes? Do any of the cosmids neighboring the insert locations have sequence similarity to the probes? Is there any relationship between sequence similarity and hybridization results? These questions will be considered in this chapter.

4.1 Identification of sequenced cosmids that may contain the five cloned inserts.

Since all five phage inserts have been located on the physical maps of the C. elegans genome, I searched in ACeDB for sequenced cosmids which neighbor these locations (Fig. 4-1). Among the cosmids neighboring the five locations, nine have been sequenced. These are identified as Group 1 or 2 in Table 4-1. Of the nine neighboring cosmids, three are included among the eight cosmids identified in Chapter 3 by their high blastn HSP scores. These are classified as Group 1: C37C3, F58A4 and T21B6. As well as neighboring three phage insert locations, they also have the first, third and fourth highest blastn HSP scores of the eight cosmids (Table 4-1). Of the other six neighboring cosmids (classified as Group 2), C37A2 and ZK742 were identified by a blastn search with both query sequences (C37A2) or only the 1.5 kb Rh1 query sequence (ZK742), but with lower HSP scores than the eight selected in Chapter 3 (see Table 4-1). The other four could not be found in the blastn and fasta searches. This suggests that their blastn HSP scores with the Rh1 query sequences are very low (HSP score < 100).

On the other hand, five of the eight cosmids selected for their high blastn HSP scores are not close to any of the five phage clone locations (Table 4-1). Thus, these Group 3 cosmids could not contain the phage inserts which hybridize with the Rh1 probes. Can local similarity comparisons distinguish these cosmids from those which do hybridize? I shall test this.

Fig. 4-1: The physical map locations of the five phage inserts (BC#S401 - BC#S405) which hybridize with Rh1 probes, showing the cosmids near these locations. The bar at the bottom indicates a sequenced region of the chromosome. The darkened region on the bar is the relative location of the cosmid with the highlighted name. Cosmids with boxed names have been sequenced.

Table 4-1: The C. elegans cosmids selected for the local similarity comparison.

Cosmid (1)   Chromosome   Located near      blastn HSP   Rank of     Group
                          the insert (2)    score (3)    the score
C37A2        I            S401              146          9           2
F10E7        II           -                 178          2           3
C15H7        III          S403              < 100        -           2
F58A4        III          S403              176          3           1
C48D5        III          S404              < 100        -           2
C54C6        III          S404              < 100        -           2
F44B9        III          -                 163          5           3
R10E4        III          -                 154          8           3
C07G1        IV           -                 155          7           3
C37C3        V            S402              193          1           1
ZK742        V            S402              131          10          2
T21B6        X            S405              164          4           1
K09C8        X            S405              < 100        -           2
T05A10       X            -                 162          6           3

1. Cosmids are selected according to their blastn search scores or their chromosomal locations.

2. Inserts hybridizing with the D. melanogaster Rh1 opsin cDNA probes.

3. Identity score of the search with the whole 1.5 kb Rh1 cDNA as the query.

4.2 Results of the fasta scans and the lfasta comparisons.

The objective here was to select, from each cosmid and probe combination, the alignments which are most likely to hybridize. How these alignments were analyzed will be discussed in the next section. A total of 14 cosmids were processed: the eight with the highest blastn scores and the nine neighboring the phage insert locations, three of which are included in both selections. Their classification into three groups is summarized as follows:

Group 1: C37C3, F58A4 and T21B6. These cosmids neighbor insert locations and have a high blastn HSP score to the probes. They may contain the sequences which hybridize with Rh1 probes, but this needs to be confirmed.

Group 2: C37A2, ZK742, C15H7, C48D5, C54C6 and K09C8. These also neighbor the insert locations, but have a low HSP score with the probes. It appears that they may not contain the hybridizing inserts, but this needs to be established.

Group 3: F10E7, F44B9, T05A10, C07G1 and R10E4. These have no relationship to insert locations, but have HSP scores at the same level as Group 1. They cannot contain the sequences cloned in phage, and thus were not identified in a screen of a genomic library by Southern hybridization. Can this be explained by analysis of their local alignments? This should be tested.
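The three groups follow from simple set arithmetic on the two selections. The sketch below restates that bookkeeping with the cosmid names from Table 4-1; the variable names are illustrative and are not part of the original analysis.

```python
# Cosmid names come from Table 4-1; variable names are illustrative only.
top_blastn = {"C37C3", "F10E7", "F58A4", "T21B6",
              "F44B9", "T05A10", "C07G1", "R10E4"}      # eight highest HSP scores
near_insert = {"C37A2", "C15H7", "F58A4", "C48D5", "C54C6",
               "C37C3", "ZK742", "T21B6", "K09C8"}      # nine neighbors of inserts

group1 = top_blastn & near_insert    # high score and near an insert
group2 = near_insert - top_blastn    # near an insert, low score
group3 = top_blastn - near_insert    # high score, far from any insert
all_cosmids = top_blastn | near_insert

print(sorted(group1), len(group2), len(group3), len(all_cosmids))
```

The union has 14 members because three cosmids fall in both selections.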


fasta scans and lfasta comparisons were used to compare each of the 14 cosmid sequences with both the 1.5 kb and the 0.8 kb Rh1 queries. To mimic hybridization, I modified the DNA score matrix in both fasta and lfasta to disallow gaps, and increased the matching scores of nucleotides G and C to account for the higher stability of the G-C pair in the hybridized duplex. The fasta scan provides the one local alignment with the highest identity scores (init1 and initn scores, which are the same because gaps are disallowed) for each pairing of probe and cosmid. These alignments are shown in Figures 4-2 to 4-6. For about half of the cosmids, the best alignment with the 0.8 kb query was not the same as with the 1.5 kb query.
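The effect of a gap-free scoring scheme that rewards G-C matches more than A-T matches can be illustrated with a minimal sketch. This is not the fasta program itself: it only scores one ungapped diagonal and finds its best-scoring segment. The numeric scores are hypothetical, since the text does not list the values actually used in the modified matrix.

```python
# Hypothetical scores (assumption): the modified matrix values are not
# given in the text, only that G and C matches score higher.
MATCH_SCORE = {"A": 4, "T": 4, "G": 6, "C": 6}
MISMATCH_SCORE = -3

def best_ungapped_segment(probe, target):
    """Return (score, start, end) of the best-scoring ungapped segment.

    probe and target are equal-length strings lying on one diagonal, so
    position i of probe pairs with position i of target (no gaps).
    end is exclusive.
    """
    if len(probe) != len(target):
        raise ValueError("sequences on one diagonal must have equal length")
    best = (0, 0, 0)
    running, start = 0, 0
    for i, (p, t) in enumerate(zip(probe.upper(), target.upper())):
        score = MATCH_SCORE.get(p, 0) if p == t else MISMATCH_SCORE
        if running <= 0:            # restart the segment after a losing run
            running, start = 0, i
        running += score
        if running > best[0]:
            best = (running, start, i + 1)
    return best

print(best_ungapped_segment("TTGCGCAA", "AAGCGCTT"))
```

With these assumed scores, a run of G-C matches outscores an equally long run of A-T matches, which is the bias the modification is meant to introduce.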

The matching regions of these fasta alignments, though having high identity, were generally not long enough for hybrid stability, and were often in the noncoding poly-A region of the probes. They were not very useful for identifying the alignments likely to hybridize. It was therefore much more useful to select from the local similarity alignments provided by lfasta. For each cosmid-probe comparison, lfasta identified, on average, more than 20 alignments with init1 scores higher than the default threshold. It provides the percent identity and length of each matching region. From these data I selected the one alignment most likely to hybridize for each cosmid-probe comparison. The selection was based on the minimum criteria of nucleic acid hybridization, which are discussed in Section 1.4 (also see the next section). I selected the alignment which had the longest matching region with > 45 % identity and which contained at least two short (> 8 bp) high-identity (> 70 %) regions. Any alignment with the matching region in the noncoding poly-A part of the probe was ignored if it was possible to choose another one with a similar length and % identity. All these best alignments are shown in Figures 4-7 to 4-11. Only a few of them are the same as the maximum-score alignments from the fasta scan.
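The selection rule described above can be sketched as a small filter over candidate alignments. The `Candidate` record and its field names are hypothetical, and the poly-A rule is simplified: here any qualifying alignment outside the poly-A region is preferred, whereas the text only discards a poly-A alignment when an alternative of similar length and identity exists.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """One lfasta-style alignment summary (hypothetical record layout)."""
    length: int                  # length of the matching region, bp
    identity: float              # percent identity over that region
    high_identity_regions: int   # short (> 8 bp) regions with > 70 % identity
    in_poly_a: bool              # match lies in the probe's noncoding poly-A part

def select_alignment(candidates):
    """Pick the longest alignment that passes the minimum criteria."""
    qualifying = [c for c in candidates
                  if c.identity > 45.0 and c.high_identity_regions >= 2]
    if not qualifying:
        return None
    # Simplifying assumption: prefer alignments outside the poly-A region
    # whenever at least one of them qualifies.
    outside_poly_a = [c for c in qualifying if not c.in_poly_a]
    pool = outside_poly_a or qualifying
    return max(pool, key=lambda c: c.length)

# Illustrative values only; they are not taken from the lfasta output.
example = [
    Candidate(120, 50.0, 3, True),
    Candidate(103, 47.6, 2, False),
    Candidate(150, 40.0, 4, False),
    Candidate(90, 60.0, 1, False),
]
print(select_alignment(example))
```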

Fig. 4-2: Alignments of fasta scans of the cosmids of Group 1 (C37C3, F58A4 and T21B6). Column I, with the 0.8 kb probe; Column II, with the 1.5 kb probe. Probe sequences are labeled ninaE. Nucleotide position number on the 1.5 kb probe sequence is 577 higher than for the same nucleotide on the 0.8 kb probe sequence.

Fig. 4-3: Alignments of fasta scans of the cosmids of Group 2 (C37A2, ZK742 and C15H7). Column I, with the 0.8 kb probe; Column II, with the 1.5 kb probe. Probe sequences are labeled ninaE. Nucleotide position number on the 1.5 kb probe sequence is 577 higher than for the same nucleotide on the 0.8 kb probe sequence.

[Alignment panels not recoverable from the scan. The one legible header fragment is for the ZK742 cosmid (opt: 92; 47 nt overlap).]

Fig. 4-4: Alignments of fasta scans of the cosmids of Group 2 (C48D5, C54C6 and K09C8). Column I, with the 0.8 kb probe; Column II, with the 1.5 kb probe. Probe sequences are labeled ninaE. Nucleotide position number on the 1.5 kb probe sequence is 577 higher than for the same nucleotide on the 0.8 kb probe sequence.

Fig. 4-5: Alignments of fasta scans of the cosmids of Group 3 (F10E7, F44B9 and T05A10). Column I, with the 0.8 kb probe; Column II, with the 1.5 kb probe. Probe sequences are labeled ninaE. Nucleotide position number on the 1.5 kb probe sequence is 577 higher than for the same nucleotide on the 0.8 kb probe sequence.

Fig. 4-6: Alignments of fasta scans of the cosmids of Group 3 (C07G1 and R10E4). Column I, with the 0.8 kb probe; Column II, with the 1.5 kb probe. Probe sequences are labeled ninaE. Nucleotide position number on the 1.5 kb probe sequence is 577 higher than for the same nucleotide on the 0.8 kb probe sequence.

Fig. 4-7: The lfasta alignment which most favors hybridization of the Group 1 cosmids (C37C3, F58A4 and T21B6). Column I, with the 0.8 kb probe; Column II, with the 1.5 kb probe. Probe sequences are labeled ninaE. Nucleotide position number on the 1.5 kb probe sequence is 577 higher than for the same nucleotide on the 0.8 kb probe sequence. Identity of regions boxed is more than 70 %.

[Alignment panels not recoverable from the scan. The legible lfasta header fragments report 47.6% identity in a 103 nt overlap (init: 90, opt: 90) in both columns of one panel.]

Fig. 4-8: The lfasta alignment which most favors hybridization of the Group 2 cosmids (C37A2, ZK742 and C15H7). Column I, with the 0.8 kb probe; Column II, with the 1.5 kb probe. Probe sequences are labeled ninaE. Nucleotide position number on the 1.5 kb probe sequence is 577 higher than for the same nucleotide on the 0.8 kb probe sequence. Identity of regions boxed is more than 70 %.

[Alignment panels not recoverable from the scan. Legible lfasta header fragments for ZK742 report 49.0% identity in a 100 nt overlap with the 0.8 kb probe and 52.1% identity in a 113 nt overlap (init: 110, opt: 110) with the 1.5 kb probe.]

Fig. 4-9: The lfasta alignment which most favors hybridization of the Group 2 cosmids (C48D5, C54C6 and K09C8). Column I, with the 0.8 kb probe; Column II, with the 1.5 kb probe. Probe sequences are labeled ninaE. Nucleotide position number on the 1.5 kb probe sequence is 577 higher than for the same nucleotide on the 0.8 kb probe sequence. Identity of regions boxed is more than 70 %.

[Alignment panels not recoverable from the scan. Legible lfasta header fragments give overlap lengths of 54 nt (0.8 kb probe) and 101 nt (1.5 kb probe) for C48D5, and 91 nt with both probes for K09C8.]

Fig. 4-10: The lfasta alignment which most favors hybridization of the Group 3 cosmids (T05A10, R10E4 and C07G1). Column I, with the 0.8 kb probe; Column II, with the 1.5 kb probe. Probe sequences are labeled ninaE. Nucleotide position number on the 1.5 kb probe sequence is 577 higher than for the same nucleotide on the 0.8 kb probe sequence. Identity of regions boxed is more than 70 %.

[Alignment panels not recoverable from the scan. Legible lfasta header fragments for C07G1 report 44.1% identity in a 101 nt overlap (init: 61, opt: 61) with the 0.8 kb probe and a 114 nt overlap with the 1.5 kb probe.]

Fig. 4-11: The lfasta alignment which most favors hybridization of the Group 3 cosmids (F10E7 and F44B9). Column I, with the 0.8 kb probe; Column II, with the 1.5 kb probe. Probe sequences are labeled ninaE. Nucleotide position number on the 1.5 kb probe sequence is 577 higher than for the same nucleotide on the 0.8 kb probe sequence. Identity of regions boxed is more than 70 %.

[Alignment panels not recoverable from the scan. The legible lfasta header fragment for F10E7 with the 0.8 kb probe reports a 91 nt overlap (init: 101, opt: 101).]

4.3 Analysis of the selected lfasta alignments.

The kinetics of hybridization were described in Section 1.4. For hybridization to initiate, there must be some (at least two) short regions of high (> 70 %) identity between the two strands involved. Initiation is followed rapidly by zippering. The stability of the resulting hybridizing region is related to the melting temperature, Tm, of the duplex. The upper limit of mismatch, P, is directly related to Tm, and P can be estimated with the equation quoted in Section 1.4. Under standard low stringency conditions [Na+] is assumed to be ~1.0 N and, if % G + C is assumed to be 50, the equation becomes:

$$T_m\,(^{\circ}\mathrm{C}) \approx 98 - \frac{500}{D} - P$$

[D = hybridizing region length in bp; P = % mismatch]

For a hybridized duplex to be stable at low stringency, Tm should be > 40 °C. For this to be possible, the hybridizing region must be long enough (large D) and the % mismatch must be low enough. As explained in Section 1.4, I chose D > 100 bp. Under these conditions, the upper limit of mismatch, P, is 53 to 57 %, depending on the D value.
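The mismatch limit follows directly from the simplified equation by setting Tm to the 40 °C stability threshold and solving for P. A minimal sketch is given below; the function names are illustrative, and D = 500 bp is a value chosen here only to show where the upper end of the range comes from.

```python
STABILITY_THRESHOLD_C = 40.0   # minimum Tm for a stable duplex at low stringency

def melting_temperature(length_bp, mismatch_pct):
    """Simplified Tm (deg C) for ~1.0 N Na+ and 50 % G + C."""
    return 98.0 - 500.0 / length_bp - mismatch_pct

def max_mismatch(length_bp, min_tm=STABILITY_THRESHOLD_C):
    """Largest % mismatch P for which Tm still reaches min_tm."""
    return 98.0 - 500.0 / length_bp - min_tm

print(max_mismatch(100))   # shortest region considered (D = 100 bp)
print(max_mismatch(500))   # a longer region, chosen for illustration
```

For D = 100 bp the limit is 53 %, and it rises toward 58 % as D grows, which matches the 53 to 57 % range stated above.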

Thus, in order for an alignment provided by an lfasta comparison to form a stable hybridizing region, the following criteria must be satisfied:

1) The hybridizing region must include at least two short regions (> 8 bp) with high similarity (> 70 % identity) to initiate nucleation, and

2) The hybrid must include a region longer than 100 bp with less than 53 to 57 % mismatch (> 45 % similarity) for hybrid stability.
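The two criteria can be checked mechanically on an ungapped alignment. The sketch below is one possible reading, not the procedure used in the thesis: nucleation regions are counted as non-overlapping 9 bp windows (an assumed window size, the smallest that is "> 8 bp") with more than 70 % identity, and stability is tested with the simplified Tm equation and the 40 °C threshold.

```python
def count_nucleation_regions(a, b, window=9, min_identity=70.0):
    """Count non-overlapping windows whose identity exceeds min_identity.

    a and b are aligned, equal-length strings without gaps. The window
    size of 9 bp is an assumption made for this sketch.
    """
    flags = [x == y for x, y in zip(a.upper(), b.upper())]
    count, i = 0, 0
    while i + window <= len(flags):
        pct = 100.0 * sum(flags[i:i + window]) / window
        if pct > min_identity:
            count += 1
            i += window        # skip past this region before looking again
        else:
            i += 1
    return count

def meets_hybridization_criteria(a, b):
    """Apply criterion 1 (nucleation) and criterion 2 (stability)."""
    length = min(len(a), len(b))
    if length == 0:
        return False
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    mismatch_pct = 100.0 * (length - matches) / length
    tm = 98.0 - 500.0 / length - mismatch_pct
    enough_nucleation = count_nucleation_regions(a, b) >= 2
    stable = length > 100 and tm > 40.0
    return enough_nucleation and stable

probe = "ACGT" * 30                        # 120 bp illustrative sequence
print(meets_hybridization_criteria(probe, probe))
print(meets_hybridization_criteria(probe, "TGCA" * 30))
```

Counting non-overlapping windows keeps a long, uniformly similar stretch from being treated as a single region; other ways of delimiting the short regions are equally defensible.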

Using these criteria, I analyzed, for likelihood of hybridization, the most favorable alignments I

selected from each of the 28 lfasta cosmid-probe comparisons.

Each cosmid of Group 1 (C37C3, F58A4 and T21B6) had an alignment in the coding region of the Rh1 opsin cDNA which meets all the requirements for triggering hybridization and for stability of the hybridizing region. Therefore, hybridization is predicted to occur under conditions of low stringency between these cosmid sequences and the cDNA probes. Also, the favored alignments of each cosmid with the 1.5 kb and the 0.8 kb probes are exactly the same (Fig. 4-7). This means that these cosmids each contain only one special sequence (most probably the hybridizing portion of the insert) which best matches one part of the cDNA probe.

Three Group 2 cosmids are neighbors of Group 1 cosmids: ZK742 with C37C3, C15H7 with F58A4 and K09C8 with T21B6. But these are predicted not to hybridize with the probes, and thus not to contain the neighboring insert sequences: none of the alignments of C15H7 and K09C8 are long enough (D < 100), and their best alignments with the 1.5 kb probe are all in noncoding regions of the cDNA. Also, C15H7's alignment with the 0.8 kb probe does not have a region with high enough identity to trigger hybridization (Fig. 4-8 and Fig. 4-9). The alignment of ZK742 with the 0.8 kb probe has only one short highly identical region, which would be unlikely to trigger hybridization. Its 1.5 kb alignment is located in the 3'-noncoding region of the probe, where much of the identity is poly-A (Fig. 4-8). Thus, it is clear that, of the cosmid pairs neighboring S402, S403 and S405, only the Group 1 cosmids contain the hybridizing region of the insert.

On the other hand, C37A2, the Group 2 cosmid near S401, has the same alignments with

both probes, and both meet the minimum criteria of hybridization (Fig. 4-8).

Of the pair of Group 2 cosmids which neighbor S404 (C54C6 and C48D5), C54C6 has the same alignment with both probes, and the alignment meets the minimum criteria. For the 0.8 kb probe alignment of the other (C48D5), the similar region is too short but, interestingly, the 1.5 kb alignment just satisfies the requirements for hybridization (Fig. 4-9). Its region of identity with the 1.5 kb probe is in the 5'-half of the Rh1 cDNA sequence, and is not included in the 0.8 kb probe. Since C48D5 and C54C6 partially overlap, and the S404 clone hybridized with both the 0.8 kb probe and the 0.6 kb probe (5'-segment of the cDNA), it is possible that C48D5 contains a part of the S404 sequence that hybridizes with the 1.5 kb Rh1 probe under low stringency, but not the part which hybridizes with the 0.8 kb probe. The hybridizing regions of these two cosmids were analyzed, and the result showed that these two regions could not overlap (Fig. 4-9). Thus, it appears that C48D5 is not likely to contain the same S404 insert sequence as C54C6 does.

Summarizing, among the Group 2 cosmids:

C37A2 and C54C6 each contain a special sequence (most probably the insert) that hybridizes with both Rh1 probes.

C48D5 may hybridize with the 1.5 kb probe, but not the 0.8 kb probe, and probably does not contain the same insert as C54C6 does.

ZK742, C15H7 and K09C8 were not predicted to hybridize and do not contain the neighboring insert sequences.

The five members of Group 3 are high-HSP scoring cosmids which are distant from the mapped inserts. Three of these are predicted not to hybridize: the four alignments of T05A10 and R10E4 are all too short (D < 100), and the alignment of R10E4 with the 1.5 kb probe is in the 3'-noncoding poly-A region of the probe. Also, none of these four alignments have enough identity to trigger and stabilize the hybridization (Fig. 4-10). Similarly, the alignment of C07G1 with the 0.8 kb probe could not meet the criterion for duplex stability (Tm < 40 °C), and its 1.5 kb alignment does not have high enough identity to trigger the hybridization (Fig. 4-10).

Surprisingly, two cosmids of Group 3 are predicted to hybridize. The 0.8 kb alignments of F10E7 and F44B9 are too short and have few highly identical regions, but their 1.5 kb alignments favor hybridization. The regions of identity with the 1.5 kb probe are all in the 5'-half of the Rh1 sequence. Since these two cosmids are located far from the positions of the phage inserts on the maps (F10E7 is even on a different chromosome), they certainly have no relationship with Group 1 (Fig. 4-11). It is possible that these two sequences, predicted to hybridize, were not identified during screening of the C. elegans genomic library by Southern hybridization (Section 1.3).

In summary, of the Group 3 cosmids, F10E7 and F44B9 are predicted to hybridize with the 5'-half of the 1.5 kb probe. The other three cosmids have no likelihood of hybridization with the Rh1 probes.

4.4 Conclusions

A total of 14 selected cosmids were analyzed for their likelihood of hybridization by selecting alignments from the data of a modified lfasta comparison and applying the minimum criteria of hybridization. The method correctly predicted that:

Eight of the 14 C. elegans cosmid sequences should hybridize with the D. melanogaster opsin cDNA probes, and the other six should not.

Five of the eight cosmids which should hybridize are identified with the five phage clone inserts selected by Southern hybridization. Thus their hybridization with the cDNA probes is confirmed.

Three of the five Group 3 cosmids, which were not expected to hybridize because their map locations are distant from the five hybridizing inserts, are predicted not to hybridize by the criteria applied.

The two cosmids of Group 3 and one of Group 2, which are predicted to hybridize with the 1.5 kb cDNA probe but do not contain an insert, are an enigma. Inserts containing these sequences may have been missed by the screening of a C. elegans genomic library by Southern hybridization.

Chapter 5

Analysis of Protein Sequence and Structure of Opsin-Related Genes of C. elegans

The nucleotide sequences from 14 C. elegans cosmids have been analyzed for their similarities to the D. melanogaster opsin Rh1 cDNA and for the possibility of hybridization between these cosmid sequences and the Rh1 cDNA probe. I found that the C. elegans genome contains some regions with a similarity to a fly opsin cDNA which is high enough to trigger hybridization. Do these regions encode proteins? Do these proteins have the structural features and conserved sequences of opsins? If not, what kind of proteins are they? And what kind of relationship do these proteins have to opsin? Are there any opsin or opsin-like proteins among the sequenced proteins of C. elegans? To answer these questions, I used sequence comparison tools such as fasta and blastp to search protein sequence databases and compare those sequences with the amino acid sequence of the Rh1 opsin probe.

5.1 Analysis of protein sequences encoded by the cosmids selected for similarity to the Rh1 probe nucleotide sequence.

5.1.1 Results.

Table 5-1 includes eight of the 14 cosmids investigated in the last chapter. The other six were omitted from the table because of insufficient local sequence similarity to the fly opsin cDNA probes. Of these eight cosmid sequences predicted to hybridize with Rh1 probes, all but one (C48D5) partly encode a protein (Table 5-1, Column 4). Figures 5-1 and 5-2 show the lfasta comparisons of these seven proteins with the Rh1 (ninaE) protein sequence. None of these protein sequences has the conserved regions of opsin or the structure of seven transmembrane helices, so they are not even G-protein coupled receptors.

Table 5-1: Proteins coded by the selected C. elegans cosmid sequences

             Best matching cosmid region
             (nts from 5' end)
Cosmid       0.8 kb    1.5 kb         Corresponding                 Best blastp match in NCBI
(Group)      query     query          protein
C37C3 (1)    n.l.      same           C37C3.2 (441 a.a.)            Eukaryotic initiation factor 5 (Rat)
F58A4 (1)    n.l.      same           F58A4.1 (296 a.a.)            Cytoplasmic valine-tRNA synthetase (Yeast)
T21B6 (1)    n.l.      same           T21B6.3 (788 a.a.)            Thrombospondin (Bovine)
C37A2 (2)    n.l.      same           C37A2.1 (315 a.a.)            p62 ras-GAP associated phosphoprotein (Mouse)
C48D5 (2)    n.l.      12516-12617    Both noncoding                -
C54C6 (2)    n.l.      same           C54C6.2 (452 a.a.)            Beta tubulin (nematode)
F10E7 (3)    n.l.      23035-23149    F10E7.2 (both) (247 a.a.)     Similar to mouse finger protein (clone mkr3) (Human)
F44B9 (3)    n.l.      20056-20142    F44B9.7 (1.5 kb) (400 a.a.)   E6 protein (Human)

n.l. = not legible in the scanned source; same = same region as with the 0.8 kb query.

All above cosmids contain sequences predicted to hybridize with the query cDNA probes.

Fig. 5-1: Results of lfasta comparison between C37C3.2, F58A4.1 and T21B6.3 proteins and Rh1 (ninaE) probe protein sequence. All alignments listed in the figure are the best matching alignments of each comparison.

[Alignment rows not recoverable from the scan. The legible lfasta headers are:]

(A) ninaE sequence (373 aa) vs. (B) C37C3.2 sequence (441 aa), matrix file BLOSUM50: 18.5% identity in 178 aa overlap; init: 32, opt: 37

(A) ninaE sequence (373 aa) vs. (B) F58A4.1 sequence (296 aa), matrix file BLOSUM50: 18.0% identity in 266 aa overlap; init: 34, opt: 38

(A) ninaE sequence (373 aa) vs. (B) T21B6.3 sequence (788 aa), matrix file BLOSUM50: 21.7% identity in 217 aa overlap; init: 33, opt: 45


Fig. 5-2: Results of lfasta comparison between C37A2.1, C54C6.2, F10E7.2
and F44B9.7 proteins and Rh1 (ninaE) probe protein sequence. All
alignments listed in the figure are the best matching alignments of
each comparison.

[Fig. 5-2 alignment output; residue-level alignments are not legible in this copy. Query: ninaE sequence (373 aa); matrix file BLOSUM50.]

ninaE vs C37A2.1 (315 a.a.): 19.1% identity in 57 aa overlap; init: 32, opt: 17
ninaE vs C54C6.2 (452 a.a.), similar to tubulin: 22.6% identity in 93 aa overlap; init: 32, opt: 38
ninaE vs F10E7.2 (21609-23753), weak similarity to a C2H2-type zinc finger: 18.8% identity in 218 aa overlap; init: 33, opt: 35
ninaE vs F44B9.7 (19405-21641), 6 exons: 28.6% identity in 56 aa overlap; init: 45, opt: 52


The ncbi database was searched using blastp with the above seven encoded protein
sequences (Fig. 5-3 and Fig. 5-4). None of the proteins most similar to these C.
elegans proteins is a G-protein coupled receptor (see Table 4-1).

5.1.2 Discussion.

The lfasta comparisons (Fig. 5-1 to 5-2) showed that the best matching alignments of the
seven proteins have 20 - 30 % identity to the Rh1 protein sequence. This is approximately the
same as for the comparison of Rh1 protein with C. elegans G-protein coupled receptors (see
Section 5.2). However, these proteins have much lower lfasta similarity scores to Rh1 protein.
This is due to the shorter overlapping similar regions and fewer conserved replaceable residues.
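The distinction drawn above between percent identity and similarity score can be illustrated with a minimal sketch. It assumes a toy scoring scheme (the residue groups and the match, conservative, mismatch and gap values are hypothetical and are not the BLOSUM50 matrix used by fasta and lfasta), and the aligned strings are invented for illustration. Two alignments with the same percent identity receive different scores because one is longer and contains more conserved replacements.

```python
# Toy residue groups treated as "conserved replacements". This grouping is an
# assumption for illustration only; it is not the BLOSUM50 matrix.
GROUPS = [set("ILVM"), set("DE"), set("KR"), set("FYW"), set("ST")]


def is_conservative(x, y):
    """True if two different residues fall in the same toy group."""
    return any(x in g and y in g for g in GROUPS)


def identity_and_score(a, b, match=5, conservative=1, mismatch=-2, gap=-3):
    """Return (percent identity, toy score) for two aligned strings.

    Both strings must have the same length; '-' marks a gap column.
    Percent identity is taken over all alignment columns.
    """
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    identical = 0
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            identical += 1
            score += match
        elif is_conservative(x, y):
            score += conservative
        else:
            score += mismatch
    return 100.0 * identical / len(a), score


# Hypothetical alignments: same identity, different length and conservation.
short_hit = identity_and_score("ACDEFGHIKL", "ACPPPPPPPP")
long_hit = identity_and_score("ACDEFGHIKL" * 2, "ACEDYPPLRI" * 2)
print(short_hit, long_hit)
```

Under these assumptions both alignments are 20 % identical, yet the longer one with conserved replacements scores higher, which is the pattern the paragraph describes for the G-protein coupled receptors versus the seven cosmid-encoded proteins.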

The three proteins encoded by Group 1 cosmids (C37C3.2, F58A4.1 and T21B6.3) each
have a long similar region with high enough identity, but no feature of a G-protein coupled
receptor structure can be found in the region (Fig. 5-1). Instead, C37C3.2 is very close to rat
eukaryotic initiation factor 5, T21B6.3 is close to bovine thrombospondin, and F58A4.1 has very
weak similarity to yeast cytoplasmic valine tRNA synthetase (Table 5-1). Since these three
cosmids are predicted to hybridize with the Rh1 cDNA probe, it appears that low stringency
hybridization can occur between two genomic sequences with a high enough local similarity in a
long enough region, even though they encode totally different proteins.

The two proteins encoded by Group 2 cosmids, C37A2.1 and C54C6.2, both have
relatively short identical regions with ~ 20 % similarity to the Rh1 protein sequence. They are
clearly not G-protein coupled receptors. C54C6.2 is a typical nematode beta tubulin, and C37A2.1 is


Fig. 5-3: Most similar proteins to C37C3.2, F58A4.1 and T21B6.3 from the
blastp search. Not all alignments are shown in the figure. Only the
proteins of species other than C. elegans with the highest scores in each
blastp search are listed here.

[Fig. 5-3 blastp output (BLASTP 1.4.9MP, 26-March-1996); alignments are not legible in this copy. Entries that remain legible:]

C37C3.2 (441 letters): C. elegans entry U64857 (length 441; score 1949, Expect 2.5e-263, identities 383/441, 86%); eukaryotic initiation factor 5 (eIF-5), rat (Rattus norvegicus; length 429).
F58A4.1: hypothetical 46.1 kD protein F58A4.1 in chromosome III (length 435; score 517, Expect 1.1e-134).
T21B6.3: T21B6.3 (Caenorhabditis elegans; length 788; score 2096, identities 346/346, 100%); thrombospondin (Bos taurus; score 190, 89.5 bits; identities 31/80, 38%; positives 42/80, 52%).


Fig. 5-4: Most similar proteins to C37A2.1, C54C6.2, F10E7.2 and F44B9.7
from the blastp search. Not all alignments are shown in the figure.
Only the proteins of species other than C. elegans with the highest scores
in each blastp search are listed here.

[Fig. 5-4 blastp output (BLASTP 1.4.9MP, 26-March-1996); alignments are not legible in this copy. Database: non-redundant GenBank CDS translations + PDB + SwissProt + SPupdate + PIR. Entries that remain legible:]

C37A2.1 (315 a.a.): C. elegans entry similar to GAP-associated tyrosine phosphoprotein p62 (length 315; score 1722, 795.7 bits, Expect 2.1e-235; identities 315/315, 100%); p62 ras-GAP associated phosphoprotein, mouse (Mus musculus; length 443; score 240, 110.9 bits, Expect 1.4e-25, Sum P(3) = 1.4e-25; identities 45/125, 36%).
F10E7.2 (21609-23753; weak similarity to a C2H2-type zinc finger; 247 letters): C. elegans entry U41264 (score 457, P 1.5e-97); similar to mouse finger protein (score 78, P 0.018).
C54C6.2 (452 a.a.): C. elegans self match (score 2305, 1090.9 bits, Expect 0.0; identities 452/452, 100%); beta tubulin (Caenorhabditis elegans; length 444; score 2035, 937.6 bits, Expect 0.0); beta-tubulin (Haemonchus contortus; score 2002, 922.4 bits, Expect 3.7e-307).
F44B9.7 (19405-21641; 6 exons; 400 letters): hypothetical 45.5 kD protein F44B9.7 (Caenorhabditis elegans; length 400; score 1199, Expect 1.3e-185; identities 230/242, 95%); E6 protein (human papillomavirus; score 66, 29.1 bits, P = 0.82; identities 15/68, 22%; positives 32/68, 47%).


close to mouse GAP-associated phosphoprotein (Fig. 5-4, Table 5-1). Of the two proteins
encoded by the Group 3 cosmids which have regions with high similarity to Rh1 cDNA sequence,
F10E7.2 is more similar to Rh1 protein, but its lfasta similarity score to the Rh1 protein is still
much lower than that of the C. elegans G-protein coupled receptors (see Fig. 5-2, compare with
Fig. 5-5 and Table 5-2) because it has far fewer conserved residues. The other protein, F44B9.7,
has only a short region identical to the Rh1 protein, and also does not have any structural feature
of G-protein coupled receptors (Fig. 5-2). F10E7.2 (similar to mouse finger protein) and
F44B9.7 (similar to human E6 protein) both have only very weak similarity to their most similar
proteins (Table 5-1). Though all of the above four proteins are partly encoded by regions which
may hybridize with the whole or 5'-half of the Rh1 cDNA probe (see Section 4.3), no feature of a
G-protein coupled receptor can be found in these sequences. In summary, it appears that nucleotide
sequence similarity at low stringency is usually very different from amino acid sequence similarity.
Thus it is apparent that Southern hybridization with a heterologous opsin cDNA probe is very
unlikely to identify an opsin gene in the C. elegans genome.
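One way in which nucleotide similarity and amino acid similarity can diverge is a difference in reading frame. The sketch below illustrates only that general point and is not the analysis performed in this thesis. The DNA string is hypothetical, the codon table is the standard genetic code, and the function name `translate` is an assumption. The same nucleotide stretch (100 % identical at the DNA level) is translated in two frames and yields unrelated peptides.

```python
# Standard genetic code built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate a DNA string (A, C, G, T only) starting at offset `frame`.

    Trailing bases that do not fill a codon are ignored; '*' marks a stop.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# Hypothetical sequence, chosen only for illustration.
dna = "ATGGCTCTGATCGTTAACCTGGCTATCTCT"
frame0 = translate(dna, 0)
frame1 = translate(dna, 1)
print(frame0)
print(frame1)
```

A probe could in principle hybridize to this stretch regardless of the frame in which the genome uses it, while a protein-level comparison would see two different peptides.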

5.2 Searching for the most opsin-like protein sequences of C. elegans.

5.2.1 Results.

fasta was used with the amino acid sequence of the Rh1 probe to search for similar
sequences in a recent ACeDB protein sequence database. The 20 C. elegans protein sequences
with the highest similarity scores to the probe were selected (Fig. 5-5). Surprisingly, none of
these are encoded by the 14 C. elegans cosmid sequences selected for their similarity to the probe
at the nucleotide sequence level. Also, not one of the cosmids encoding these 20 protein


Fig. 5-5: Results of fasta search in C. elegans sequenced protein library with
Rh1 amino acid sequence as the query. Not all alignments are listed.
Only the first ten sequences and four of the other ten sequences with
sizes between 300 and 400 residues were selected for further analysis.

FASTA (3.06 Sept, 1996) function (optimized, BL50 matrix) ktup: 1
join: 43, opt: 31, gap-pen: -12/ -2, width: 32 reg.-scaled

The best scores are (GPCR = G-PROTEIN COUPLED RECEPTOR; S-W = Smith-Waterman score; % id and overlap are the identity and the aa overlap reported for each alignment):

Protein    ID        (aa)  initn  init1  opt  z-sc    expect()  S-W  % id    overlap  Annotation
C52B11.3   CE04259    692   323    142   297  339.7   3e-13     297  26.923    214    GPCR (ST. LOUIS)
F47D12.2   CE01946    245   239    110   286  332.6   7.4e-13   286  24.891    229    MUSCARINIC ACETYLCHOLINE RECEPTOR (ST
C25G6.5    CE04086    455   210    150   282  324.6   2.1e-12   282  24.834    302    GPCR (FAMILY 1)
T27D1.3    CE01678    340   223     76   271  313.4   8.6e-12   271  21.656    314    GPCR (CAMBRIDGE)
C39E6.6    CE06941    457   198    162   264  303.7   3e-11     297  24.513    359    GPCR (ST. LOUIS)
F01E11.5   CE07014    488   324    170   264  303.4   3.1e-11   264  13.789    227    GPCR (ST. LOUIS)
ZK455.3    CE03814    444   293    166   255  293.5   1.1e-10   284  24.351    308    GPCR (CAMBRIDGE)
C38C10.1   CE00104    374   259    102   244  281.7   5.1e-10   244  22.712    299    GPCR (CAMBRIDGE)
T07D4.1    CE02337    345   213    145   238  275.1   1.2e-09   238  24.121    199    GPCR (CAMBRIDGE)
F35G8.1    CE07176    399   164    104   238  274.3   1.3e-09   238  22.500    320    GPCR (ST. LOUIS)
T14E8.3    CE04958   1087   236    114   241  272.4   1.7e-09   241  27.126    247    GPCR (FAMILY 1)
T05A1.1    CE03621    373   153    153   234  270.1   2.2e-09   234  20.608    296    GPCR (CAMBRIDGE)
C50F7.1    CE04239    381   182    142   228  263.0   5.6e-09   228  22.807    342    GPCR (ST. LOUIS)
F56B6.5    CE04665    328   153    120   219  253.4   1.9e-08   219  21.505    279    GPCR (ST. LOUIS)
F14D12.6   CE04395    430   237     99   219  251.9   2.3e-08   248  25.403    248    GPCR (ST. LOUIS)
F59C12.2   CE04683    654   268    105   199  226.5   6e-07     199  25.112    223    GPCR, FAMILY 1
C56C3.1    CE04283    490   223    152   186  215.3   2.5e-06   188  30.714    140    GPCRS (FAMILY 1)
F02E8.2    CE07017    425   179    116   186  213.8   3.1e-06   186  23.529    204    GPCR (ST. LOUIS)
C48C5.1    CE04224    378   203     94   183  210.9   4.4e-06   199  21.341    328    FAMILY 1 OF GPCR
R106.2     CE07454    477   180    144   178  203.9   1.1e-05   178  19.912    226    GPCR (ST. LOUIS)


sequences is located near any of the five phage inserts in the C. elegans genome (compare Fig. 5-5
with Fig. 4-1).

Among those 20 protein sequences, I selected the first ten with the highest initn scores.
Since four of the second ten sequences have the same size range and some other features of
proteins with seven-transmembrane domains, these were also selected for further analysis.
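The selection step can be sketched as a simple filter over the ranked fasta hits. The names and lengths below are taken from Fig. 5-5; the size window of 300 to 400 residues follows the Fig. 5-5 caption. The function name `select_for_analysis` is hypothetical, and the sketch models only the size criterion, not the additional seven-transmembrane features mentioned above.

```python
# (protein, length in residues) in the rank order of the fasta output (Fig. 5-5).
HITS = [
    ("C52B11.3", 692), ("F47D12.2", 245), ("C25G6.5", 455), ("T27D1.3", 340),
    ("C39E6.6", 457), ("F01E11.5", 488), ("ZK455.3", 444), ("C38C10.1", 374),
    ("T07D4.1", 345), ("F35G8.1", 399), ("T14E8.3", 1087), ("T05A1.1", 373),
    ("C50F7.1", 381), ("F56B6.5", 328), ("F14D12.6", 430), ("F59C12.2", 654),
    ("C56C3.1", 490), ("F02E8.2", 425), ("C48C5.1", 378), ("R106.2", 477),
]


def select_for_analysis(hits, top=10, min_len=300, max_len=400):
    """Keep the first `top` hits, plus later hits whose length lies in the window."""
    first = hits[:top]
    extra = [h for h in hits[top:] if min_len <= h[1] <= max_len]
    return first + extra


selected = select_for_analysis(HITS)
extra_names = [name for name, _ in selected[10:]]
print(len(selected), extra_names)
```

With the Fig. 5-5 lengths, the four additional sequences are T05A1.1, C50F7.1, F56B6.5 and C48C5.1, which are the proteins compared in Fig. 5-8 and Fig. 5-9.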

The fasta or lfasta high scoring alignments which had features of seven-transmembrane
proteins with the Rh1 probe are shown in Fig. 5-6 to Fig. 5-9. Some important features of these
sequences are noted in Table 5-2. Using blastp with each of these 14 C. elegans protein
sequences as the query, the ncbi database was searched for their most similar proteins (Fig. 5-10
to Fig. 5-13). The proteins most similar to these 14 sequences are listed in Table 5-2, and the
preservation of certain features of G-protein coupled receptors is noted. All of these proteins are
similar to G-protein coupled receptors, and 11 of them are most similar to the same subfamily, the
peptide receptor family (Table 5-2). Almost all have the typical seven-transmembrane
hydrophobic helix structure and several of the conserved sequence regions (especially in the 5'-
half of the sequence) of the G-protein coupled receptor family. But only one of them may have
the opsin-specific lysine in the seventh transmembrane helix. The 11-cis retinal could be bound to
this residue by a Schiff's base linkage. This will be discussed below.
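The sequence features checked here (the E/D.R.Y and NPxxY motifs described in the note to Table 5-2, and a lysine in helix VII) can be located with a short pattern search. This is a minimal sketch under stated assumptions: the motifs are treated as contiguous residues, the helix VII boundaries must be supplied by the user (for example from a hydropathy analysis), and the test sequence and the function name `scan_gpcr_motifs` are hypothetical.

```python
import re

# E/D.R.Y read as contiguous residues (E or D, then R, then Y): an assumption.
DRY_MOTIF = re.compile(r"[ED]RY")
# NPxxY: N, P, any two residues, then Y.
NPXXY_MOTIF = re.compile(r"NP..Y")


def scan_gpcr_motifs(seq, helix7=None):
    """Return 0-based start positions of the two motifs in `seq`.

    If `helix7` is given as (start, end), with end exclusive, the positions
    of lysine (K) residues inside that span are reported as well.
    """
    seq = seq.upper()
    found = {
        "E/D.R.Y": [m.start() for m in DRY_MOTIF.finditer(seq)],
        "NPxxY": [m.start() for m in NPXXY_MOTIF.finditer(seq)],
    }
    if helix7 is not None:
        start, end = helix7
        found["K in helix VII"] = [i for i in range(start, end) if seq[i] == "K"]
    return found


# Hypothetical toy sequence and helix VII span, for illustration only.
toy = "MAAAADRYLLLLKAANPLIYGGG"
result = scan_gpcr_motifs(toy, helix7=(9, 20))
print(result)
```

A lysine found inside the supplied helix VII span is only a candidate for the retinal-binding residue; whether it aligns with the opsin lysine still has to be judged from the alignment with Rh1.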

5.2.2 Discussion:

These 14 protein sequences have almost the same structural features and similar conserved
regions, but none appears to be an opsin. In fact, they are more similar to each other than to the


Table 5-2: Proteins of C. elegans most similar in amino acid sequence to the Rh1 probe

Columns in the original table: protein sequence; fasta score; structural features (normal 7-transmembrane helix structure; N in N-terminus for glycosylation; cysteine for disulfide bridges; C in C-terminus for palmitoylation; K in helix VII for retinal binding); most similar protein; blastp score.

Protein    fasta score  Normal 7-transmembrane helix structure                       Most similar protein
C52B11.3   297          Yes                                                          Dopamine receptor (Drosophila)
F47D12.2   286          Only 6 helices, no helix VII; helix VI is short              Muscarinic acetylcholine receptor M3 (Pig)
C25G6.5    282          Yes                                                          Neuropeptide Y receptor 4 (Human)
T27D1.3    271          Yes                                                          Substance P receptor (Rana catesbeiana)
C39E6.6    264          Yes                                                          Neuropeptide Y receptor 4 (Human)
F01E11.5   264          Yes                                                          Octopamine receptor (Drosophila)
ZK455.3    255          Yes                                                          Galanin receptor (Rat)
C38C10.1   244          Yes, but E/D.R.Y becomes D.R.C                               Substance P receptor (Rana catesbeiana)
T07D4.1    238          Yes, but E/D.R.Y becomes E.K.Y, and no NPxxY                 Mu-type opioid receptor (Rat)
F35G8.1    238          Yes                                                          Lymnokinin receptor (Lymnaea stagnalis)
T05A1.1    234          Yes, but helix VI is short, and no NPxxY                     Neuropeptide Y receptor (Human)
C50F7.1    228          Yes                                                          Substance P receptor (Human)
F56B6.5    219          Yes, but helix I is very short, and the first residue        Somatostatin receptor type 5
                        is in helix I (no N-terminus)
C48C5.1    183          Yes                                                          Growth hormone secretagogue receptor type

The fasta scores are the opt values of Fig. 5-5. For C52B11.3 the remaining columns read: N in N-terminus for glycosylation, 2; cysteine for disulfide bridges, between loop e1 and e2, and in e3; C in C-terminus for palmitoylation, 1; K in helix VII for retinal binding, none; blastp score, 306. For the other rows the cysteine column reads "between loop e1 and e2" (one row also "and in e3"); the remaining entries of the glycosylation, palmitoylation, lysine and blastp score columns cannot be assigned to rows in this copy.

The first ten sequences, and four of the second ten sequences with the size 300 - 400 amino acid residues, were selected from fasta results with Rh1 amino acid sequence as the query. The optimum fasta score is given for the comparison in the 2nd column. The most similar protein in the ncbi database is given with the blastp score. Normal seven-transmembrane helix structure includes the E/D.R.Y sequence at the border of helix III and loop c2, and the NPxxY sequence in helix VII ("xx" can be any amino acid, but usually are I, V, L and Y).


Fig. 5-6: Results of fasta or lfasta comparison between C52B11.3, F47D12.2,
C25G6.5 and T27D1.3 proteins and Rh1 (ninaE) probe protein sequence.
All alignments listed in the figure are the best matching alignments
of each comparison. Seven-transmembrane helices are boxed.


Fig. 5-7: Results of fasta or lfasta comparison between C39E6.6, F01E11.5,
ZK455.3 and C38C10.1 proteins and Rh1 (ninaE) probe protein sequence.
All alignments listed in the figure are the best matching alignments of
each comparison. Seven-transmembrane helices are boxed.



Fig. 5-8: Results of fasta or lfasta comparison between T07D4.1, F35G8.1,
T05A1.1 and C50F7.1 proteins and Rh1 (ninaE) probe protein sequence.
All alignments listed in the figure are the best matching alignments of
each comparison. Seven-transmembrane helices are boxed.


Fig. 5-9: Results of fasta or lfasta comparison between F56B6.5 and C48C5.1
proteins and Rh1 (ninaE) probe protein sequence. All alignments listed
in the figure are the best matching alignments of each comparison.
Seven-transmembrane helices are boxed.



Fig. 5-10: Most similar proteins to C52B11.3, F47D12.2, C25G6.5 and T27D1.3
from the blastp search. Not all alignments are shown in the figure. Only
the proteins of species other than C. elegans with the highest blastp scores
in each blastp search are listed here.

[Fig. 5-10 blastp output (BLASTP 1.4.9MP, 26-March-1996); alignments are not legible in this copy. Highest-scoring hits from species other than C. elegans that remain legible:]

C52B11.3 (692 letters): dopamine D1 receptor (Drosophila melanogaster); Expect 6.1e-83.
F47D12.2 (245 letters): muscarinic acetylcholine receptor M3, glandular (pig, Sus scrofa; length 590); score 223 (105.6 bits), Expect 4.9e-38; identities 45/142 (31%), positives 77/142 (54%).
C25G6.5 (455 letters): neuropeptide Y receptor type 4, also listed as pancreatic polypeptide receptor PP1 (Homo sapiens; length 375); score 221.
T27D1.3 (340 letters): substance P receptor (Rana catesbeiana); score 110; identities 25/80 (31%), positives 41/80 (51%).


Fig. 5-11: Most similar proteins to C39E6.6, F01E11.5, ZK455.3 and C38C10.1 from the blastp search. Not all alignments are shown in the figure. Only the proteins of species other than C. elegans with the highest blastp scores in each blastp search are listed here.


Fig. 5-12: Most similar proteins to T07D4.1, F35G8.1, T05A1.1 and C50F7.1 from the blastp search. Not all alignments are shown in the figure. Only the proteins of species other than C. elegans with the highest blastp scores in each blastp search are listed here.


Fig. 5-13: Most similar proteins to F56B6.5 and C48C5.1 from the blastp search. Not all alignments are shown in the figure. Only the proteins of species other than C. elegans with the highest blastp scores in each blastp search are listed here.


[Figure residue: BLASTP 1.4.9MP search output. For query F56B6.5, the legible hits are a C. elegans protein similar to family 1 of G-protein coupled receptors (score 1741), rat somatostatin receptor type 5 (SSR5_RAT, Rattus norvegicus, length 363; score 193) and a C. elegans protein similar to somatostatin receptors (length 477). For query C48C5.1 (373 letters), the legible hit is a C. elegans protein similar to family 1 of G-protein coupled receptors (length 373). The remaining scores and alignments are not legible in the scan.]


Rh1 probe protein (see Figs. 5-6 to 5-13; also see Table 5-2).

Nine of these 14 proteins (all except F47D12.2, C38C10.1, T07D4.1, T05A1.1 and F56B6.5) have the typical structure of the G-protein coupled receptor family. This includes seven transmembrane helices, glycosylated asparagine(s) in the N-terminus, cysteine disulfide bridge(s) between loops e1 and e2 (there may be another one in loop e3), an E/D-R-Y sequence at the border of helix III and loop c2, and an NPxxY sequence in helix VII (Watson and Arkinstall, 1994). Although typical of the large family, none of these nine proteins has the lysine in helix VII that binds retinal in opsins. Five of them may be the C. elegans receptors for dopamine, octopamine, galanin, and neuropeptide Y. The other four, F35G8.1 (similar to lymnokinin receptor), C48C5.1 (similar to growth hormone secretagogue receptor), T27D1.3 and C50F7.1 (both similar to substance P receptor), all have low blastp scores. They may be the C. elegans counterparts of their most similar proteins, but this cannot be confirmed from similarity alone (Table 5-2).
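The two sequence motifs named above can be searched for with a short script. The following is a minimal sketch, not the procedure used in this thesis: the function name `find_motifs` and the toy sequence are hypothetical, and a motif hit only shows that the residues occur somewhere in the sequence. Deciding whether a hit lies at the helix III border or in helix VII still requires a transmembrane topology assignment.

```python
import re

# Motif patterns taken from the features listed in the text.
# E/D-R-Y: glutamate or aspartate, then arginine, then tyrosine.
# NPxxY: asparagine, proline, any two residues, then tyrosine.
MOTIFS = {
    "E/D-R-Y": re.compile(r"[ED]RY"),
    "NPxxY": re.compile(r"NP..Y"),
}


def find_motifs(protein):
    """Return the 1-based start positions of each motif in a protein sequence."""
    seq = protein.upper()
    return {
        name: [match.start() + 1 for match in pattern.finditer(seq)]
        for name, pattern in MOTIFS.items()
    }


# Hypothetical toy sequence, not a real C. elegans protein.
toy = "MKTLLDRYAVVCKPLNPLLYAF"
print(find_motifs(toy))
```

A sequence in which D-R-Y has been replaced by D-R-C, as described for C38C10.1, would return an empty list for the E/D-R-Y motif.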

Four of the remaining five proteins (see Table 5-2) have obvious structural characteristics of G-protein coupled receptors, but are incomplete. F47D12.2 has only six helices (I - VI), no asparagine in the N-terminus for glycosylation, and its helix VI is short (<20 a.a.). C38C10.1 also has no asparagine in the N-terminus, and its D-R-Y sequence is replaced by D-R-C. F56B6.5 begins with a very short helix I and has no N-terminus. T05A1.1 has no NPxxY sequence in helix VII and its helix VI is a little short. All four of these proteins may be G-protein coupled receptors of C. elegans with the same or very similar ligands as their most similar proteins (acetylcholine, substance P, somatostatin and neuropeptide Y) (Table 5-2). None of these four receptors has a lysine residue in its helix VII.


T07D4.1 is special because it has three lysines in its helix VII. It is possible that any one of these may bind 11-cis retinal as in opsin. But several important, perhaps necessary, residues are missing from the retinal binding pocket, such as an E(D) at the border of loop e1 and helix III, and an F and W in helix VI. Thus it looks like there is no pocket for retinal binding (Sakmar, 1994; Sakmar et al., 1989). Also, replacing the R residue in the E-R-Y sequence with K nearly abolishes G-protein activation (Fahmy and Sakmar, 1993). In addition, T07D4.1 has no NPxxY sequence, which is found in helix VII of all known G-protein coupled receptors. All these differences suggest that T07D4.1 would be an atypical G-protein coupled receptor, but may not be an opsin of C. elegans. Its most similar protein is an opioid receptor of the rat, with a relatively low similarity score (Table 5-2, Fig. 5-12).

The above 14 C. elegans proteins, identified by similarity to the Rh1 amino acid sequence, have all been shown to be typical G-protein coupled receptors of C. elegans except T07D4.1, which may be a G-protein coupled receptor-like protein.

5.3 Conclusions.

All the proteins of C. elegans most similar to Drosophila opsin Rh1 are G-protein coupled receptors or a G-protein coupled receptor-like protein, but probably not opsins.

None of the proteins encoded by the C. elegans cosmids selected in Chapter 4 is even a G-protein coupled receptor.

It appears that there is no opsin among the C. elegans proteins sequenced to date.


Chapter 6

General Discussion


The main objective of this thesis was to analyze 14 selected cosmids for their likelihood of hybridization, by selecting alignments generated by a modified lfasta comparison and applying the minimum criteria from hybridization theory. Since I correctly identified the cosmids which should contain hybridizing sequences and located the most likely hybridizing region in these cosmid sequences, and also correctly identified cosmids expected to have no hybridizing sequence, the selection and analysis process appears to be successful. Although three cosmids not containing hybridizing insert sequences were predicted to hybridize, this can be easily explained as due to one of several common problems with genomic library screening by Southern hybridization. For example, the library may not have been complete, the stringency of the screen may have been too high, or the clone may have been missed in screening purely by chance.

My results appear to confirm that hybridization is related to the local sequence similarity of the hybridizing region of the two strands involved, not the overall similarity between the two strands. My minimum criteria for hybridization also appear to be correct, at least for the probe-target sequences tested in this thesis. The equation relating melting temperature to D and P, on which the criteria are based, is an empirical one and was fitted to conditions somewhat different from the current application, Southern hybridization to a target sequence bound to a filter. I expect that the newly available, completely sequenced genomes will allow further refinement of the minimum criteria by similar tests.
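To illustrate how an empirical melting-temperature relation of this kind behaves, the sketch below evaluates one commonly quoted form for DNA duplexes. The equation used in this thesis is not reproduced in this chapter, so the form, its coefficients, the reading of D as duplex length in base pairs and P as percent mismatch, and the function name `estimate_tm` are all assumptions made for illustration.

```python
import math


def estimate_tm(gc_percent, duplex_length, mismatch_percent, na_molar=1.0):
    """Estimate a duplex melting temperature in degrees Celsius.

    Assumed empirical form (an illustration, not taken from this chapter):
        Tm = 81.5 + 16.6*log10(M / (1 + 0.7*M)) + 0.41*(%G+C) - 500/D - P
    where M is the monovalent cation concentration (mol/L), D is the duplex
    length in base pairs and P is the percent mismatch.
    """
    if duplex_length <= 0:
        raise ValueError("duplex_length must be positive")
    if na_molar <= 0:
        raise ValueError("na_molar must be positive")
    salt_term = 16.6 * math.log10(na_molar / (1.0 + 0.7 * na_molar))
    return (81.5 + salt_term + 0.41 * gc_percent
            - 500.0 / duplex_length - mismatch_percent)


# Hypothetical comparison: a short perfect duplex versus a longer duplex
# with 20% mismatch, both at 50% G+C.
short_perfect = estimate_tm(gc_percent=50, duplex_length=20, mismatch_percent=0)
long_mismatched = estimate_tm(gc_percent=50, duplex_length=200, mismatch_percent=20)
print(short_perfect, long_mismatched)
```

Under this assumed form, the length term and the mismatch term trade off against each other, which is why a criterion built on it constrains both the length and the identity of the local hybridizing region rather than the overall similarity of the two strands.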


My results also demonstrate that a simple sequence similarity search based only on the existing scoring methods is not reliable for predicting hybridization, even when modified to mimic the hybridization conditions of no gaps and a higher G-C binding score. The modified fasta program did not correctly select the alignments most likely to hybridize. It simply chose the highest-scoring (init1, initn) alignment of each comparison. Most of these alignments did not satisfy the minimum length criterion for hybridization. Thus a high similarity score is not equivalent to a high likelihood of hybridization. The HSP scores of blastn have the same problem. Since two of the Group 2 cosmids with low HSP scores are predicted to hybridize, but three of the Group 3 cosmids with high HSP scores are not, HSP scores are not a reliable indicator of the likelihood of hybridization. Both the init1 and HSP scoring algorithms often favor a short but highly identical alignment over a longer alignment with a lower score that is more likely to hybridize.
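The contrast between ranking by score and selecting by criteria can be shown with a toy example. This is a minimal sketch under stated assumptions: the score values, the thresholds `min_length=50` and `min_identity=70.0`, and all function names are hypothetical and are not the parameters of the modified fasta program or the minimum criteria of this thesis.

```python
def ungapped_score(a, b, gc_match=5, at_match=4, mismatch=-3):
    """Score two equal-length aligned strands with no gaps.

    G-C matches score higher than A-T matches (hypothetical values).
    """
    if len(a) != len(b):
        raise ValueError("strands must have equal length")
    score = 0
    for x, y in zip(a.upper(), b.upper()):
        if x != y:
            score += mismatch
        elif x in "GC":
            score += gc_match
        else:
            score += at_match
    return score


def percent_identity(a, b):
    """Percentage of aligned positions that are identical."""
    if len(a) != len(b) or not a:
        raise ValueError("strands must be non-empty and of equal length")
    same = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * same / len(a)


def meets_criteria(a, b, min_length=50, min_identity=70.0):
    """Apply hypothetical minimum length and identity thresholds."""
    return len(a) >= min_length and percent_identity(a, b) >= min_identity


# A short, perfectly matched G-C rich alignment (30 positions).
short_a = short_b = "GC" * 15
# A longer A-T rich alignment (60 positions) with one mismatch in every four.
long_a, long_b = "ATAT" * 15, "ATAC" * 15

for name, a, b in [("short", short_a, short_b), ("long", long_a, long_b)]:
    print(name, ungapped_score(a, b), percent_identity(a, b), meets_criteria(a, b))
```

In this toy case the short alignment receives the higher score but fails the length threshold, while the longer alignment scores lower yet passes both thresholds, which mirrors the behavior described for the init1 and HSP scores.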

Since eight regions in the C. elegans genome are similar enough to D. melanogaster opsin cDNA to hybridize, it is possible that one or more of these regions may encode an opsin-like photoreceptor. However, the predicted amino acid sequences encoded by these eight regions had no feature even remotely similar to opsin or the G-protein coupled receptor family. A blastp search confirmed that these proteins are most similar to other kinds of protein.

I used fasta to search the C. elegans protein sequence library with the D. melanogaster opsin protein sequence as the query. The 14 most similar protein sequences all resemble G-protein coupled receptors, classified into several different receptor groups. All but one of them lack the covalent retinal binding site, a lysine (K) residue in transmembrane helix VII. T07D4.1 has three lysines in its helix VII, and it is possible that any one of these may bind 11-cis retinal. But several important residues are missing from the retinal binding pocket, such as an E(D) at the border of loop e1 and helix III, and an F and W in helix VI. Thus it looks like there is no pocket for retinal binding (Sakmar, 1994; Sakmar et al., 1989). Further analysis suggests that T07D4.1 would be an atypical G-protein coupled receptor of C. elegans, but may not be an opsin.


The methodology I developed and tested is just a beginning. In my opinion, if the effects of other conditions of nucleic acid hybridization can all be mimicked by additional computer algorithms which can be combined with fasta scoring procedures, the computer methods can entirely mimic a real hybridization process. The criteria should be incorporated into a scoring system. If the resulting alignment scores higher than a threshold, hybridization will be predicted to occur between the two compared nucleotide strands. It should be possible to fine-tune the scoring by adjusting the parameters of fasta and the additional algorithms.

The results of this thesis not only identify a method for correctly predicting whether two chosen strands would hybridize, they also demonstrate several ways in which the method would be useful. First, the modified lfasta comparison and selection criteria make it easy to identify the local region and alignment most likely to hybridize. Second, they easily identify which of the cosmids neighboring the genomic location of an insert contain the hybridizing sequences. Third, they can also help in selecting a probe for hybridization experiments.
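Locating the local region most likely to hybridize can be pictured as a search along an ungapped alignment for the longest stretch that still meets an identity threshold. The sketch below is an independent illustration and not the lfasta algorithm: the function name `longest_window` and the default threshold of 70% identity are hypothetical.

```python
def longest_window(a, b, min_identity=70.0):
    """Find the longest ungapped window with identity >= min_identity.

    Takes two equal-length aligned strands and returns (start, end) as a
    0-based half-open interval, or None if no window qualifies. When several
    windows share the longest length, the leftmost one is returned.
    """
    if len(a) != len(b):
        raise ValueError("strands must have equal length")
    matches = [x == y for x, y in zip(a.upper(), b.upper())]
    # prefix[i] holds the number of matching positions in the first i columns.
    prefix = [0]
    for is_match in matches:
        prefix.append(prefix[-1] + is_match)
    n = len(matches)
    best = None
    for start in range(n):
        # Try the longest candidate first and shrink from the right.
        for end in range(n, start, -1):
            length = end - start
            if best is not None and length <= best[1] - best[0]:
                break  # no window from this start can beat the current best
            if 100.0 * (prefix[end] - prefix[start]) / length >= min_identity:
                best = (start, end)
                break
    return best


# Hypothetical toy strands: a matching core flanked by mismatches.
print(longest_window("TTAAAATT", "GGAAAAGG", min_identity=100.0))
```

A window returned by such a search could then be tested against a minimum length criterion before a region is reported as likely to hybridize.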

Computer programs can be used not only as tools for calculation, data analysis and information storage, but also as a substitute for time-consuming, expensive experimentation. This requires computer programs which incorporate theoretical and empirical relationships and work with computer databases. This thesis demonstrates one application which can reliably mimic hybridization experiments.


References:

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403-410.

Archer, S. N., Lythgoe, J. N. and Hall, L. (1992). Rod opsin cDNA sequence from the sand goby (Pomatoschistus minutus) compared with those of other vertebrates. Proc. R. Soc. Lond. B, 19-25.

Bargmann, C. I. and Mori, I. (1997). Chemotaxis and thermotaxis. In C. elegans II. Riddle, D. L., Blumenthal, T., Meyer, B. J. and Priess, J. R., eds., Cold Spring Harbor Laboratory Press, USA, 717-737.

Bargmann, C. I., Hartwieg, E. and Horvitz, H. R. (1993). Odorant-selective genes and neurons mediate olfaction in C. elegans. Cell 74, 515-527.

Britten, R. J. and Davidson, E. H. (1985). Hybridization strategy. In Nucleic Acid Hybridization - a Practical Approach. Hames, B. D. and Higgins, S. J., eds., IRL Press Limited, UK, 3-14.

Burr, A. H. (1985). The photomovement of Caenorhabditis elegans, a nematode which lacks ocelli. Proof that the response is to light not radiant heating. Photochem. Photobiol. 41, 577-582.

Coulson, A. and the C. elegans genome consortium (1997). The C. elegans genome sequencing project. 11th International C. elegans Meeting Abstracts, 113.

Denhardt, D. T. (1966). A membrane-filter technique for the detection of complementary DNA. Biochem. Biophys. Res. Commun. 23, 641-646.

Fahmy, K. and Sakmar, T. P. (1993). Regulation of the rhodopsin-transducin interaction by a highly conserved carboxylic acid group. Biochemistry 32, 7229-7236.

Hargrave, P. A. and McDowell, J. H. (1992). Rhodopsin and phototransduction: a model system for G protein-linked receptors. FASEB J. 6, 2323-2331.

Iismaa, T. P., Biden, T. J. and Shine, J. (1995). G Protein-coupled Receptors. R. G. Landes Company, USA, 1-43.


Karlin, S. and Altschul, S. F. (1993). Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Nat. Acad. Sci. USA 90, 5873-5877.

Karlin, S. and Altschul, S. F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Nat. Acad. Sci. USA 87, 2264-2268.

Lech, K. and Brent, R. (1988). Media preparation and bacteriological tools. In Current Protocols in Molecular Biology. Ausubel, F. M., Brent, R., Kingston, R. E., Moore, D. D., Seidman, J. G., Smith, J. A. and Struhl, K., eds., John Wiley & Sons, Inc., USA, 1.1.1-1.1.6.

Lipman, D. J. and Pearson, W. R. (1985). Rapid and sensitive protein similarity searches. Science 227, 1435-1441.

Moore, D. D. (1996). Commonly used reagents and equipment. In Current Protocols in Molecular Biology. Ausubel, F. M., Brent, R., Kingston, R. E., Moore, D. D., Seidman, J. G., Smith, J. A. and Struhl, K., eds., John Wiley & Sons, Inc., USA, A.2.1-A.2.8.

Nathans, J. (1992). Rhodopsin: structure, function and genetics. Biochemistry 31, 4923-4931.

O'Tousa, J. E., Baehr, W., Martin, R. L., Hirsh, J., Pak, W. L. and Applebury, M. L. (1985). The Drosophila ninaE gene encodes an opsin. Cell 40, 839-850.

Oudet, P. and Schatz, C. (1985). Electron microscopic visualisation of nucleic acid hybrids. In Nucleic Acid Hybridization - a Practical Approach. Hames, B. D. and Higgins, S. J., eds., IRL Press Limited, UK, 161-178.

Pearson, W. R. (1990). Rapid and sensitive sequence comparison with FASTP and FASTA. Methods in Enzymology 183, 63-98.

Pearson, W. R. and Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc. Nat. Acad. Sci. USA 85, 2444-2448.

Riddle, D. L., Blumenthal, T., Meyer, B. J. and Priess, J. R. (1997). Introduction to C. elegans. In C. elegans II. Riddle, D. L., Blumenthal, T., Meyer, B. J. and Priess, J. R., eds., Cold Spring Harbor Laboratory Press, USA, 1-22.

Sakmar, T. P. (1994). Opsins. In Handbook of Receptors and Channels - G protein-coupled receptors. Peroutka, S. J., ed., CRC Press Inc., USA, 257-271.


Sakmar, T. P., Franke, R. R. and Khorana, H. G. (1989). Glutamic acid-113 serves as the retinylidene Schiff base counterion in bovine rhodopsin. Proc. Nat. Acad. Sci. USA 86, 8309-8313.

Sengupta, P., Chou, J. H. and Bargmann, C. I. (1996). odr-10 encodes a seven transmembrane domain olfactory receptor required for responses to the odorant diacetyl. Cell 84, 899-909.

Shuang-Young, X. (1986). A rapid method for preparing phage λ DNA from agar plate lysates. Gene Anal. Techn. 3, 90-91.

Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.

Smith, W. C., Price, D. A., Greenberg, R. M. and Battelle, B. A. (1993). Opsins from the lateral eyes and ocelli of the horseshoe crab, Limulus polyphemus. Proc. Nat. Acad. Sci. USA 90, 6150-6154.

Southern, E. M. (1985). Introduction. In Nucleic Acid Hybridization - a Practical Approach. Hames, B. D. and Higgins, S. J., eds., IRL Press Limited, UK, 1-2.

Sulston, J. and Hodgkin, J. (1988). Methods. In The Nematode Caenorhabditis elegans. Wood, W. B. and the Community of C. elegans Researchers, eds., Cold Spring Harbor Laboratory, USA, 587-606.

Troemel, E. R., Chou, J. H., Dwyer, N. D., Colbert, H. A. and Bargmann, C. I. (1995). Divergent seven transmembrane receptors are candidate chemosensory receptors in C. elegans. Cell 83, 207-218.

Waterston, R. H., Sulston, J. E. and Coulson, A. R. (1997). The genome. In C. elegans II. Riddle, D. L., Blumenthal, T., Meyer, B. J. and Priess, J. R., eds., Cold Spring Harbor Laboratory Press, USA, 23-45.

Watson, S. and Arkinstall, S. (1994). The G-protein Linked Receptor FactsBook. Academic Press Inc., USA, 2-294.

Wetmur, J. G. (1991). DNA probes: applications of the principles of nucleic acid hybridization. Critical Reviews in Biochemistry and Molecular Biology 26 (3/4), 227-259.


Wood, W. B. (1988). Introduction to C. elegans biology. In The Nematode Caenorhabditis elegans. Wood, W. B. and the Community of C. elegans Researchers, eds., Cold Spring Harbor Laboratory, USA, 1-16.

Zuker, C. S., Cowman, A. F. and Rubin, G. M. (1985). Isolation and structure of a rhodopsin gene from D. melanogaster. Cell 40, 851-858.