Information resources in molecular biology
bio.lundberg.gu.se/courses/ht06/bio1/infores.pdf · 2011....


  • 1

    Problem of sequence alignment - interaction between molecular biology / computer science / statistics

    * What biological problems are addressed?
    * Algorithm (dynamic programming)
    * 'Simple' implementation / source code / compiling
    * Common implementations in molecular biology software packages
    * Statistics and probability theory of alignments

    INFORMATION RESOURCES IN MOLECULAR BIOLOGY

    DNA and protein sequence databases / Genome projects
    Entrez search tool
    Protein classification databases
    PROSITE & others
    Gene ontology
    3D structures (protein, DNA and RNA)
    OMIM - genetic disorders
    Taxonomy - classification of organisms
    PubMed - biomedical articles

    Most of these databases are primary , i.e. they contain original submissions by experimentalists / content controlled by the submitter

    Protein classification databases are examples of secondary databases, i.e. they are the result of analysis of data in primary databases / content controlled by third party

  • 2

    Common tools :

    Entrez (www.ncbi.nlm.nih.gov)

    SRS (srs.ebi.ac.uk)

    Genome browsers

    • UCSC genome.ucsc.edu
    • EBI www.ensembl.org
    • NCBI www.ncbi.nlm.nih.gov

    Margaret Dayhoff

    The early days of sequence databases

  • 3

    First whole genome sequenced 1977

    Bacteriophage phi-X174 (5375 bases)

    Sanger F et al. The nucleotide sequence of bacteriophage phi-X174. Journal of Molecular Biology 125: 225-46

  • 4

    Genome sequencing using a shotgun approach

    First genomes of free-living organisms to be sequenced

    1995, TIGR (www.tigr.org)

    Haemophilus influenzae 1.83 MB
    Mycoplasma genitalium 0.58 MB

  • 5

    Map of Mycoplasma genitalium genome.

    M. genitalium is a 'minimal' living organism
    ~ 500 genes organized in a compact genome

    www.tigr.org comprehensive listing of all microorganisms that have been sequenced

  • 6

    Eukaryotic genome projects

    [Figure labels: Fungi, Microsporidia, Insects, Vertebrates, Plants, Stramenopiles, Trypanosomatids, Alveolata]

    Fungi and protozoa

    Fungi
    * Saccharomyces cerevisiae
    * Schizosaccharomyces pombe

    Protozoa
    * Plasmodium falciparum (malaria parasite)
    * Dictyostelium discoideum

    (Protozoa form a large group of eukaryotes that are usually single-celled and microscopic)

  • 7

    Plants                      MB
    Arabidopsis                 125
    Rice (Oryza sativa)         431
    Populus trichocarpa         520

    Animal genome projects

    [Figure: phylogenetic tree of the Metazoa. Only the labels survive in this transcript.]

    Groups labelled: Choanoflagellates, Metazoa, Porifera, Eumetazoa, Cnidaria, Anthozoa, Hydrozoa, Bilateria, Acoelomata, Pseudocoelomata, Coelomata, Nematoda, thorny-headed worms, flatworms, Protostomia, Deuterostomia, Arthropoda, Crustacea, Hexapoda (insects), Echinodermata, Hemichordata, Chaetognatha, Chordata, Cephalochordata, Urochordata, Craniata, Hyperotreti, Vertebrates, Primates

    Organisms labelled: Reniera (sponge), Nematostella (sea anemone), Hydra, Schistosoma, C. elegans, Daphnia (water flea), Drosophila, Anopheles, Apis, Strongylocentrotus (sea urchin), Branchiostoma (lancelet), Ciona (sea squirt), hagfish, Fugu, Tetraodon, Danio, Xenopus, chicken, opossum, cow, elephant, dog, mouse, rat

  • 8

    Animals

    Caenorhabditis elegans ('The worm')
    Drosophila melanogaster ('The fly')
    Ciona intestinalis (sea squirt)

  • 9

    Fish

    Zebrafish Danio rerio

  • 10

    The frog Xenopus tropicalis

    Chicken (Gallus gallus)

  • 11

    Mus musculus

    Rattus norvegicus

    Mammals

    Dog (Canis familiaris)

    Tasha, female boxer dog

    Broad Institute (formerly WICGR)

    Cow (Bos taurus)

  • 12

    Homo sapiens / Pan troglodytes (chimpanzee)

    ‘Complete’ genome sequences

    Bacteria 296
    Archaea 23
    Eukaryotes:
      Fungi ~20
      Protozoa 2
      Insects 5
      Plants 3
      C. intestinalis
      Fish 5
      Frog
      Chicken
      Man
      Chimpanzee
      Other mammals ~10

  • 13

    Organism                           Genome size (MB)   Genes
    Bacteria                           0.6 - 7.5          500-7,000
    S. cerevisiae                      12                 6,000
    S. pombe                           13                 6,000
    Worm, Caenorhabditis elegans       97                 20,000
    Fly, Drosophila melanogaster       120                14,000
    Plant, Arabidopsis thaliana        110                26,000
    Fish, Fugu rubripes                365                22,000
    Mus musculus                       3000               24,000
    H. sapiens                         3200               23,000
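The genome sizes and gene counts in this table imply very different gene densities. A small sketch of that arithmetic, using the slide's rounded figures (the names `GENOMES` and `genes_per_mb` are illustrative only and do not come from the slides):

```python
# Genome size (MB) and approximate gene count, as listed on the slide.
GENOMES = {
    "S. cerevisiae": (12, 6000),
    "S. pombe": (13, 6000),
    "C. elegans": (97, 20000),
    "D. melanogaster": (120, 14000),
    "A. thaliana": (110, 26000),
    "F. rubripes": (365, 22000),
    "M. musculus": (3000, 24000),
    "H. sapiens": (3200, 23000),
}


def genes_per_mb(size_mb, genes):
    """Average number of genes per megabase of genome."""
    return genes / size_mb


if __name__ == "__main__":
    for name, (size_mb, genes) in GENOMES.items():
        print(f"{name:16s} {genes_per_mb(size_mb, genes):7.1f} genes/MB")
```

With these figures, yeast carries roughly 500 genes per MB while the human genome carries fewer than 10, which is one way to see how much of a large genome is non-coding.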

    Complexity of biosphere: a lot of DNA left to be sequenced ...

                Species        Sequenced species   Individuals    Genome size
    Bacteria    ??*            ~300                ??             5 x 10^6
    Insects     2 million ?    ~5                  10^18 ?        10^8
    Man         1              1                   6.5 x 10^9     3 x 10^9

    * One liter of sea water: 10,000-20,000 different bacterial species. Sogin ML, et al. Microbial diversity in the deep sea and the underexplored "rare biosphere". Proc Natl Acad Sci U S A. 2006 Aug 8;103(32):12115-20.

  • 14

    Genome data is publicly available

    Genomes are available for browsing:
    ENSEMBL www.ensembl.org
    UCSC genome.ucsc.edu
    NCBI www.ncbi.nlm.nih.gov

  • 15

    The UCSC genome browser

    This morning the EMBL Database contained 147,059,624,968 nucleotides in 80,603,682 entries

  • 16

    DDBJ (Japan)
    GenBank (NCBI, NIH, US)
    EMBL (EBI, UK)

    EMBL and Genbank formats

    EMBL format

    ID   LISOD      standard; DNA; PRO; 756 BP.
    XX
    AC   X64011; S78972;
    XX
    SV   X64011.1
    XX
    DT   28-APR-1992 (Rel. 31, Created)
    DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
    XX
    DE   L.ivanovii sod gene for superoxide dismutase
    XX
    KW   sod gene; superoxide dismutase.
    XX
    OS   Listeria ivanovii
    OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
    OC   Bacillus/Staphylococcus group; Listeria.
    XX
    RN   [1]
    RX   MEDLINE; 92140371.
    RA   Haas A., Goebel W.;
    RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
    RT   functional complementation in Escherichia coli and characterization of the
    RT   gene product.";
    RL   Mol. Gen. Genet. 231:313-322(1992).
    XX
    RN   [2]
    RP   1-756
    RA   Kreft J.;
    RT   ;
    RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
    RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
    RL   Hubland, 8700 Wuerzburg, FRG
    XX
    DR   SWISS-PROT; P28763; SODM_LISIV.
    XX

  • 17

    FH   Key             Location/Qualifiers
    FH
    FT   source          1..756
    FT                   /db_xref="taxon:1638"
    FT                   /organism="Listeria ivanovii"
    FT                   /strain="ATCC 19119"
    FT   RBS             95..100
    FT                   /gene="sod"
    FT   terminator      723..746
    FT                   /gene="sod"
    FT   CDS             109..717
    FT                   /db_xref="SWISS-PROT:P28763"
    FT                   /transl_table=11
    FT                   /gene="sod"
    FT                   /EC_number="1.15.1.1"
    FT                   /product="superoxide dismutase"
    FT                   /protein_id="CAA45406.1"
    FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
    FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
    FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
    FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
    XX
    SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;

         cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
         gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
         ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
         gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca       240
         ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt       300
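Because every line of an EMBL entry starts with a two-letter line code (ID, AC, DE, OS, ...), the header can be read with very little code. The following is a minimal sketch, not a full EMBL parser: the function name `parse_embl_header` is made up for illustration, only a few lines of the LISOD entry are included, and repeated codes are simply joined with a space.

```python
from collections import defaultdict

# A few header lines copied from the LISOD entry shown on the slide.
ENTRY = """\
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
"""


def parse_embl_header(text):
    """Group header lines by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        if code == "XX" or not value:
            continue  # XX lines are spacers without content
        fields[code].append(value)
    # Continuation lines (same code repeated) are joined with a space.
    return {code: " ".join(values) for code, values in fields.items()}


record = parse_embl_header(ENTRY)
accessions = [a.strip() for a in record["AC"].split(";") if a.strip()]
print(accessions)    # accession numbers of the entry
print(record["OS"])  # organism
```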

    3.2.4 Feature key examples

    Key Description

    conflict        Separate determinations of the "same" sequence differ
    rep_origin      Origin of replication
    protein_bind    Protein binding site on DNA
    CDS             Protein-coding sequence
    misc_RNA        Generic label for an undefined RNA
    insertion_seq   Insertion element
    D-loop          Mitochondrial or other D-loop structure

    3.3.4 Qualifier examples

    Key Location/Qualifiers

    CDS             86..742
                    /product="hypoxanthine phosphoribosyltransferase"
                    /label=hprt
                    /note="hprt catalyzes vital steps in the
                    reutilization pathway for purine biosynthesis
                    and its deficiency leads to forms of ""gouty"" arthritis"

    rep_origin      234..243
                    /direction=left

    CDS             109..564
                    /usedin=X10009:catalase

  • 18

    3.5.3 Location examples

    The following is a list of common location descriptors with their meanings:

    Location Description

    467 Points to a single base in the presented sequence

    340..565 Points to a continuous range of bases bounded by and including the starting and ending bases
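The two location forms above map directly onto 1-based, inclusive positions. A minimal sketch, assuming only these two forms occur (real feature tables also use operators such as complement and join, which this sketch does not handle); the function names are illustrative:

```python
def parse_location(location):
    """Return (start, end) as 1-based inclusive positions."""
    if ".." in location:
        start, end = location.split("..")
        return int(start), int(end)
    position = int(location)  # a single base
    return position, position


def extract(sequence, location):
    """Cut the described range out of a sequence string."""
    start, end = parse_location(location)
    # Python slices are 0-based and exclude the end, hence start - 1.
    return sequence[start - 1:end]


print(parse_location("467"))
print(parse_location("340..565"))
print(extract("ACGTACGT", "2..4"))
```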

  • 19


    Common sequence formats

    1. EMBL release format
    2. Genbank (ASN.1)
    3. FASTA format:

    >X12345 Y098TR gene
    CGTATCTTACGAGCTACTACGAGGTCTTATCGGACGAGCGACT...
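FASTA is simple enough to read by hand: a header line starting with ">" followed by one or more sequence lines. A minimal sketch of a reader (the function name `parse_fasta` is illustrative; the example record is the one from the slide, wrapped over two lines and with the trailing "..." dropped):

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """\
>X12345 Y098TR gene
CGTATCTTACGAGCTACTACG
AGGTCTTATCGGACGAGCGACT
"""

records = parse_fasta(EXAMPLE)
for header, sequence in records:
    print(header, len(sequence))
```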

  • 20

    Human
    Mus musculus
    Rodents
    Other Mammals
    Other Vertebrates
    Invertebrates
    Plants
    Fungi
    Prokaryotes (+ Archaea)
    Organelles
    Viruses
    Bacteriophages
    Patented
    Synthetic

    Bulk divisions: EST, HTG, STS, GSS

    EMBL divisions

    Bulk Divisions

    • Expressed Sequence Tag (EST): 1st pass single read cDNA
    • Genome Survey Sequence (GSS): 1st pass single read gDNA
    • High Throughput Genomic (HTG): incomplete sequences of genomic clones
    • Sequence Tagged Site (STS): PCR-based mapping reagents

    • Batch Submission and htg (email and ftp)
    • Inaccurate
    • Poorly Characterized

  • 21

    Two major types of DNA / nucleotide / base sequences found in databases such as GenBank and EMBL:

    * Genomic , arising from sequencing of DNA material isolated from cells

    * ESTs, arising from projects to determine what mRNAs are produced in a certain organism or in a certain type of cell within a multicellular organism.

    DNA

    mRNA

    EST (Expressed Sequence Tag)

    Expressed Sequence Tags (ESTs) correspond to partial mRNA sequences. They are sequences of cDNA that has been reverse-transcribed from mRNA.

    Short sequences (~500-1000 bases), each is result of single sequencing experiment -> high frequency of errors

    Applications:

    1) Used to answer questions like: What genes in a specific cell or tissue are expressed ?

    2) Identification of coding regions in genomic sequences

    3) Discovery of new genes

  • 22

    [Figure: an mRNA with its 5' UTR, CDS and 3' UTR, together with public ESTs]

    UniGene clusters

    UniGene partitions GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene. A majority of sequences are ESTs.

    The mouse dataset contains 90,970 clusters with a total of 3,560,546 sequences.

    High-Throughput Genomic Sequences

    The High Throughput Genomic (HTG) Sequences division was created to accommodate a growing need to make 'unfinished' genomic sequence data rapidly available to the scientific community. It was done in a coordinated effort between the three International Nucleotide Sequence databases: DDBJ, EMBL, and GenBank. The HTG division contains 'unfinished' DNA sequences generated by the high-throughput sequencing centers. Sequence data in this division are available for BLAST homology searches against either the "htgs" database or the "month" database, which includes all new submissions for the prior month. The HTG division of GenBank was described in a [Genome Research (1997) 7(10)] article by Ouellette and Boguski.

    Location of HTG records:

    Unfinished HTG sequences containing contigs greater than 2 kb are assigned an accession number and deposited in the HTG division. A typical HTG record might consist of all the first pass sequence data generated from a single cosmid, BAC, YAC, or P1 clone which together comprise more than 2 kb and contain one or more gaps. A single accession number is assigned to this collection of sequences and each record includes a clear indication of the status (phase 1 or 2) plus a prominent warning that the sequence data is "unfinished" and may contain errors. The accession number does not change as sequence records are updated; only the most recent version of a HTG record remains in GenBank. 'Finished' HTG sequences (phase 3) retain the same accession number, but are moved into the relevant primary GenBank division. An example of a submission (one accession number) that has progressed through phase 1, phase 2, and phase 3 is available

  • 23

    Genome Survey Sequence (GSS)

    This division is similar in nature to the EST division, except that its sequences will be genomic rather than cDNA (mRNA). The GSS division will contain (but not be limited to) the following types of data:

    - random "single pass read" genome survey sequences
    - single pass reads from cosmid/BAC/YAC ends
    - exon trapped genomic sequences
    - Alu PCR sequences

    STS (Sequence Tagged Sites)

    Sequence Tagged Sites (STS) are short DNA segments with a single location in the genome. This feature of STS makes them useful tags for mapping.

    Predominating divisions: Mammalian genomes - ESTs

  • 24

    FT   CDS             109..717
    FT                   /db_xref="SWISS-PROT:P28763"
    FT                   /transl_table=11
    FT                   /gene="sod"
    FT                   /EC_number="1.15.1.1"
    FT                   /product="superoxide dismutase"
    FT                   /protein_id="CAA45406.1"
    FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
    FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
    FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
    FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"

    Most entries in protein sequence databases are computational translations from gene sequences

    Three- and one-letter codes of the amino acids.

    Alanine         Ala   A
    Arginine        Arg   R
    Asparagine      Asn   N
    Aspartate       Asp   D
    Cysteine        Cys   C
    Glutamate       Glu   E
    Glutamine       Gln   Q
    Glycine         Gly   G
    Histidine       His   H
    Isoleucine      Ile   I
    Leucine         Leu   L
    Lysine          Lys   K
    Methionine      Met   M
    Phenylalanine   Phe   F
    Proline         Pro   P
    Serine          Ser   S
    Threonine       Thr   T
    Tryptophan      Trp   W
    Tyrosine        Tyr   Y
    Valine          Val   V
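A computational translation of the kind stored in the /translation qualifier can be sketched in a few lines. This is an illustration under stated assumptions, not the databases' own procedure: the codon table is the standard one written in TCAG order (translation table 11 uses the same amino-acid assignments and differs in its permitted start codons), and the input is bases 109..180 of the LISOD entry, copied from the sequence lines of the EMBL example. The names `CODON_TABLE` and `translate` are illustrative.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a coding sequence, stopping at the first stop codon."""
    dna = dna.upper().replace(" ", "")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Bases 109..180 of the LISOD entry: the start of the CDS 109..717.
CDS_START = (
    "atgacttacgaa"
    "ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg"
)

print(translate(CDS_START))
```

The output can be compared with the first 24 residues of the /translation qualifier of the entry.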

  • 25

    Most entries in protein sequence databases are computational translations from gene sequences

    .... we assume that the computational prediction of the protein product is correct ......

    The common protein sequence databases

    * SWISS-PROT

    high-quality annotation
    non-redundant
    cross-referenced to many other databases

    Release 50.7 (Sept 2006) of SWISS-PROT contains 232,345 sequence entries

    * TrEMBL

    Computer-annotated supplement to SWISS-PROT. TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database not yet integrated into SWISS-PROT.

    Release 33.7 (Sept 2006) 3,189,332 sequence entries

    (Swissprot & TrEMBL are also part of UniProt, www.ebi.uniprot.org/ )

    * NCBI GenPept (Sept 2006)

    4,007,710 sequence entries

  • 26

    Swiss-Prot annotation: how is biochemical information assigned ?

    (I) Article(s) report sequencing (nucleic acid and/or amino acid) and biochemical characterization
    (II) Article(s) report sequencing with no biochemical characterization
    (III) Protein sequence data from translation of genome sequencing data

    Translated sequences are searched against Swiss-Prot and TrEMBL. The results give rise to a number of scenarios:

    1. identical to an existing sequence in Swiss-Prot from the same organism
    2. identical to an existing sequence in Swiss-Prot from a different organism, which may or may not be related
    3. strong similarity (i.e. many residues are conserved) over the entire sequence to an existing entry (from a related or different organism)
    4. strong similarity only at regions in the sequence (from the same, a related or a different organism)
    5. some similarity to one or more existing entries
    6. no similarity to any existing entries

    Swissprot has cross-references to many other databases

  • 27

    ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
    AC   P04156;
    DT   01-NOV-1986 (REL. 03, CREATED)
    DT   01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)
    DT   01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)
    DE   MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).
    GN   PRNP.
    OS   HOMO SAPIENS (HUMAN).
    OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
    OC   EUTHERIA; PRIMATES.
    RN   [1]
    RP   SEQUENCE FROM N.A.
    RX   MEDLINE; 86300093.
    RA   KRETZSCHMAR H.A., STOWRING L.E., WESTAWAY D., STUBBLEBINE W.H.,
    RA   PRUSINER S.B., DEARMOND S.J.;
    RL   DNA 5:315-324(1986).
    RN   [2]
    RP   SEQUENCE OF 8-253 FROM N.A.
    RX   MEDLINE; 86261778.
    RA   LIAO Y.-C.J., LEBO R.V., CLAWSON G.A., SMUCKLER E.A.;
    RL   SCIENCE 233:364-367(1986).
    RN   [3]
    RP   VARIANT AMYLOID GSS, SEQUENCE OF 58-85 AND 111-150.
    RX   MEDLINE; 91160504.
    RA   TAGLIAVINI F., PRELLI F., GHISO J., BUGIANI O., SERBAN D.,
    RA   PRUSINER S.B., FARLOW M.R., GHETTI B., FRANGIONE B.;
    RL   EMBO J. 10:513-519(1991).
    RN   [4]
    RP   REVIEW ON VARIANTS.
    RX   MEDLINE; 93372867.
    RA   PALMER M.S., COLLINGE J.;
    RL   HUM. MUTAT. 2:168-173(1993).

    Example of Swissprot entry

    CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE
    CC       HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
    CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED
    CC       "RODS".
    CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
    CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND
    CC       ANIMALS INFECTED WITH NEURODEGENERATIVE DISEASES KNOWN AS
    CC       TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION DISEASES, LIKE:
    CC       CREUTZFELDT-JACOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME
    CC       (GSS), FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS; SCRAPIE
    CC       IN SHEEP AND GOAT; BOVINE SPONGIFORM ENCEPHALOPATHY (BSE) IN
    CC       CATTLE; TRANSMISSIBLE MINK ENCEPHALOPATHY (TME); CHRONIC WASTING
    CC       DISEASE (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM
    CC       ENCEPHALOPATHY (FSE) IN CATS AND EXOTIC UNGULATE ENCEPHALOPATHY
    CC       (EUE) IN NYALA AND GREATER KUDU. THE PRION DISEASES ILLUSTRATE
    CC       THREE MANIFESTATIONS OF CNS DEGENERATION: (1) INFECTIOUS (2)
    CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS. TME, CWD, BSE, FSE,
    CC       EUE ARE ALL THOUGHT TO OCCUR AFTER CONSUMPTION OF PRION-INFECTED
    CC       FOODSTUFFS.
    CC   -!- DISEASE: CJD OCCURS PRIMARILY AS A SPORADIC DISORDER (1 PER
    CC       MILLION), WHILE 10-15% ARE FAMILIAL. ACCIDENTAL TRANSMISSION OF
    CC       CJD TO HUMANS APPEARS TO BE IATROGENIC (CONTAMINATED HUMAN GROWTH
    CC       HORMONE (HGH), CORNEAL TRANSPLANTATION, ELECTROENCEPHALOGRAPHIC
    CC       ELECTRODE IMPLANTATION. . .). EPIDEMIOLOGIC STUDIES HAVE FAILED TO
    CC       IMPLICATE THE INGESTION OF INFECTED ANNIMAL MEAT IN THE
    CC       PATHOGENESIS OF CJD IN HUMAN. THE TRIAD OF MICROSCOPIC FEATURES
    CC       THAT CHARACTERIZE THE PRION DISEASES CONSISTS OF (1) SPONGIFORM
    CC       DEGENERATION OF NEURONS, (2) SEVERE ASTROCYTIC GLIOSIS THAT OFTEN
    CC       APPEARS TO BE OUT OF PROPORTION TO THE DEGREE OF NERF CELL LOSS,
    CC       AND (3) AMYLOID PLAQUE FORMATION. CJD IS CHARACTERIZED BY
    CC       PROGRESSIVE DEMENTIA AND MYOCLONIC SEIZURES, AFFECTING ADULTS IN
    CC       MID-LIFE. SOME PATIENTS PRESENT SLEEP DISORDERS, ABNORMALITIES OF
    CC       HIGH CORTICAL FUNCTION, CEREBELLAR AND CORTICOSPINAL DISTURBANCES.
    CC       THE DISEASE ENDS IN DEATH AFTER A 3-12 MONTHS ILLNESS.
    CC   -!- DISEASE: GSS IS A HETEROGENEOUS DISORDER AND WAS DEFINED AS A
    CC       "SPINOCEREBELLAR ATAXIA WITH DEMENTIA AND PLAQUELIKE DEPOSITS".
    CC       GSS INCIDENCE IS LESS THAN 2 PER 100 MILLION.
    CC   -!- DISEASE: KURU IS TRANSMITTED DURING RITUALISTIC CANNIBALISM, AMONG
    CC       NATIVES OF THE NEW GUINEA HIGHLANDS. PATIENTS EXHIBIT VARIOUS
    CC       MOVEMENT DISORDERS LIKE CEREBELLAR ABNORMALITIES, RIGIDITY OF THE
    CC       LIMBS, AND CLONUS. EMOTIONNAL LABILITY IS PRESENT, AND DEMENTIA IS
    CC       CONSPICUOUSLY ABSENT. DEATH USUALLY OCCURS FROM 3 TO 12 MONTH
    CC       AFTER ONSET.
    CC   -!- SIMILARITY: TO OTHER PRP.
    CC   -!- DATABASE: NAME=HotMolecBase; NOTE=PrP entry;
    CC       WWW="http://bioinformatics.weizmann.ac.il/hotmolecbase/entries/prp.htm".

  • 28

    FT   SIGNAL        1     22
    FT   CHAIN        23    230       MAJOR PRION PROTEIN.
    FT   PROPEP      231    253       REMOVED IN MATURE FORM (BY SIMILARITY).
    FT   LIPID       230    230       GPI-ANCHOR (BY SIMILARITY).
    FT   CARBOHYD    181    181       PROBABLE.
    FT   CARBOHYD    197    197       PROBABLE.
    FT   DISULFID    179    214       BY SIMILARITY.
    FT   DOMAIN       51     91       5 X 8 AA TANDEM REPEATS OF P-H-G-G-G-W-G-
    FT                                Q.
    FT   REPEAT       51     59       1.
    FT   REPEAT       60     67       2.
    FT   REPEAT       68     75       3.
    FT   REPEAT       76     83       4.
    FT   REPEAT       84     91       5.
    FT   VARIANT     102    102       P -> L (IN GSS).
    FT   VARIANT     105    105       P -> L (IN GSS).
    FT   VARIANT     117    117       A -> V (LINKED TO DEVELOPMENT OF
    FT                                DEMENTING GSS).
    FT   VARIANT     129    129       M -> V (DETERMINES THE DISEASE PHENOTYPE
    FT                                IN PATIENTS WHO HAVE A PRP MUTATION AT
    FT                                CODON 178: PATIENTS WITH MET DEVELOP FFI,
    FT                                THOSE WITH VAL DEVELOP CJD).
    FT   VARIANT     178    178       D -> N (IN FFI AND CJD).
    FT   VARIANT     180    180       V -> I (IN CJD).
    FT   VARIANT     198    198       F -> S (IN A ATYPICAL FORM OF GSS WITH
    FT                                NEUROFIBRILLARY TANGLES).
    FT   VARIANT     200    200       E -> K (IN CJD).
    FT   VARIANT     210    210       V -> I (IN CJD).
    FT   VARIANT     217    217       Q -> R (IN GSS WITH NEUROFIBRILLARY
    FT                                TANGLES).
    FT   VARIANT     232    232       M -> R (IN CJD).
    FT   CONFLICT    118    118       MISSING (IN REF. 2).
    SQ   SEQUENCE   253 AA;  27661 MW;  FD5373AD CRC32;
         MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP
         HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA
         VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV
         NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV
         ILLISFLIFL IVG

    //
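The VARIANT lines of the feature table are machine-readable: each gives a position and a residue change that can be checked against the sequence. A minimal sketch, assuming single-residue "X -> Y" variants only; the function names are illustrative, and only the first 130 residues of the entry are included to keep the example short.

```python
import re

# First 130 residues of PRIO_HUMAN, copied from the entry (spaces removed).
SEQUENCE = (
    "MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP"
    "HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA"
    "VVGGLGGYML"
).replace(" ", "")

FT_LINE = "FT   VARIANT     102    102       P -> L (IN GSS)."


def parse_variant(ft_line):
    """Return (position, reference residue, variant residue)."""
    match = re.match(
        r"FT\s+VARIANT\s+(\d+)\s+(\d+)\s+([A-Z]) -> ([A-Z])", ft_line
    )
    if match is None:
        raise ValueError("not a single-residue VARIANT line")
    return int(match.group(1)), match.group(3), match.group(4)


def apply_variant(sequence, position, reference, variant):
    """Replace one residue (1-based position) after checking the reference."""
    if sequence[position - 1] != reference:
        raise ValueError(f"expected {reference} at position {position}")
    return sequence[:position - 1] + variant + sequence[position:]


position, reference, variant = parse_variant(FT_LINE)
mutated = apply_variant(SEQUENCE, position, reference, variant)
print(SEQUENCE[position - 1], "->", mutated[position - 1])
```

The reference check is a cheap guard against off-by-one errors, since database positions are 1-based while Python indices are 0-based.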

    Protein classification databases

    PROSITE
    Pfam
    InterPro

  • 29

    Prosite: Patterns are identified from multiple alignments of protein sequences

    Conserved sequence elements in a protein often correspond to specific structural motifs with specific biological function

  • 30

    PROSITE

    Release 16.45 : 1483 patterns

    Example I

    ID   ATP_GTP_A; PATTERN.
    AC   PS00017;
    DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); NOV-1990 (INFO UPDATE).
    DE   ATP/GTP-binding site motif A (P-loop).
    PA   [AG]-x(4)-G-K-[ST].
    CC   /TAXO-RANGE=ABEPV;
    3D   1EFM; 1ETU; 1Q21; 2Q21; 4Q21; 5Q21; 6Q21;
    DO   PDOC00017;

    [AG]-x(4)-G-K-[ST]

    Example II

    ID   ZINC_FINGER_C2H2; PATTERN.
    AC   PS00028;
    DT   APR-1990 (CREATED); JUN-1994 (DATA UPDATE); NOV-1997 (INFO UPDATE).
    DE   Zinc finger, C2H2 type, domain.
    PA   C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.
    NR   /RELEASE=35,69113;

  • 31

    A typical application of PROSITE:

    A 'new' protein sequence has been identified using bioinformatics methods (like in a genome project). A scan of PROSITE using this sequence (regular expression matching) can give important clues as to the biological function of the protein.
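Since a PROSITE scan is regular expression matching, the two PA lines shown in the examples can be turned into ordinary regular expressions. The sketch below is a simplified converter written for illustration (the function name `prosite_to_regex` is made up, and it covers only the syntax elements seen here plus the {..} exclusion form; anchors and other features of the pattern syntax are left out). The test fragment is a short sequence chosen for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")  # PA lines end with a period
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "x(2,4)" into its core and repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."  # x means any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # {..} means "none of these"
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


P_LOOP = prosite_to_regex("[AG]-x(4)-G-K-[ST].")
ZINC_FINGER = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")

FRAGMENT = "MTEYKLVVVGAGGVGKSALT"  # illustrative test sequence
hit = re.search(P_LOOP, FRAGMENT)
print(P_LOOP, hit.group(0), hit.start() + 1)
```

Short patterns such as the P-loop match many unrelated proteins by chance, so a hit is a clue to function rather than proof of it.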

    Pfam

  • 32

    The Pfam database is based on profile analysis of multiple alignments.

    Multiple alignment

    GAQWEER
    GGLWEEK
    GAQWEEK
    GAQWDER
    AANWEER

    Profile (one row per alignment column; only the score columns A to P are shown)

    Pos      A     C     D     E     F     G     H     I     K     L     M     N     P
    1       39  -186   -83  -133  -206   366  -144  -257  -134  -257  -195   -21  -135
    2      244   -32  -134   -82  -155    59  -144  -104   -83  -104   -93  -123   -83
    3      -70  -158   -32    46  -148  -120   -21  -128    17   -69    -1    29   -91
    4     -301  -202  -401  -301    98  -202  -201  -302  -301  -202  -101  -401  -402
    5      -83  -277   182   326  -216  -135   -11  -216    50  -226  -154     9   -73
    6     -101  -402   199   499  -302  -202    -1  -302    99  -302  -201    -1  -102
    7      -62  -186  -101    22  -185  -124   -24  -185   189  -124   -62    -1  -102

    Typical application of Pfam

    A 'new' protein sequence has been identified using bioinformatics methods (like in a genome project). A search is carried out where each profile in Pfam is matched to the protein sequence.

    If there is a significant hit to a Pfam entry with known function => a biological function is assigned to our new protein.
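The idea of matching a profile to a sequence can be sketched with the five-sequence alignment from the slide. This is a deliberately simplified illustration and not the method Pfam itself uses: the scores are plain log-odds values computed from column counts with a pseudocount and a uniform background (both assumptions), so they will not reproduce the numbers in the slide's profile. All function names are illustrative.

```python
import math
from collections import Counter

ALIGNMENT = ["GAQWEER", "GGLWEEK", "GAQWEEK", "GAQWDER", "AANWEER"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """One dict of log-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, segment):
    """Sum of column scores for a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (score, 0-based start)."""
    width = len(profile)
    windows = [
        (score(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(windows)


profile = build_profile(ALIGNMENT)
print(best_hit(profile, "MKTGAQWEEKLLP"))
```

A sequence window resembling the alignment scores well above unrelated windows; deciding whether a score is significant needs the statistics discussed elsewhere in the course.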

  • 33

    InterPro

    InterPro is an integrated documentation resource for protein families, domains and sites. InterPro combines a number of databases that use different methodologies and a varying degree of biological information on well-characterised proteins to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated diagnostic tool.

  • 34

  • 35

    Databases like InterPro have aided considerably in the annotation of the human genome


  • 36

    Gene ontology - a controlled vocabulary

    .... attempts to address the problem that a protein, gene or biological process is not named in a consistent manner in the various information resources available.

    quoted from www.geneontology.org:

    "Biologists currently waste a lot of time and effort in searching for all of the available information about each small area of research. This is hampered further by the wide variations in terminology that may be common usage at any given time, and that inhibit effective searching by computers as well as people. For example, if you were searching for new targets for antibiotics, you might want to find all the gene products that are involved in bacterial protein synthesis, and that have significantly different sequences or structures from those in humans. But if one database describes these molecules as being involved in 'translation', whereas another uses the phrase 'protein synthesis', it will be difficult for you - and even harder for a computer - to find functionally equivalent terms."

  • 37

    Gene ontology* consortium**

    Major principles

    •Molecular function

    •Biological process

    •Cellular component

    * "The subject of ontology is the study of the categories of things that exist or may exist in some domain"

    ** "The goal of the Gene Ontology™ Consortium is to produce a dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing."

    Gene ontology www.geneontology.org

  • 38

    Important task for biologists and bioinformaticians:

    - assign gene ontology (GO) term(s) to entries in molecular databases, for instance, assign GO terms to all entries in the protein sequence databases or to all entries in InterPro.

    Once we have this sort of information GO becomes highly valuable:

    Example: The microarray technology allows us to measure the expression of most human genes under a certain experimental condition. Using gene ontology we can ask questions like:

    "In this tumor tissue, what genes are expressed that are related to cell division?"
    OR
    "In this tumor tissue, what GO terms are over-represented as compared to normal tissue?"
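One common way to make "over-represented" precise is a hypergeometric test; the slides do not name a method, so this is an assumption. The question becomes: if a set of genes were drawn at random from the genome, how likely is it to contain at least as many genes carrying a given GO term as were observed? A minimal sketch (the function name is illustrative and all gene counts in the example are hypothetical):

```python
from math import comb  # Python 3.8+


def enrichment_p_value(population, annotated, sample, overlap):
    """P(X >= overlap) when `sample` genes are drawn without replacement
    from `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(population, sample)


# Hypothetical numbers: 20,000 genes measured, 400 annotated with the term,
# 500 genes expressed in the tumor sample, 30 of them carrying the term.
p = enrichment_p_value(population=20000, annotated=400, sample=500, overlap=30)
print(f"p = {p:.3g}")
```

By chance one would expect about 10 annotated genes among the 500 (500 x 400 / 20,000), so observing 30 gives a small p-value. When many GO terms are tested at once, a multiple-testing correction is needed as well.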

  • 39


    Three-dimensional structures of proteins, DNA and RNA are collected in the Protein Data Bank (PDB)

    They are all the result of experimental work:

    * X-ray crystallography
    * NMR

  • 40

  • 41

    Example of PDB entry

    HEADER    HORMONE                                 30-OCT-92   1BPH      1BPH   2
    COMPND    INSULIN (CUBIC) IN 0.1M SODIUM SALT SOLUTION AT PH9           1BPH   3
    SOURCE    BOVINE (BOS $TAURUS) PANCREAS                                 1BPH   4
    AUTHOR    O.GURSKY,J.BADGER,Y.LI,D.L.D.CASPAR                           1BPH   5
    REVDAT   2   31-OCT-93 1BPHA   1       REMARK HET    FORMUL             1BPHA  1
    REVDAT   1   15-JAN-93 1BPH    0                                        1BPH   6
    JRNL        AUTH   O.GURSKY,J.BADGER,Y.LI,D.L.D.CASPAR                  1BPH   7
    JRNL        TITL   CONFORMATIONAL CHANGES IN CUBIC INSULIN CRYSTALS     1BPH   8
    JRNL        TITL 2 IN THE PH RANGE 7-11                                 1BPH   9
    JRNL        REF    BIOPHYS.J.                    V.  63  1210 1992      1BPH  10
    JRNL        REFN   ASTM BIOJAU  US ISSN 0006-3495                 030   1BPH  11
    REMARK   1                                                              1BPH  12
    REMARK   1 REFERENCE 1

    ATOM      1  N   GLY A   1      13.994  47.196  31.798  1.00 35.87      1BPH 129
    ATOM      2  CA  GLY A   1      14.277  46.226  30.708  1.00 38.67      1BPH 130
    ATOM      3  C   GLY A   1      15.574  45.507  31.085  1.00 31.18      1BPH 131
    ATOM      4  O   GLY A   1      16.078  45.660  32.217  1.00 22.60      1BPH 132
    ATOM      5  N   ILE A   2      16.088  44.766  30.126  1.00 28.39      1BPH 133
    ATOM      6  CA  ILE A   2      17.342  44.034  30.404  1.00 23.76      1BPH 134
    ATOM      7  C   ILE A   2      18.526  44.939  30.686  1.00 25.29      1BPH 135
    ATOM      8  O   ILE A   2      19.425  44.457  31.392  1.00 18.74      1BPH 136
    ATOM      9  CB  ILE A   2      17.571  43.072  29.158  1.00 27.36      1BPH 137
    ATOM     10  CG1 ILE A   2      18.638  42.049  29.605  1.00 18.03      1BPH 138
    ATOM     11  CG2 ILE A   2      17.859  43.936  27.903  1.00 25.54      1BPH 139
    ATOM     12  CD1 ILE A   2      18.914  40.930  28.590  1.00 17.07      1BPH 140
    ATOM     13  N   VAL A   3      18.619  46.195  30.192  1.00 24.42      1BPH 141
    ATOM     14  CA  VAL A   3      19.774  47.080  30.436  1.00 30.26      1BPH 142
    ATOM     15  C   VAL A   3      19.952  47.453  31.895  1.00 19.08      1BPH 143
    ATOM     16  O   VAL A   3      21.018  47.421  32.561  1.00 28.15      1BPH 144
    ATOM     17  CB  VAL A   3      19.719  48.274  29.462  1.00 33.87      1BPH 145
    ATOM     18  CG1 VAL A   3      20.847  49.225  29.754  1.00 30.40      1BPH 146
    ATOM     19  CG2 VAL A   3      19.868  47.724  28.044  1.00 24.51
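Each ATOM record carries the atom name, residue, chain, residue number and the x, y, z coordinates in Ångström, so simple geometry can be computed directly from the file. A minimal sketch using four records from the 1BPH example: it splits on whitespace for brevity, which is an assumption that happens to hold for these lines (real PDB files are fixed-column and should be parsed by column position). The function names are illustrative.

```python
import math  # math.dist needs Python 3.8+

# Four ATOM records from the 1BPH example (trailing record IDs dropped).
ATOM_LINES = """\
ATOM 1 N GLY A 1 13.994 47.196 31.798 1.00 35.87
ATOM 2 CA GLY A 1 14.277 46.226 30.708 1.00 38.67
ATOM 5 N ILE A 2 16.088 44.766 30.126 1.00 28.39
ATOM 6 CA ILE A 2 17.342 44.034 30.404 1.00 23.76
"""


def parse_atoms(text):
    """Parse whitespace-separated ATOM records into dictionaries."""
    atoms = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "ATOM":
            continue
        atoms.append({
            "name": fields[2],
            "residue": fields[3],
            "chain": fields[4],
            "resnum": int(fields[5]),
            "xyz": tuple(float(value) for value in fields[6:9]),
        })
    return atoms


def distance(atom_a, atom_b):
    """Euclidean distance between two atoms, in Ångström."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])


atoms = parse_atoms(ATOM_LINES)
alpha_carbons = {a["resnum"]: a for a in atoms if a["name"] == "CA"}
d = distance(alpha_carbons[1], alpha_carbons[2])
print(f"CA(GLY 1) - CA(ILE 2): {d:.2f} A")
```

The distance between the alpha carbons of the first two residues comes out close to 3.8 Å, the usual spacing between consecutive residues in a polypeptide chain.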

  • 42

    3D viewers

    Several programs are available for viewing protein and nucleic acid 3D structures:

    Rasmol www.umass.edu/microbio/rasmol/

    Weblab www.msi.com

    Kinemage www.cryst.bbk.ac.uk/PPS/vsns-pps/technology/kinemage.html

    Chime www.umass.edu/microbio/rasmol/

    Protein explorer www.umass.edu/microbio/chime/explorer/

    Cn3D www.ncbi.nlm.nih.gov/Entrez

    SwissPDBviewer expasy.proteome.org.au/spdbv/

    (Molscript www.avatar.se/molscript/)

  • 43

    Weblab viewer

  • 44