©CMBI 2001 What are we looking for? Data & databases

Post on 19-Dec-2015


©CMBI 2001

What are we looking for?

Data & databases

©CMBI 2002

Your questions

Lookup

Compare

Predict

Your questions

Lookup

• Is the gene known for my protein (or vice versa)?
• On which chromosome is the gene located?
• What sequence patterns are present in my protein?
• Are the mutations known which cause this disease?
• To what class or family does my protein belong?
• What is known about this family?

Your questions

Compare

• Are there protein sequences in the database which resemble the protein I cloned?

• How can I optimally align the members of this protein family?

• Are these two proteins similar?

Sequence similarity

MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG

Imagine you sequenced this human protein.

You know it is a serine protease.
Which residues belong to the active site?
Is its sequence similar to the mouse serine protease?

Alignment

MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
MMISRPPPAL GGDQFSILIL LVLLTSTAPI SAATIRVSPD CGKPQQLNRI VGGEDSMDAQ
*::* .**** **. :. :   *:**:*** : .** * *.*  *********: ****** *::

WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
WPWIVSILKN GSHHCAGSLL TNRWVVTAAH CFKSNMDKPS LFSVLLGAWK LGSPGPRSQK
******* ** *:******** *.***:**** ***.*::**  *********: **.**.****

VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
VGIAWVLPHP RYSWKEGTHA DIALVRLEHS IQFSERILPI CLPDSSVRLP PKTDCWIAGW
**:*** ***  ******: * ********:* ******:*** ****:*::** *:*.***:**

GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
GSIQDGVPLP HPQTLQKLKV PIIDSELCKS LYWRGAGQEA ITEGMLCAGY LEGERDACLG
********** ********** ******:*.  ******** . ***.****** **********

DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG
DSGGPLMCQV DDHWLLTGII SWGEGCAD-D RPGVYTSLLA HRSWVQRIVQ GVQLRG----
********** *. ***:*** *******: : ***** ** * *****::*** ******

=> Transfer of information
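
How similar two aligned sequences are can be put into a number. The sketch below counts identical residues over the columns where neither sequence has a gap, using the last block of the alignment above. The function name percent_identity and the choice to skip gap columns are assumptions made for illustration; alignment programs may define identity differently.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-"):
    """Return (identical, compared, percent) for two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # assumption: columns containing a gap are not counted
        compared += 1
        if a == b:
            identical += 1
    percent = 100.0 * identical / compared if compared else 0.0
    return identical, compared, percent


# Last block of the alignment shown on the slide (block spacing removed).
first = "DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG".replace(" ", "")
second = "DSGGPLMCQV DDHWLLTGII SWGEGCAD-D RPGVYTSLLA HRSWVQRIVQ GVQLRG----".replace(" ", "")

identical, compared, percent = percent_identity(first, second)
print(f"{identical}/{compared} identical ({percent:.1f}%)")  # 46/55 identical (83.6%)
```

High identity over the whole alignment is what justifies transferring information, such as active-site residues, from the annotated sequence to the new one.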

©CMBI 2000 J Leunissen

Are these structures similar?

Your questions

Predict

• Can I predict the active site residues of this enzyme?
• Why are these patients ill?
• Can I make a 3D model for my protein?
• Can I predict a (better) drug for this target?
• How can I improve the thermostability of this protein? (protein engineering)
• How can I predict the genes located on this genome?

How to find the answers to these questions?

Outline

Morning
• Data in databases

Afternoon
• Programs (tools) to search these databases
• Knowledge of how to search the databases with these tools (hands-on)

Biological Databases

The number of databases
- DBCAT currently lists over 500 databases

The size of databases
- Grows exponentially
- EMBL database: New entries entered at 6.3 sec/seq! (July 2001)
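
To get a feeling for the quoted rate, it can be converted into entries per day and per year. This is only back-of-the-envelope arithmetic. It assumes that "6.3 sec/seq" means one new sequence every 6.3 seconds and that the rate stays constant, which exponential growth contradicts.

```python
SECONDS_PER_DAY = 24 * 60 * 60
seconds_per_entry = 6.3  # rate quoted on the slide (July 2001)

# Assumption: the rate is constant over the whole period.
entries_per_day = SECONDS_PER_DAY / seconds_per_entry
entries_per_year = entries_per_day * 365

print(f"{entries_per_day:,.0f} entries per day")
print(f"{entries_per_year:,.0f} entries per year")
```

At that pace roughly five million sequences would be added in a year.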

©CMBI 2001 J Leunissen

(July 2001)

Primary and Secondary Databases

Primary databases

REAL EXPERIMENTAL DATA
Biomolecular sequences or structures and associated annotation information (organism, function, mutation linked to disease, functional/structural patterns, bibliographic etc.)

Secondary databases

DERIVED INFORMATION
Fruits of analyses of sequences in the primary sources (patterns, blocks, profiles etc. which represent the most conserved features of multiple alignments)

Primary Databases

Sequence Information
– DNA: EMBL, Genbank, DDBJ
– Protein: SwissProt, TREMBL, PIR, OWL

Genome Information
– GDB, MGD, ACeDB, ENSEMBL

Structure Information
– PDB, NDB, CCDB/CSD

Secondary Databases

Sequence-related Information
– ProSite, REBase

Genome-related Information
– OMIM, TransFac

Structure-related Information
– DSSP, HSSP, FSSP, PDBFinder

Pathway Information
– KEGG, Pathways

Function-related
– Enzyme, GO

Databases

Data must be in a certain format for the programs to recognize them.

Every database can have its own format, but some data elements are essential for every database:

1. Unique identifier, or accession code
2. Name of depositor
3. Literature references
4. Deposition date
5. The real data
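
The five essential elements listed above can be pictured as one small record type. The sketch below is purely illustrative. The class name DatabaseEntry and its field names are invented here and do not correspond to any real database schema. The example values are taken from the crambin entry used later in these slides.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class DatabaseEntry:
    """Hypothetical container for the five essential data elements."""
    accession: str                                       # 1. unique identifier
    depositor: Optional[str] = None                      # 2. name of depositor
    references: List[str] = field(default_factory=list)  # 3. literature references
    deposited: Optional[date] = None                     # 4. deposition date
    data: str = ""                                       # 5. the real data


crambin = DatabaseEntry(
    accession="P01542",
    # SwissProt has no depositor record; the first author of the
    # first reference is used here as a stand-in.
    depositor="Teeter M.M.",
    references=["Biochemistry 20:5437-5443(1981)"],
    deposited=date(1986, 7, 21),
    data="TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN",
)
print(crambin.accession, len(crambin.data))
```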

3 examples

1. SwissProt
2. EMBL
3. PDB

Quality of databases

SwissProt

• Data is only entered by annotation experts

EMBL, PDB

• Everybody can submit data
• Data are accepted the way they are submitted

SwissProt database

• Database of protein sequences
• Produced by Amos Bairoch (University of Geneva) and the EMBL Data Library
• Data derived from:
  – translations of DNA sequences (from EMBL Database)
  – adapted from the PIR collection
  – extracted from the literature
  – and directly submitted by researchers
• SwissProt & SwissNew
• July 2001:
  – ~86,600 entries, ~15,000 new entries / year
  – Swissnew: 53,000 entries
• Ca. 200 annotation experts worldwide
• Keyword-organised flatfile

SwissProt records (1)

ID identification line

ID ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
ID CRAM_CRAAB STANDARD; PRT; 46 AA.

Format for the ENTRY_NAME: NAME_SPECIES (10 characters)

For a number of organisms (16), SPECIES has a recognizable name:

HUMAN, MOUSE, CHICK, BOVIN, YEAST, ECOLI, …

N.B. The ID can change, e.g. the serotonin receptors have been given a new nomenclature
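
The ID line is regular enough to take apart with a few string operations. The following is a minimal sketch, assuming the field order shown above. The function name parse_id_line and the dictionary keys are invented for illustration.

```python
def parse_id_line(line: str) -> dict:
    """Split a SwissProt ID line into its named parts."""
    code, rest = line.split(None, 1)
    if code != "ID":
        raise ValueError("not an ID line")
    entry_name, remainder = rest.split(None, 1)
    # The remaining fields are separated by ';' and the line ends with '.'
    data_class, molecule_type, length = [
        part.strip() for part in remainder.rstrip(". \n").split(";")
    ]
    name, _, species = entry_name.partition("_")
    return {
        "entry_name": entry_name,
        "name": name,
        "species": species,
        "data_class": data_class,
        "molecule_type": molecule_type,
        "length": int(length.split()[0]),  # "46 AA" -> 46
    }


record = parse_id_line("ID CRAM_CRAAB STANDARD; PRT; 46 AA.")
print(record["name"], record["species"], record["length"])
```

Because the ID can change, a program that stores references to entries should keep the accession number (AC) rather than the entry name.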

SwissProt records (2)

AC accession number
AC P01542;
AC is unique: name, sequence, everything can change, but the AC stays the same

DT deposition date
DT 21-JUL-1986 (Rel. 01, Created)
DT 30-MAY-2000 (Rel. 39, Last sequence update)
DT 30-MAY-2000 (Rel. 39, Last annotation update)
1) You cannot see what the last annotation update was
2) No depositor record (implicit: author of first reference)

SwissProt records (3)

DE description
DE CRAMBIN.
DE 6-phosphofructo-2-kinase 1 (EC 2.7.1.105) (Phosphofructokinase 2 I)
1) General descriptive information
2) Free-format

GN gene name
GN THI2.

OS & OC & OG
OS Crambe abyssinica (Abyssinian crambe).
OC Eukaryota; Viridiplantae; Embryophyta; Tracheophyta; Spermatophyta;
OC Magnoliophyta; eudicotyledons; Rosidae; eurosids II; Brassicales;
OC Brassicaceae; Crambe.
Organism Species; Organism Classification; OrGanelle

SwissProt records (4)

RN References
RN [1]
RP SEQUENCE.
RX MEDLINE; 82046542.
RA Teeter M.M., Mazer J.A., L'Italien J.J.;
RT "Primary structure of the hydrophobic plant protein crambin.";
RL Biochemistry 20:5437-5443(1981).

CC Comments or notes
CC -!- FUNCTION: THE FUNCTION OF THIS HYDROPHOBIC PLANT SEED PROTEIN
CC     IS NOT KNOWN.
CC -!- MISCELLANEOUS: TWO ISOFORMS EXISTS, A MAJOR FORM PL (SHOWN HERE)
CC     AND A MINOR FORM SI.
CC -!- SIMILARITY: BELONGS TO THE PLANT THIONIN FAMILY.

SwissProt records (5)

DR Database Cross Reference
DR PIR; A01805; KECX.
DR PDB; 1CRN; 16-APR-87.
DR PDB; 1CBN; 31-JAN-94.
DR PDB; 1CCM; 31-OCT-93.
DR PDB; 1CCN; 31-JAN-94.
DR PDB; 1CNR; 31-AUG-94.
DR PDB; 1AB1; 12-AUG-97.
DR INTERPRO; IPR001010; -.
DR PFAM; PF00321; plant_thionins; 1.
DR PRINTS; PR00287; THIONIN.
DR PROSITE; PS00271; THIONIN; 1.

KW Keyword
Not standardized (under control of depositor)
KW Thionin; 3D-structure.

SwissProt records (6)

FT Feature table data

FT   DISULFID      3     40
FT   DISULFID      4     32
FT   DISULFID     16     26
FT   VARIANT      22     22       P -> S (IN ISOFORM SI).
FT   VARIANT      25     25       L -> I (IN ISOFORM SI).
FT   STRAND        2      3
FT   HELIX         7     16
FT   TURN         17     19
FT   HELIX        23     30
FT   TURN         31     31
FT   STRAND       33     34
FT   TURN         42     43
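
Feature positions refer to residue numbers in the sequence of the same entry, so they can be checked against it. The sketch below reads a few of the FT lines above and confirms that every DISULFID position falls on a cysteine in the crambin sequence of this entry. The function name parse_ft_lines is invented for illustration, and the parser only handles lines of the simple "key, from, to, description" shape shown here.

```python
def parse_ft_lines(lines):
    """Return (key, start, end, description) tuples from FT lines."""
    features = []
    for line in lines:
        parts = line.split(None, 4)
        if len(parts) < 4 or parts[0] != "FT":
            continue
        # Assumption: both positions are plain integers; other lines are skipped.
        if not (parts[2].isdigit() and parts[3].isdigit()):
            continue
        description = parts[4].strip() if len(parts) > 4 else ""
        features.append((parts[1], int(parts[2]), int(parts[3]), description))
    return features


ft_lines = [
    "FT DISULFID 3 40",
    "FT DISULFID 4 32",
    "FT DISULFID 16 26",
    "FT VARIANT 22 22 P -> S (IN ISOFORM SI).",
    "FT HELIX 7 16",
]
sequence = "TTCCPSIVAR SNFNVCRLPG TPEALCATYT GCIIIPGATC PGDYAN".replace(" ", "")

features = parse_ft_lines(ft_lines)
disulfides = [(start, end) for key, start, end, _ in features if key == "DISULFID"]
for start, end in disulfides:
    # Positions are 1-based, Python strings are 0-based.
    print(start, sequence[start - 1], end, sequence[end - 1])
```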

Feature table

Other features: post-translational modifications, binding sites, enzyme active sites, local secondary structure or other characteristics reported in the cited references. Sequence conflicts between references are also included.

FT   CONFLICT     33     33       MISSING (IN REF. 2).
FT   MUTAGEN     123    123       G->R,L,M: DNA BINDING LOST.
FT   MOD_RES      11     11       PHOSPHORYLATION (BY PKC).
FT   LIPID         1      1       MYRISTATE.
FT   CARBOHYD    103    103       GLUCOSYLGALACTOSE.
FT   METAL        87     87       COPPER (POTENTIAL).
FT   BINDING      14     14       HEME (COVALENT).
FT   PROPEP       27     28       ACTIVATION PEPTIDE.
FT   DOMAIN       22    788       EXTRACELLULAR (POTENTIAL).
FT   ACT_SITE    193    193       ACCEPTS A PROTON DURING CATALYSIS.

SwissProt records (7)

SQ sequence header
SQ SEQUENCE 46 AA; 4736 MW; 919E68AF159EF722 CRC64;

Sequence data
     TTCCPSIVAR SNFNVCRLPG TPEALCATYT GCIIIPGATC PGDYAN

// Termination line
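
Because every line starts with a two-letter code, a whole record can be read with a very small program. The sketch below groups the lines of an abbreviated crambin record by line code and joins the sequence lines that follow SQ. The function name parse_record is invented for illustration, and real entries contain details (continuation lines, multiple references) that this sketch does not interpret.

```python
from collections import defaultdict


def parse_record(text: str):
    """Group the lines of one flat-file record by line code.

    Returns (fields, sequence): fields maps each two-letter code to the
    list of its line contents, sequence is the residue string after SQ.
    """
    fields = defaultdict(list)
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # termination line
            break
        if in_sequence:
            sequence_parts.append(line.replace(" ", ""))
        elif line.strip():
            code, _, value = line.partition(" ")
            fields[code].append(value.strip())
            if code == "SQ":
                in_sequence = True  # everything up to '//' is sequence data
    return dict(fields), "".join(sequence_parts)


RECORD = """\
ID CRAM_CRAAB STANDARD; PRT; 46 AA.
AC P01542;
DE CRAMBIN.
GN THI2.
OS Crambe abyssinica (Abyssinian crambe).
KW Thionin; 3D-structure.
SQ SEQUENCE 46 AA; 4736 MW; 919E68AF159EF722 CRC64;
     TTCCPSIVAR SNFNVCRLPG TPEALCATYT GCIIIPGATC PGDYAN
//
"""

fields, sequence = parse_record(RECORD)
print(fields["AC"], len(sequence))
```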

EMBL database

• Nucleotide database
• EMBL & EMNEW
• July 2001:
  – EMBL: 3,951,820 entries, EMNEW: 323,703
  – EMEST*: 8,092,600, EMNEWEST*: 619,777

*) EMEST/EMNEWEST = EST section of EMBL; EST = expressed sequence tag

• EMBL records follow roughly the same scheme as SwissProt
• Obligatory deposit of sequence in EMBL (or SwissProt) before publication

Protein Data Bank (PDB)

• Databank for macromolecular structure data (3-dimensional coordinates)

• Obligatory deposit of coordinates in the PDB before publication

• ~16,000 entries (October 2001)
• PDB file is a keyword-organised flat-file (80 columns)
  1) human readable
  2) every line starts with a keyword (3-6 letters)
  3) platform independent
• Started ca. 25 years ago (on punched cards!)

PDB records (1)

Filename = accession number = PDB code
1) Filename is 4 positions (often 1 digit & 3 letters, e.g. 1CRN)
2) Be aware: 0HYK means entry HYK does not contain coordinates

HEADER describes molecule & gives deposition date
HEADER PLANT SEED PROTEIN 30-APR-81 1CRN 1CRND 1

COMPND name of molecule
COMPND CRAMBIN 1CRN 4

SOURCE organism
SOURCE ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED 1CRN 5
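
The two remarks about PDB codes above translate into a simple check. The sketch below assumes a code is one digit followed by three letters or digits, and that a leading 0 marks an entry without coordinates, as stated on the slide. The function names are invented for illustration.

```python
import re

# Assumption: one digit followed by three alphanumeric characters.
PDB_CODE = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def is_pdb_code(code: str) -> bool:
    """True if the string has the shape of a 4-character PDB code."""
    return bool(PDB_CODE.match(code))


def has_coordinates(code: str) -> bool:
    """False for codes starting with 0, which carry no coordinates."""
    if not is_pdb_code(code):
        raise ValueError(f"not a PDB code: {code!r}")
    return not code.startswith("0")


for code in ("1CRN", "4MDH", "0HYK"):
    print(code, has_coordinates(code))
```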

PDB records (2)

AUTHOR The depositor
AUTHOR W.A.HENDRICKSON,M.M.TEETER 1CRN 6

JRNL
JRNL AUTH M.BLABER,X.-J.ZHANG,B.W.MATTHEWS 111L 10
JRNL TITL STRUCTURAL BASIS OF ALPHA-HELIX PROPENSITY AT TWO 111L 11
JRNL TITL 2 SITES IN T4 LYSOZYME 111L 12
JRNL REF SCIENCE V. 260 1637 1993 111L 13
JRNL REFN ASTM SCIEAS US ISSN 0036-8075 038 111L 14

REMARK Not standardized: many different REMARK records & subrecords!
REMARK 1 REFERENCE 3 1CRNC 10
REMARK 1 AUTH M.M.TEETER,W.A.HENDRICKSON 1CRN 16
REMARK 1 TITL HIGHLY ORDERED CRYSTALS OF THE PLANT SEED PROTEIN 1CRN 17
REMARK 1 TITL 2 CRAMBIN 1CRN 18
REMARK 1 REF J.MOL.BIOL. V. 127 219 1979 1CRN 19
REMARK 1 REFN ASTM JMOBAK UK ISSN 0022-2836 070 1CRN 20
REMARK 2 1CRN 21
REMARK 2 RESOLUTION. 1.5 ANGSTROMS. 1CRN 22

PDB records (3)

SEQRES Sequence of protein
Be aware: 3D coordinates are not always present for all the amino acids in SEQRES!
SEQRES 1 46 THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE 1CRN 51
SEQRES 2 46 ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS 1CRN 52
SEQRES 3 46 ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR 1CRN 53
SEQRES 4 46 CYS PRO GLY ASP TYR ALA ASN 1CRN 54

HET & FORMUL metals, cofactors, ions, etc.
HET NAD A 1 44 NAD CO-ENZYME 4MDH 219
HET SUL A 2 5 SULFATE 4MDH 220
HET NAD B 1 44 NAD CO-ENZYME 4MDH 221
HET SUL B 2 5 SULFATE 4MDH 222
FORMUL 3 NAD 2(C21 H28 N7 O14 P2) 4MDH 223
FORMUL 4 SUL 2(O4 S1) 4MDH 224
FORMUL 5 HOH *471(H2 O1) 4MDH 225
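
SEQRES stores the sequence in three-letter code, so comparing it with a SwissProt sequence first requires a conversion to one-letter code. The sketch below does this for the crambin SEQRES lines above and lists the positions where the result differs from the SwissProt sequence of entry P01542. The function name seqres_to_sequence is invented for illustration. It treats every three-letter alphabetic token as a residue and maps unknown names to X, which is a simplification that only suits protein chains.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_sequence(lines):
    """Convert SEQRES lines to a one-letter protein sequence."""
    residues = []
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] != "SEQRES":
            continue
        for token in tokens[1:]:
            # Counters and the entry code are not 3-letter alphabetic tokens.
            if len(token) == 3 and token.isalpha():
                residues.append(THREE_TO_ONE.get(token, "X"))
    return "".join(residues)


seqres = [
    "SEQRES 1 46 THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE 1CRN 51",
    "SEQRES 2 46 ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS 1CRN 52",
    "SEQRES 3 46 ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR 1CRN 53",
    "SEQRES 4 46 CYS PRO GLY ASP TYR ALA ASN 1CRN 54",
]
pdb_sequence = seqres_to_sequence(seqres)
swissprot = "TTCCPSIVAR SNFNVCRLPG TPEALCATYT GCIIIPGATC PGDYAN".replace(" ", "")

differences = [
    (position, a, b)
    for position, (a, b) in enumerate(zip(swissprot, pdb_sequence), start=1)
    if a != b
]
print(pdb_sequence)
print(differences)
```

The single difference, at position 25, matches the VARIANT line for isoform SI in the SwissProt feature table.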

PDB records (4)

HELIX/SHEET/TURN Secondary structure elements as provided by the crystallographer (subjective)
HELIX 1 H1 ILE 7 PRO 19 1 3/10 CONFORMATION RES 17,19 1CRN 55
SHEET 2 S1 2 CYS 32 ILE 35 -1 1CRN 58
TURN 1 T1 PRO 41 TYR 44 1CRN 59

SSBOND disulfide bridges
SSBOND 1 CYS 3 CYS 40 1CRN 60
SSBOND 2 CYS 4 CYS 32 1CRN 61

CRYST1, ORIGX1, ORIGX2, ORIGX3, SCALE1, SCALE2, SCALE3 crystallographic parameters
CRYST1 40.960 18.650 22.520 90.00 90.77 90.00 P 21 2 1CRN 63
ORIGX1 1.000000 0.000000 0.000000 0.00000 1CRN 64
ORIGX2 0.000000 1.000000 0.000000 0.00000 1CRN 65
ORIGX3 0.000000 0.000000 1.000000 0.00000 1CRN 66
SCALE1 .024414 0.000000 -.000328 0.00000 1CRN 67
SCALE2 0.000000 .053619 0.000000 0.00000 1CRN 68
SCALE3 0.000000 0.000000 .044409 0.00000 1CRN 69

PDB records (5)

ATOM one line for each atom with its unique name and its x,y,z coordinates
ATOM      1  N   THR     1      17.047  14.099   3.625  1.00 13.79      1CRN  70
ATOM      2  CA  THR     1      16.967  12.784   4.338  1.00 10.80      1CRN  71
ATOM      3  C   THR     1      15.685  12.755   5.133  1.00  9.19      1CRN  72
ATOM      4  O   THR     1      15.268  13.825   5.594  1.00  9.85      1CRN  73
ATOM      5  CB  THR     1      18.170  12.703   5.337  1.00 13.02      1CRN  74
ATOM      6  OG1 THR     1      19.334  12.829   4.463  1.00 15.06      1CRN  75
ATOM      7  CG2 THR     1      18.150  11.546   6.304  1.00 14.23      1CRN  76
ATOM      8  N   THR     2      15.115  11.555   5.265  1.00  7.81      1CRN  77
ATOM      9  CA  THR     2      13.856  11.469   6.066  1.00  8.31      1CRN  78
ATOM     10  C   THR     2      14.164  10.785   7.379  1.00  5.80      1CRN  79
ATOM     11  O   THR     2      14.993   9.862   7.443  1.00  6.94      1CRN  80

TER record terminates the amino acid chain
ATOM    325  OD1 ASN    46      11.982   4.849  15.886  1.00 11.00      1CRN 394
ATOM    326  ND2 ASN    46      13.407   3.298  15.015  1.00 10.32      1CRN 395
ATOM    327  OXT ASN    46      12.703   4.973  10.746  1.00  7.86      1CRN 396
TER     328      ASN    46                                              1CRN 397
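
In a PDB file the position of a value on the line decides its meaning, so ATOM records are best read by column and not by splitting on whitespace. The sketch below assumes the standard fixed-column layout (serial number in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x, y, z in 31-54, occupancy in 55-60, B-factor in 61-66). The function name parse_atom_line is invented for illustration, and the sample line is assembled piece by piece so that each column range is visible.

```python
def parse_atom_line(line: str) -> dict:
    """Read the main fields of an ATOM/HETATM line by column position."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21].strip(),
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


# First atom of 1CRN, built column by column.
sample = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "           # column 12
    " N  "        # columns 13-16: atom name
    " "           # column 17
    "THR"         # columns 18-20: residue name
    " "           # column 21
    " "           # column 22: chain identifier (blank in 1CRN)
    "   1"        # columns 23-26: residue number
    "    "        # columns 27-30
    "  17.047"    # columns 31-38: x
    "  14.099"    # columns 39-46: y
    "   3.625"    # columns 47-54: z
    "  1.00"      # columns 55-60: occupancy
    " 13.79"      # columns 61-66: B-factor
)

atom = parse_atom_line(sample)
print(atom["name"], atom["res_name"], atom["res_seq"], atom["x"], atom["y"], atom["z"])
```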

PDB records (6)

HETATM atomic coordinate records for atoms within “HET & FORMUL”-lines (metals, cofactors, ions, …) and for water molecules
HETATM 5158  AP  NAD B   1      42.641  30.361  41.284  1.00 26.73      4MDH5495
HETATM 5159  AO1 NAD B   1      43.440  31.570  40.868  1.00 20.69      4MDH5496
HETATM 5160  AO2 NAD B   1      41.161  30.484  41.376  1.00 33.73      4MDH5497

HETATM 5207  O   HOH     0      15.379   1.907   3.295  1.00 58.12      4MDH5544
HETATM 5208  O   HOH     1      58.861   0.984  17.024  1.00 37.58      4MDH5545
HETATM 5209  O   HOH     2      24.384   1.184  74.398  1.00 35.92      4MDH5546
