PP1CS SoSe 17
Protein Prediction I for Computer Scientists
Proteins
May 18th/23rd, Summer Term 2017
Burkhard Rost & Lothar Richter


Lecture and exercise
●  https://www.rostlab.org/teaching/ss17/pp1cs
●  Announcements, slides and videos
●  Lecture Tuesdays (10:00–11:30 am) and Thursdays (10:00–11:30 am)
●  Room MW1801 (Mechanical Engineering)
●  Exercise Thursdays 12:30–14:00, Room Hörsaal 3 (MI 00.06.011, Lecture hall 3), and mostly MW2250 on Tuesdays 13–15
●  Register for the lecture and exam in TUMonline


Exercise

●  Exercise wiki: https://i12r-studfilesrv.informatik.tu-muenchen.de/sose17/pp4cs1/index.php/Main_Page


Exercise – Topics and Schedule

Slot | Thursday | Tuesday | Topic
1 | May 4th | May 9th | Structure of the Exercise / Biological Background
2 | May 11th | May 16th | Biological background
3 | May 18th | May 23rd | Protein structures
4 | Jun 1st | Jun 6th | Resources for Biological Information / Formats
5 | Jun 8th | Jun 13th | Alignments
6 | Jun 22nd | Jun 27th | Machine Learning incl. Tricks / Secondary Structure Prediction
7 | Jun 29th | Jul 4th | Homology Modeling / Prediction of Other Protein Features
8 | Jul 6th | | Wrap Up – Questions

WED Jul 12th: EXAM


Chirality (~ Handedness)


Amino Acids
from https://en.wikipedia.org/wiki/Amino_acid

•  Schema of an α-amino acid


Amino Acids

https://en.wikipedia.org/wiki/File:Amino_Acids.svg



Amino Acids

Essential for humans: phenylalanine, valine, threonine, tryptophan, methionine, leucine, isoleucine, lysine, and histidine
https://en.wikipedia.org/wiki/File:Amino_Acids.svg


Protein Sequence / Primary Structure
●  linear sequence of amino acids
●  oriented from N- to C-terminus
●  (typically) starts with methionine
●  IMPORTANT: consider the different meanings

Coding/Representation | Protein Aspects
1D information: sequence of amino acids as string | Primary structure: amino acid sequence
2D information: 2D array, contact map | Secondary structure: secondary structure elements like helices or sheets, ...
3D information: coordinates or atom couplings | Tertiary structure: spatial arrangement of secondary structure elements (incl. amino acids, atoms, ...)
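The three codings listed above (string, 2D array, coordinates) can be sketched in a few lines of Python. This is a minimal illustration only: the toy sequence, the made-up Cα coordinates, the 8 Å cutoff and the function name `contact_map` are all assumptions, not material from the lecture.

```python
import math

# 1D information: the amino acid sequence as a plain string (hypothetical toy peptide).
sequence = "MKVLA"

# 3D information: one (x, y, z) coordinate per residue, e.g. its C-alpha atom.
# These coordinates are invented for illustration only.
ca_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0),
    (3.8, 3.8, 0.0),
]

def contact_map(coords, cutoff=8.0):
    """2D information: n x n boolean matrix, True where two residues
    lie within `cutoff` angstroms of each other (cutoff is an assumed value)."""
    n = len(coords)
    return [
        [math.dist(coords[i], coords[j]) <= cutoff for j in range(n)]
        for i in range(n)
    ]

cmap = contact_map(ca_coords)
```

Note that the contact map is a 2D coding derived from 3D coordinates; it is not the same thing as secondary structure, which is the point of the "different meanings" warning.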


Secondary Structure

●  local structural elements
●  structural building blocks for higher order structures
●  α-helix, β-sheet, loops
●  stabilized by hydrogen bonds
●  amino acids have preferences for certain secondary structure elements


Tertiary Structure

●  spatial arrangement of all secondary structure elements of a protein
●  alternative arrangements can exist (conformation changes upon substrate binding, or induced fit)
●  can be used to hierarchically organize the known protein structures


Quaternary Structure
●  formation of multi-protein complexes
●  many cellular processes are carried out by multi-protein complexes
●  especially for highly coordinated/regulated actions like:
-  replication
-  transcription
-  translation
●  difficult to determine precisely, sometimes already visible in EM


Hydrogen Bonds are Weak Interactions

●  Hydrogen bonds are weak:
-  Donor: H atom with partial positive charge in a polar bond (O, N)
-  Acceptor: atom with unbound electron pairs (O, N) and typically partial negative charge
-  distance: 160-200 pm
-  1-5 kcal/mol

https://en.wikipedia.org/wiki/Hydrogen_bond#/media/File:Base_pair_GC.svg


Alpha Helix
●  3.6 amino acids per turn, spiral forming
●  4-40 residues (mostly 10)
●  stabilized by hydrogen bonds between backbone atoms:

https://en.wikipedia.org/wiki/Alpha_helix
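As a rough worked example of the "3.6 amino acids per turn" figure: each residue advances the helix by 360° / 3.6 = 100°, which is the basis of helical wheel diagrams. The sketch below assumes a rise of 1.5 Å per residue, a standard textbook value that is not stated on the slide; the function names are illustrative.

```python
RESIDUES_PER_TURN = 3.6      # from the slide
RISE_PER_RESIDUE_A = 1.5     # angstroms; textbook value, an assumption here

def helix_geometry(n_residues):
    """Return (number of turns, approximate length in angstroms) of an ideal alpha helix."""
    turns = n_residues / RESIDUES_PER_TURN
    length = n_residues * RISE_PER_RESIDUE_A
    return turns, length

def wheel_angle(position):
    """Angular position in degrees of the residue at 0-based `position`
    when the helix is viewed along its axis (helical wheel)."""
    return (position * 360.0 / RESIDUES_PER_TURN) % 360.0

turns, length = helix_geometry(10)   # the "mostly 10" residue helix from the slide
```

For the typical 10-residue helix this gives a little under three turns and a length of about 15 Å.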


Beta Strand / Sheet
●  “long range” (in terms of involved residues) hydrogen bonds
●  parallel or antiparallel
●  “flat”

https://en.wikipedia.org/wiki/Beta_sheet


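Loop / Turn / Coil

●  generally: change of direction; turn types: alpha (4), beta (3), gamma (2), delta (1), pi (5), where the number is the count of peptide bonds separating the two end residues
●  omega loop: catch-all term, includes longer stretches, no hydrogen bonding involved
●  connector between better defined secondary structure elements, or at the end of a polypeptide chain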


Random Coil / Disordered Region

●  no clear secondary structure elements identifiable
●  like a statistical distribution of shapes
●  biologically: can be used as an adapter to different target shapes, i.e. one conformation is stabilized upon interaction with a partner


Protein Features

●  surface area
●  hydrophobicity
●  size
●  isoelectric point
●  amino acid composition
●  various functional or structural motifs
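Several of these features can be computed directly from the sequence string. The sketch below derives size, amino acid composition and a mean hydrophobicity. The choice of the Kyte-Doolittle hydropathy scale is an assumption (the slide does not name a scale), and the function names are illustrative.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values (assumed scale; the lecture does not specify one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def composition(seq):
    """Fraction of each amino acid that occurs in the sequence."""
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in sorted(counts)}

def mean_hydropathy(seq):
    """Average Kyte-Doolittle value over all residues (higher = more hydrophobic)."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

toy = "AAGG"                 # hypothetical toy sequence
size = len(toy)              # size in residues
comp = composition(toy)
```

Features such as surface area or structural motifs need 3D coordinates or pattern libraries and are not covered by this sketch.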


Ramachandran Plot

●  Partial double-bond character of the peptide bond
●  ω: 0° or 180°, not freely rotatable
●  φ, ψ: depend on the specific amino acids and the specific context
●  there are typical ranges for helices and sheets

[Figure labels: φ, ψ, ω]
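The "typical ranges" can be illustrated with a very coarse classifier that maps a (φ, ψ) pair to a region of the plot. The rectangular boundaries below are rough assumptions chosen for illustration; real region definitions are contour-based and vary between tools.

```python
def ramachandran_region(phi, psi):
    """Assign a (phi, psi) pair in degrees to a coarse region.
    The box boundaries are illustrative assumptions, not published definitions."""
    if -160 <= phi <= -20 and -120 <= psi <= 50:
        return "right-handed alpha-helix"
    if -180 <= phi <= -45 and (psi >= 90 or psi <= -150):
        return "beta-sheet"
    if 20 <= phi <= 120 and 0 <= psi <= 90:
        return "left-handed alpha-helix"
    return "other"

# Commonly quoted idealized angles for the three regions.
examples = {
    "helix": ramachandran_region(-60, -45),
    "sheet": ramachandran_region(-120, 130),
    "left": ramachandran_region(60, 45),
}
```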


Ramachandran Plot
from https://en.wikipedia.org/wiki/Ramachandran_plot

[Figure: Ramachandran plot; both axes run from -180° to 180°; labeled regions: right-handed α-helix, left-handed α-helix, β-sheet; contours: rigid/fixed radius, relaxed radius]


Classification of Structures: CATH / SCOP
●  came up in the middle of the 1990s
●  both are quite similar
●  aim: organize the protein structures available in the PDB, based on single domains
●  hierarchical system (roughly):
-  secondary structure content
-  fold
-  superfamilies
-  families


SCOP: a Structural Classification of Proteins

●  Murzin, A., Brenner, S.E., Hubbard, T.J.P. and Chothia, C. (1995) J. Mol. Biol., 247, 536-540
●  Hubbard, T.P., Murzin, A., Brenner, S.E. and Chothia, C. (1997), Nucl. Acids Res. 25(1), 236-239 (easier to obtain)
●  fully manually curated, driven by expert analysis
●  associated with the ASTRAL compendium
●  latest news: SCOPe (UC Berkeley), SCOP2 (MRC Lab Mol Biol, Cambridge, UK)


CATH
●  semi-automatic procedure for deriving a novel hierarchical classification of protein domain structures
●  four main levels:
-  C: protein class, mainly secondary structure composition of each domain
-  A: architecture, summarizes shapes based on orientation of secondary structure elements
-  T: topology, sequential connectivity is considered
-  H: homologous superfamily, high similarity with similar functions, evolutionary relationship assumed


some nine highly populated families (‘superfolds’ [1]), with important implications for prediction algorithms, and it illustrated the insights to be gained from ordering the data in this way.

Several other groups have also classified the known structures, focusing on a variety of local and global topological features and employing a range of algorithms (structure comparison algorithms and classification generally are reviewed in [13–16]). The SCOP database, developed by Murzin et al. [17], groups proteins having significant sequence similarity into homologous families, whereas more distant structural similarities are largely identified manually. This database places emphasis on evolutionary relationships and information from the literature relating to well-studied fold families is also incorporated (e.g. the β trefoils [18] and the OB fold [19]). By contrast Holm and Sander, use the structure comparison algorithm DALI to recognise structural neighbours, whether motif or fold based, without formally ordering proteins in the PDB into families [20]. The ENTREZ database of Hogue et al. [21], uses a similar approach to DALI, listing neighbours by a fast vector-based comparison algorithm (VAST).

The task of defining structural relationships is further complicated by the existence of multidomain proteins; more than 30% of non-identical structures in the current PDB contain two or more domains. A number of domain recognition algorithms have appeared recently to address this problem [22–26]. The 3Dee database of Siddiqui and Barton (http://snail.biop.ox.ac.uk:8080/3Dee) separates the constituent folds of multidomain proteins using the DOMAK algorithm. Similarly, Sowdhaminini et al. have constructed a database of single domain families [27], using the domain recognition algorithm DIAL [26] and the structural comparison procedure SEA [28]. Both databases contain data that is generated largely automatically, but is subsequently checked and where appropriate reordered manually.

In recognition of the need to regularly maintain and update data on structural relatives, we have further developed our automatic procedures for identifying and classifying structural families [6] to construct a database of single-domain fold families. Any multidomain proteins are first divided into their constituent domain folds by an automatic consensus procedure which is in agreement between three independent algorithms (SJ et al. unpublished data). As well as clustering proteins by sequence and structure, recognised families are also grouped according to similarity in protein class (i.e. secondary structure composition and contacts). Finally, the architecture (shape, defined by the assembly of secondary structures, regardless of their connectivity) adopted by each protein fold, is assigned manually. Although this is a somewhat subjective process, based largely on commonly used descriptions in the literature (e.g. sandwich, barrel and propellor), it is an essential first step towards ordering the known folds in a useful and practical way.


Figure 1

Annual increase in the numbers of protein domain structures in the PDB (top plot, [11,12]). The lower lines show the numbers of identical families (I-level, 100% sequence identity between structures within the family and 100% overlap), non-identical families (N-level, > 95% sequence identity, 85% overlap), sequence families (S-level, > 35% sequence identity, 60% overlap), homologous superfamilies (H-level, > 25% sequence identity, SSAP > 80 and 60% overlap), and topological or fold families (T-level, SSAP > 70), where SSAP is a structural comparison score.

[Figure: line plot; x-axis: deposition date ('85 to '96); y-axis: number of domains (0 to 7500); series: Domain, Identical, Non-identical, Sequence family, Homologous superfamily, Topology; second panel: domain fold distribution, 1985–95]

from Structure 15, August 1997, 5:1093–1108 http://biomednet.com/elecref/0969212600501093
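The thresholds in the Figure 1 caption can be read as a cascade from the most to the least specific level. The sketch below is only one possible reading of the caption: it assumes the overlap values are minimum requirements, and it is not the actual CATH clustering procedure. The function name and argument names are hypothetical.

```python
def shared_level(seq_identity, overlap, ssap):
    """Return the most specific level (I, N, S, H or T) whose caption thresholds
    a pair of domains satisfies, or None. All inputs are on a 0-100 scale.
    Treating the overlap figures as lower bounds is an assumption."""
    if seq_identity == 100 and overlap == 100:
        return "I"      # identical family
    if seq_identity > 95 and overlap >= 85:
        return "N"      # non-identical family
    if seq_identity > 35 and overlap >= 60:
        return "S"      # sequence family
    if seq_identity > 25 and ssap > 80 and overlap >= 60:
        return "H"      # homologous superfamily
    if ssap > 70:
        return "T"      # topological or fold family
    return None
```

For example, a pair with 30% sequence identity, 65% overlap and an SSAP score of 82 would fall into the same homologous superfamily but not the same sequence family.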


Numbering Scheme

●  C: 1, 2, 3 (alpha, beta, alpha/beta) +1
●  A: same architecture, different topology (31) +10
●  T: Topology (connection of secondary structure elements) +10 (505)
●  H: Homology (families) +10 (645)
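A CATH classification is usually written as four dot-separated numbers, one per level (C.A.T.H). A minimal sketch for splitting such a code and comparing two domains is shown below. The dotted notation and the example codes are assumptions for illustration, since the slide only lists the levels; the class names are the three given on the slide.

```python
from typing import NamedTuple

CLASS_NAMES = {1: "alpha", 2: "beta", 3: "alpha/beta"}   # as listed on the slide

class CathCode(NamedTuple):
    c: int   # class
    a: int   # architecture
    t: int   # topology
    h: int   # homologous superfamily

def parse_cath(code):
    """Split a dotted 'C.A.T.H' string into its four integer levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four levels, got {code!r}")
    return CathCode(*(int(p) for p in parts))

def shared_depth(x, y):
    """Number of leading levels two codes have in common (0 to 4)."""
    depth = 0
    for u, v in zip(x, y):
        if u != v:
            break
        depth += 1
    return depth

first = parse_cath("3.40.50.300")    # example codes, chosen for illustration
second = parse_cath("3.40.50.720")
```

Here the two example domains share class, architecture and topology (depth 3) but belong to different homologous superfamilies.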


Pfam
●  in version 29.0 (December 2015): 16295 families in 559 clans
●  hosted by the EBI
●  Citation: “The Pfam protein families database: towards a more sustainable future”, Nucl. Acids Res. (04 January 2016) 44(D1): D279-D285. doi:10.1093/nar/gkv1344


Pfam
●  Pfam-A: curated seed alignment derived from Pfamseq (UniProtKB based), profile HMMs for the seed alignment, full alignment with all HMM-detected sequences
●  Pfam-B: un-annotated, automatically generated from non-redundant clusters from ADDA
●  focuses on single domains


Terms

●  Family: collection of related protein regions
●  Domain: structural unit
●  Repeat: short unit which is unstable in isolation but forms a stable structure when found in multiple copies
●  Motif: short unit found outside globular domains
●  Clans: related group of Pfam entries based on similarity in sequence, structure or profile HMM


Pfam Numbers (rel. 27)

●  14381 Pfam-A families
●  4563 are classified into 515 clans
●  the Pfam-A release matches 79.9% of the 23.2 million sequences in the corresponding UniProt db
●  coverage of 90.5% of SwissProt human
●  use of jackhmmer (from HMMER3 package)
●  consider CATH and PDB


Protein Data Bank (PDB)

●  http://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/introduction
●  collection of high resolution protein structures
●  X-ray crystallography, NMR, cryo-EM
●  be aware of different quality of the data
●  slow growing


Root Mean Square Deviation

●  measure to determine similarity between 3D structures
●  1) Superimpose
●  1a) Align sequences to “guess” corresponding residues
●  2) calculate the distances (mostly Cα)
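Step 2 can be sketched as follows, using the usual definition RMSD = sqrt((1/N) · Σ d_i²), where d_i is the distance between the i-th pair of corresponding atoms. The sketch assumes the two structures are already superimposed and that residues are already paired (steps 1 and 1a are not implemented here); the function name and coordinates are illustrative.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z)
    points, e.g. C-alpha atoms. Assumes the structures are already superimposed
    and that position i in one list corresponds to position i in the other."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Made-up coordinates: each atom in b is displaced by exactly 1 angstrom.
a = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
b = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
value = rmsd(a, b)
```

Without the superposition step the value also reflects the arbitrary placement of the two structures in space, which is why step 1 comes first.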


Protein Function

●  Gene Ontology (GO)
●  EC System


Gene Ontology

●  controlled vocabulary
●  hierarchically structured
●  nearly a DAG
●  three main sections:
-  cellular component
-  molecular function
-  biological process
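What distinguishes a DAG from a tree is that a term may have more than one parent. A minimal sketch with a hypothetical toy ontology (the term names are placeholders, not real GO identifiers) shows how all ancestors of a term can be collected:

```python
# Hypothetical toy ontology: child term -> list of parent terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],   # two parents: allowed in a DAG, impossible in a tree
    "D": ["C"],
}

def ancestors(term, parents=PARENTS):
    """Return the set of all terms reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

d_ancestors = ancestors("D")
```

Annotating a protein with a term implicitly annotates it with every ancestor of that term, which is why this traversal matters in practice.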


Gene Ontology
●  id: GO:0000016
name: lactase activity
namespace: molecular_function
def: "Catalysis of the reaction: lactose + H2O = D-glucose + D-galactose." [EC:3.2.1.108]
synonym: "lactase-phlorizin hydrolase activity" BROAD [EC:3.2.1.108]
synonym: "lactose galactohydrolase activity" EXACT [EC:3.2.1.108]
xref: EC:3.2.1.108
xref: MetaCyc:LACTASE-RXN
xref: Reactome:20536
is_a: GO:0004553 ! hydrolase activity, hydrolyzing O-glycosyl compounds
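An entry like the one above consists of "tag: value" lines, where some tags (xref, synonym) may repeat. A minimal sketch for reading such an entry into a dictionary is given below. It handles only this simple line layout and is not a full parser for the ontology file format; the function name is illustrative, and the entry is shortened to a few of the lines shown above.

```python
from collections import defaultdict

STANZA = """\
id: GO:0000016
name: lactase activity
namespace: molecular_function
xref: EC:3.2.1.108
xref: MetaCyc:LACTASE-RXN
is_a: GO:0004553 ! hydrolase activity, hydrolyzing O-glycosyl compounds
"""

def parse_stanza(text):
    """Collect 'tag: value' lines into a dict mapping each tag to a list of values."""
    record = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, _, value = line.partition(": ")
        record[tag].append(value)
    return dict(record)

term = parse_stanza(STANZA)
# The text after '!' in an is_a line is a human-readable comment; keep only the ID.
parent_ids = [value.split(" ! ")[0] for value in term["is_a"]]
```

The `is_a` links collected this way are the edges of the hierarchy described on the previous slide.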


Enzyme Commission Number (EC)

●  Numerical classification of enzymes
●  Based on the chemical reactions they catalyse
●  recommends an enzyme name
●  does not imply any (phylogenetic) relation between enzymes of the same name (no homology)

https://en.wikipedia.org/wiki/Enzyme_Commission_number


Group | Reaction catalyzed | Typical reaction | Enzyme example(s) with trivial names
EC 1 Oxidoreductases | To catalyze oxidation/reduction reactions; transfer of H and O atoms or electrons from one substance to another | AH + B → A + BH (reduced); A + O → AO (oxidized) | Dehydrogenase, oxidase
EC 2 Transferases | Transfer of a functional group from one substance to another. The group may be methyl-, acyl-, amino- or phosphate group | AB + C → A + BC | Transaminase, kinase
EC 3 Hydrolases | Formation of two products from a substrate by hydrolysis | AB + H2O → AOH + BH | Lipase, amylase, peptidase


Group | Reaction catalyzed | Typical reaction | Enzyme example(s) with trivial names
EC 4 Lyases | Non-hydrolytic addition or removal of groups from substrates. C-C, C-N, C-O or C-S bonds may be cleaved | RCOCOOH → RCOH + CO2 or [X-A+B-Y] → [A=B + X-Y] | Decarboxylase
EC 5 Isomerases | Intramolecular rearrangement, i.e. isomerization changes within a single molecule | ABC → BCA | Isomerase, mutase
EC 6 Ligases | Join together two molecules by synthesis of new C-O, C-S, C-N or C-C bonds with simultaneous breakdown of ATP | X + Y + ATP → XY + ADP + Pi | Synthetase


EC Example
●  tripeptide aminopeptidases have the code "EC 3.4.11.4"
●  EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule)
●  EC 3.4 are hydrolases that act on peptide bonds
●  EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a polypeptide
●  EC 3.4.11.4 are those that cleave off the amino-terminal end from a tripeptide
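The example above walks down the hierarchy one level at a time. A minimal sketch that produces the same sequence of prefixes from an EC number and looks up the top-level class is shown below; the function names are illustrative, and the class table contains only the six groups listed on the slides.

```python
# Top-level EC groups as listed in the tables above.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}

def ec_prefixes(ec_number):
    """'3.4.11.4' -> ['3', '3.4', '3.4.11', '3.4.11.4'] (most general first)."""
    parts = ec_number.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def ec_class(ec_number):
    """Name of the top-level group an EC number belongs to."""
    return EC_CLASSES[int(ec_number.split(".")[0])]

levels = ec_prefixes("3.4.11.4")
```

Two enzymes sharing a long prefix catalyse similar reactions, but as noted on the EC slide this does not imply that they are homologous.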