PP1CS SoSe 17
Protein Prediction I for Computer Scientists
Proteins, May 18th/23rd, Summer Term 2017
Burkhard Rost & Lothar Richter

Lecture and exercise
● https://www.rostlab.org/teaching/ss17/pp1cs
● Announcements, slides and videos
● Lecture Tuesdays (10:00–11:30 am) and Thursdays (10:00–11:30 am)
● Room MW 1801 (Mechanical Engineering)
● Exercise Thursdays 12:30–14:00, Room Hörsaal 3 (MI 00.06.011, Lecture hall 3), and mostly MW 2250 on Tuesdays 13:00–15:00
● Register for the lecture and exam in TUMonline

Exercise
● Exercise wiki: https://i12r-studfilesrv.informatik.tu-muenchen.de/sose17/pp4cs1/index.php/Main_Page

Exercise – Topics and Schedule
Slot | Thursday | Tuesday | Topic
1 | May 4th | May 9th | Structure of the Exercise / Biological Background
2 | May 11th | May 16th | Biological Background
3 | May 18th | May 23rd | Protein Structures
4 | Jun 1st | Jun 6th | Resources for Biological Information / Formats
5 | Jun 8th | Jun 13th | Alignments
6 | Jun 22nd | Jun 27th | Machine Learning incl. Tricks / Secondary Structure Prediction
7 | Jun 29th | Jul 4th | Homology Modeling / Prediction of Other Protein Features
8 | Jul 6th | | Wrap Up – Questions
WED Jul 12th: EXAM

Chirality (~ Handedness)

Amino Acids (from https://en.wikipedia.org/wiki/Amino_acid)
● Schema of an α-amino acid

Amino Acids
https://en.wikipedia.org/wiki/File:Amino_Acids.svg

Amino Acids
● Essential for humans: phenylalanine, valine, threonine, tryptophan, methionine, leucine, isoleucine, lysine, and histidine
https://en.wikipedia.org/wiki/File:Amino_Acids.svg

Protein Sequence / Primary Structure
● linear sequence of amino acids
● oriented from N- to C-terminus
● (typically) starts with methionine
● IMPORTANT: consider the different meanings
Coding/Representation | Protein Aspects
1D information: sequence of amino acids as a string | Primary structure: amino acid sequence
2D information: 2D array, contact map | Secondary structure: secondary structure elements like helices or sheets, ...
3D information: coordinates or atom couplings | Tertiary structure: spatial arrangement of secondary structure elements (incl. amino acids, atoms, ...)
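
To make the 1D / 2D / 3D distinction concrete, here is a minimal sketch that derives a 2D contact map from 3D coordinates. The 8 Å cutoff, the toy coordinates and the function name `contact_map` are assumptions chosen for illustration; none of them come from the slides.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Return an n x n matrix with 1 where two residues are within `cutoff`.

    coords: one (x, y, z) tuple per residue, e.g. C-alpha positions in Angstrom.
    cutoff: distance threshold in Angstrom (8.0 is an assumed, commonly used value).
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = 1
    return cmap

# 1D information: the sequence as a string (toy example)
sequence = "MKVL"
# 3D information: one coordinate per residue (toy values on a straight line)
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
# 2D information: the contact map derived from the coordinates
cmap = contact_map(coords)
for row in cmap:
    print(row)
```

The map is symmetric and has ones on the diagonal, because every residue is in contact with itself.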

Secondary Structure
● local structural elements
● structural building blocks for higher-order structures
● α-helix, β-sheet, loops
● stabilized by hydrogen bonds
● amino acids have preferences for certain secondary structure elements

Tertiary Structure
● spatial arrangement of all secondary structure elements of a protein
● alternative arrangements can exist (conformation changes upon substrate binding, or induced fit)
● can be used to organize known protein structures hierarchically

Quaternary Structure
● formation of multi-protein complexes
● many cellular processes are carried out by multi-protein complexes
● especially highly coordinated/regulated actions such as:
  - replication
  - transcription
  - translation
● difficult to determine precisely, sometimes already visible in EM

Hydrogen Bonds are Weak Interactions
● Hydrogen bonds are weak:
  - Donor: H atom with partial positive charge in a polar bond (O, N)
  - Acceptor: atom with lone electron pairs (O, N) and typically a partial negative charge
  - distance: 160–200 pm
  - energy: 1–5 kcal/mol
https://en.wikipedia.org/wiki/Hydrogen_bond#/media/File:Base_pair_GC.svg
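
The numbers above can be turned into a small check, sketched here under assumptions. Structure files usually give distances in Ångström, so the sketch converts to pm before comparing against the quoted range. It treats the slide's distance as the H···acceptor distance (an interpretation, not stated on the slide), and the helper name `within_hbond_distance` is hypothetical.

```python
KJ_PER_KCAL = 4.184        # thermochemical calorie
PM_PER_ANGSTROM = 100.0    # 1 Angstrom = 100 pm

HBOND_DISTANCE_PM = (160.0, 200.0)   # range quoted on the slide
HBOND_ENERGY_KCAL = (1.0, 5.0)       # range quoted on the slide, kcal/mol

def within_hbond_distance(distance_angstrom):
    """True if a distance (in Angstrom) falls inside the slide's 160-200 pm range."""
    distance_pm = distance_angstrom * PM_PER_ANGSTROM
    low, high = HBOND_DISTANCE_PM
    return low <= distance_pm <= high

# The same energy range expressed in kJ/mol
energy_kj = tuple(e * KJ_PER_KCAL for e in HBOND_ENERGY_KCAL)

print(within_hbond_distance(1.8))
print(within_hbond_distance(3.0))
print(energy_kj)
```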

Alpha Helix
● 3.6 amino acids per turn, spiral forming
● 4–40 residues (mostly 10)
● stabilized by hydrogen bonds between backbone atoms
https://en.wikipedia.org/wiki/Alpha_helix
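
A small sketch of what "3.6 amino acids per turn" implies: each residue is rotated by 360° / 3.6 = 100° relative to its predecessor. The second function lists backbone hydrogen-bond partners using the textbook i → i+4 pattern of the α-helix. That pattern is standard knowledge rather than something stated on the slide, and both function names are made up for illustration.

```python
RESIDUES_PER_TURN = 3.6  # value from the slide

def helical_wheel_angles(n_residues):
    """Rotation angle (degrees, 0-360) of each residue around the helix axis."""
    step = 360.0 / RESIDUES_PER_TURN  # 100 degrees per residue
    return [(i * step) % 360.0 for i in range(n_residues)]

def backbone_hbond_pairs(n_residues, offset=4):
    """Index pairs (i, i + offset) for the assumed i -> i+4 hydrogen-bond pattern."""
    return [(i, i + offset) for i in range(n_residues - offset)]

print(helical_wheel_angles(5))
print(backbone_hbond_pairs(6))
```

Residues whose wheel angles are close together lie on the same face of the helix, which is why such angles are often used to spot amphipathic helices.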

Beta Strand / Sheet
● “long range” (in terms of involved residues) hydrogen bonds
● parallel or antiparallel
● “flat”
https://en.wikipedia.org/wiki/Beta_sheet

Loop / Turn / Coil
● generally: change of direction; turn types (number of peptide bonds between the end residues): alpha (4), beta (3), gamma (2), delta (1), pi (5)
● omega-loop: catch-all term, includes longer stretches, no hydrogen bonding involved
● connector between better defined secondary structure elements, or at the end of a polypeptide chain

Random Coil / Disordered Region
● no clear secondary structure elements identifiable
● behaves like a statistical distribution of shapes
● biologically: can be used as an adapter to different target shapes, i.e. one conformation is stabilized upon interaction with a partner

Protein Features
● surface area
● hydrophobicity
● size
● iso-electric point
● amino acid composition
● various functional or structural motifs
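
Two of the features listed above, size and amino acid composition, can be computed directly from the sequence string. The following is a minimal sketch; the function name and the toy sequence are illustrative assumptions.

```python
from collections import Counter

def amino_acid_composition(sequence):
    """Relative frequency of each amino acid (one-letter code) in a sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence.upper())
    total = len(sequence)
    return {aa: n / total for aa, n in sorted(counts.items())}

toy_sequence = "MKKA"                     # made-up example sequence
size = len(toy_sequence)                  # size in residues
composition = amino_acid_composition(toy_sequence)
print(size, composition)
```

Features such as hydrophobicity or the iso-electric point would need per-residue lookup tables on top of this, which are left out here.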

Ramachandran Plot
● partial double-bond character of the peptide bond
● ω: 0° or 180°, not freely rotatable
● φ, ψ: depend on the specific amino acids and the specific context
● there are typical ranges for helices and sheets
[Figure: peptide backbone with the dihedral angles φ, ψ and ω marked]

Ramachandran Plot (from https://en.wikipedia.org/wiki/Ramachandran_plot)
[Figure: Ramachandran plot, both axes from -180° to 180°; labelled regions: right-handed α-helix, left-handed α-helix, β-sheet; contours for rigid/fixed radius and relaxed radius]
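
Each point in a Ramachandran plot is a pair of backbone dihedral angles. As a sketch of how such an angle is obtained from coordinates, the function below computes the dihedral defined by four points using the common atan2 formulation. By the usual definitions φ uses the atoms C(i-1), N(i), Cα(i), C(i) and ψ uses N(i), Cα(i), C(i), N(i+1). The helper names and the toy coordinates are assumptions for illustration.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) around the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)   # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)   # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))

# Toy points (not real atom positions): cis arrangement, then a quarter turn
cis = dihedral_degrees((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0))
quarter = dihedral_degrees((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
print(cis, quarter)
```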

Classification of Structures: CATH / SCOP
● both emerged in the mid-1990s
● both are quite similar
● aim: organize the protein structures available in PDB, based on single domains
● hierarchical system (roughly):
  - secondary structure content
  - fold
  - superfamilies
  - families

SCOP: a Structural Classification of Proteins
● Murzin, A., Brenner, S.E., Hubbard, T.J.P. and Chothia, C. (1995) J. Mol. Biol., 247, 536–540
● Hubbard, T.J.P., Murzin, A., Brenner, S.E. and Chothia, C. (1997) Nucl. Acids Res. 25(1), 236–239 (easier to obtain)
● fully manually curated, driven by expert analysis
● associated with the ASTRAL compendium
● latest news: SCOPe (UC Berkeley), SCOP2 (MRC Lab Mol Biol, Cambridge, UK)

CATH
● semi-automatic procedure for deriving a novel hierarchical classification of protein domain structures
● four main levels:
  - C: protein class, mainly secondary structure composition of each domain
  - A: architecture, summarizes shapes based on orientation of secondary structure elements
  - T: topology, sequential connectivity is considered
  - H: homologous superfamily, high similarity with similar functions, evolutionary relationship assumed

some nine highly populated families (‘superfolds’ [1]), with important implications for prediction algorithms, and it illustrated the insights to be gained from ordering the data in this way.
Several other groups have also classified the known structures, focusing on a variety of local and global topological features and employing a range of algorithms (structure comparison algorithms and classification generally are reviewed in [13–16]). The SCOP database, developed by Murzin et al. [17], groups proteins having significant sequence similarity into homologous families, whereas more distant structural similarities are largely identified manually. This database places emphasis on evolutionary relationships and information from the literature relating to well-studied fold families is also incorporated (e.g. the β trefoils [18] and the OB fold [19]). By contrast Holm and Sander, use the structure comparison algorithm DALI to recognise structural neighbours, whether motif or fold based, without formally ordering proteins in the PDB into families [20]. The ENTREZ database of Hogue et al. [21], uses a similar approach to DALI, listing neighbours by a fast vector-based comparison algorithm (VAST).
The task of defining structural relationships is further complicated by the existence of multidomain proteins; more than 30% of non-identical structures in the current PDB contain two or more domains. A number of domain recognition algorithms have appeared recently to address this problem [22–26]. The 3Dee database of Siddiqui and Barton (http://snail.biop.ox.ac.uk:8080/3Dee) separates the constituent folds of multidomain proteins using the DOMAK algorithm. Similarly, Sowdhaminini et al. have constructed a database of single domain families [27], using the domain recognition algorithm DIAL [26] and the structural comparison procedure SEA [28]. Both databases contain data that is generated largely automatically, but is subsequently checked and where appropriate reordered manually.
In recognition of the need to regularly maintain and update data on structural relatives, we have further developed our automatic procedures for identifying and classifying structural families [6] to construct a database of single-domain fold families. Any multidomain proteins are first divided into their constituent domain folds by an automatic consensus procedure which is in agreement between three independent algorithms (SJ et al. unpublished data). As well as clustering proteins by sequence and structure, recognised families are also grouped according to similarity in protein class (i.e. secondary structure composition and contacts). Finally, the architecture (shape, defined by the assembly of secondary structures, regardless of their connectivity) adopted by each protein fold, is assigned manually. Although this is a somewhat subjective process, based largely on commonly used descriptions in the literature (e.g. sandwich, barrel and propellor), it is an essential first step towards ordering the known folds in a useful and practical way.
1094 Structure 1997, Vol 5 No 8
Figure 1
Annual increase in the numbers of protein domain structures in the PDB (top plot, [11,12]). The lower lines show the numbers of identical families (I-level, 100% sequence identity between structures within the family and 100% overlap), non-identical families (N-level, > 95% sequence identity, 85% overlap), sequence families (S-level, > 35% sequence identity, 60% overlap), homologous superfamilies (H-level, > 25% sequence identity, SSAP > 80 and 60% overlap), and topological or fold families (T-level, SSAP > 70), where SSAP is a structural comparison score.
[Plot: y-axis "Number of domains" (0 to 7500); x-axis "Deposition date" ('85 to '96); series: Domain, Identical, Non-identical, Sequence family, Homologous superfamily, Topology; further labels: "1985–95", "Domain fold distribution"]
from Structure 15, August 1997, 5:1093–1108 http://biomednet.com/elecref/0969212600501093

Numbering Scheme
● C: 1, 2, 3 (alpha, beta, alpha/beta) +1
● A: same architecture, different topology (31) +10
● T: Topology (connection of secondary structure elements) +10 (505)
● H: Homology (families) +10 (645)
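
A CATH classification is usually written as four dot-separated numbers, one per level (C.A.T.H). The sketch below parses such a code and reports the deepest level two domains share. The example codes are used purely as illustrative strings, and the helper names are hypothetical. Only the three class numbers listed on the slide are mapped to names.

```python
from typing import NamedTuple, Optional

# Class numbers as listed on the slide
CLASS_NAMES = {1: "alpha", 2: "beta", 3: "alpha/beta"}

class CathCode(NamedTuple):
    c: int  # class
    a: int  # architecture
    t: int  # topology
    h: int  # homologous superfamily

def parse_cath_code(code: str) -> CathCode:
    """Split a 'C.A.T.H' string into its four integer levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four levels, got {code!r}")
    return CathCode(*(int(p) for p in parts))

def deepest_shared_level(code_a: str, code_b: str) -> Optional[str]:
    """Return 'C', 'A', 'T' or 'H' for the deepest level both codes share."""
    shared = None
    for letter, x, y in zip("CATH", parse_cath_code(code_a), parse_cath_code(code_b)):
        if x != y:
            break
        shared = letter
    return shared

code = parse_cath_code("3.40.50.300")
print(code, CLASS_NAMES.get(code.c))
print(deepest_shared_level("3.40.50.300", "3.40.50.720"))
```

Because the hierarchy is encoded as a prefix, two domains with the same first three numbers share class, architecture and topology but belong to different homologous superfamilies.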

Pfam
● In version 29.0, December 2015: 16295 families in 559 clans
● hosted by the EBI
● Citation: “The Pfam protein families database: towards a more sustainable future”, Nucl. Acids Res. (04 January 2016) 44 (D1): D279–D285. doi:10.1093/nar/gkv1344

Pfam
● Pfam-A: curated seed alignment derived from Pfamseq (UniProtKB based), profile HMMs for the seed alignment, full alignment with all HMM-detected sequences
● Pfam-B: un-annotated, automatically generated from non-redundant clusters from ADDA
● focuses on single domains

Terms
● Family: collection of related protein regions
● Domain: structural unit
● Repeat: short unit which is unstable in isolation but forms a stable structure when found in multiple copies
● Motif: short unit found outside globular domains
● Clan: related group of Pfam entries based on similarity in sequence, structure or profile HMM

Pfam Numbers (rel. 27)
● 14381 Pfam-A families
● 4563 are classified into 515 clans
● the Pfam-A release matches 79.9% of the 23.2 million sequences in the corresponding UniProt db
● coverage of 90.5% of Swiss-Prot human
● use of jackhmmer (from the HMMER3 package)
● consider CATH and PDB

Protein Data Bank (PDB)
● http://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/introduction
● collection of high-resolution protein structures
● X-ray crystallography, NMR, cryo-EM
● be aware of the different quality of the data
● slowly growing

Root Mean Square Deviation
● measure to determine similarity between 3D structures
● 1) Superimpose
● 1a) Align sequences to “guess” corresponding residues
● 2) Calculate the distances (mostly Cα)
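
Step 2 can be sketched as follows: given two lists of paired, already superimposed Cα coordinates, the RMSD is the square root of the mean squared distance between corresponding atoms. The sketch assumes that steps 1 and 1a have been done elsewhere (no rotation or alignment is performed here), and the coordinates are made-up toy values.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of paired (x, y, z) coordinates.

    Assumes the structures are already superimposed and that position k in
    coords_a corresponds to position k in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

structure_a = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
structure_b = [(3.0, 4.0, 0.0), (3.0, 4.0, 0.0)]   # every atom displaced by 5 units
print(rmsd(structure_a, structure_b))
```

Because the deviations are squared before averaging, a few badly matching residues can dominate the value, which is one reason the superposition step matters.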

Protein Function
● Gene Ontology (GO)
● EC System

Gene Ontology
● controlled vocabulary
● hierarchically structured
● nearly a DAG
● three main sections:
  - cellular component
  - molecular function
  - biological process

Gene Ontology
● id: GO:0000016
name: lactase activity
namespace: molecular_function
def: "Catalysis of the reaction: lactose + H2O = D-glucose + D-galactose." [EC:3.2.1.108]
synonym: "lactase-phlorizin hydrolase activity" BROAD [EC:3.2.1.108]
synonym: "lactose galactohydrolase activity" EXACT [EC:3.2.1.108]
xref: EC:3.2.1.108
xref: MetaCyc:LACTASE-RXN
xref: Reactome:20536
is_a: GO:0004553 ! hydrolase activity, hydrolyzing O-glycosyl compounds
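
The entry above is a `tag: value` stanza, so it can be read with a few lines of code. The following is a minimal sketch, not a full parser for the ontology file format: it handles only what is needed for this stanza (repeated tags and the `!` comment after `is_a`), and the function name is hypothetical.

```python
from collections import defaultdict

STANZA = """\
id: GO:0000016
name: lactase activity
namespace: molecular_function
xref: EC:3.2.1.108
xref: MetaCyc:LACTASE-RXN
xref: Reactome:20536
is_a: GO:0004553 ! hydrolase activity, hydrolyzing O-glycosyl compounds
"""

def parse_stanza(text):
    """Parse 'tag: value' lines into a dict mapping each tag to a list of values."""
    term = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, _, value = line.partition(": ")
        if tag == "is_a":
            # Drop the human-readable comment after ' ! '
            value = value.split(" ! ")[0]
        term[tag].append(value.strip())
    return dict(term)

term = parse_stanza(STANZA)
print(term["id"], term["name"], term["is_a"])
```

Collecting the `is_a` values of many terms yields the parent links of the hierarchy, which is what makes the ontology (nearly) a DAG.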

Enzyme Commission Number (EC)
● Numerical classification of enzymes
● Based on the chemical reactions they catalyse
● recommends an enzyme name
● does not imply any (phylogenetic) relation between enzymes of the same name (no homology)
https://en.wikipedia.org/wiki/Enzyme_Commission_number

Group | Reaction catalyzed | Typical reaction | Enzyme example(s) with trivial names
EC 1 Oxidoreductases | To catalyze oxidation/reduction reactions; transfer of H and O atoms or electrons from one substance to another | AH + B → A + BH (reduced); A + O → AO (oxidized) | Dehydrogenase, oxidase
EC 2 Transferases | Transfer of a functional group from one substance to another. The group may be a methyl, acyl, amino or phosphate group | AB + C → A + BC | Transaminase, kinase
EC 3 Hydrolases | Formation of two products from a substrate by hydrolysis | AB + H2O → AOH + BH | Lipase, amylase, peptidase

Group | Reaction catalyzed | Typical reaction | Enzyme example(s) with trivial names
EC 4 Lyases | Non-hydrolytic addition or removal of groups from substrates. C-C, C-N, C-O or C-S bonds may be cleaved | RCOCOOH → RCOH + CO2 or [X-A+B-Y] → [A=B + X-Y] | Decarboxylase
EC 5 Isomerases | Intramolecular rearrangement, i.e. isomerization changes within a single molecule | ABC → BCA | Isomerase, mutase
EC 6 Ligases | Join together two molecules by synthesis of new C-O, C-S, C-N or C-C bonds with simultaneous breakdown of ATP | X + Y + ATP → XY + ADP + Pi | Synthetase

EC Example
● tripeptide aminopeptidases have the code “EC 3.4.11.4”
● EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule)
● EC 3.4 are hydrolases that act on peptide bonds
● EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a polypeptide
● EC 3.4.11.4 are those that cleave off the amino-terminal end from a tripeptide
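
The example reads the EC number from left to right, each additional field narrowing the reaction class. This can be sketched as a prefix expansion. The class names are taken from the table on the preceding slides; the function names are hypothetical.

```python
# Top-level classes as listed in the table on the preceding slides
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}

def _fields(ec_number):
    """Strip an optional 'EC' prefix and split the number into its fields."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    return text.split(".")

def ec_hierarchy(ec_number):
    """Return all prefixes of an EC number, from most general to most specific."""
    fields = _fields(ec_number)
    return ["EC " + ".".join(fields[:i]) for i in range(1, len(fields) + 1)]

def top_level_class(ec_number):
    """Name of the top-level class, or None if it is not in EC_CLASSES."""
    return EC_CLASSES.get(int(_fields(ec_number)[0]))

print(ec_hierarchy("EC 3.4.11.4"))
print(top_level_class("EC 3.4.11.4"))
```

Two enzymes that share a long prefix catalyse similar reactions, but, as noted on the EC slide, this says nothing about whether they are homologous.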