TRANSCRIPT
PP1CS SoSe 17
Protein Prediction I for Computer Scientists
Machine Learning & Secondary Structure Prediction
June 22nd/27th, Summer Term 2017, Burkhard Rost & Lothar Richter
Lecture and exercise
● https://www.rostlab.org/teaching/ss17/pp1cs
● Announcements, slides and videos
● Lecture Tuesdays (10:00-11:30 am) and Thursdays (10:00-11:30 am)
● Room MW1801 (Mechanical Engineering)
● Exercise Thursdays 12:30-14:00, Room Hörsaal 3 (MI 00.06.011, Lecture hall 3), and mostly MW2250 on Tuesdays 13-15
● Register for the lecture and exam in TUMonline
Exercise
● Exercise wiki: https://i12r-studfilesrv.informatik.tu-muenchen.de/sose17/pp4cs1/index.php/Main_Page
Exercise – Topics and Schedule

Slot | Thursday | Tuesday  | Topic
1    | May 4th  | May 9th  | Structure of the Exercise / Biological Background
2    | May 11th | May 16th | Biological background
3    | May 18th | May 23rd | Protein structures
4    | Jun 1st  | Jun 13th | Alignments
5    | Jun 8th  | Jun 20th | Resources for Biological Information / Formats
6    | Jun 22nd | Jun 27th | Machine Learning incl. Tricks / Secondary Structure Prediction
7    | Jun 29th | Jul 4th  | Homology Modeling / Prediction of Other Protein Features
8    | Jul 6th  |          | Wrap Up – Questions

EXAM: WED Jul 12th
Ideas
● a machine learning device can generalize from real-world observations into a "formal" model
● each model reflects only a few aspects of reality
● no model can completely represent reality, i.e. a photograph of a dog remains a photograph and not a real dog
● the model should reflect a concept or commonalities and not individual characteristics
Inductive Bias
● every learning scheme discards some aspects of reality to construct a model
● this may differ between different learning schemes
● this might also already happen at the level of feature extraction, i.e. choosing the types and values to represent an observation
● this is not to be mistaken for the predictive bias
Some vocabulary
● learning scheme: a specific learning algorithm producing a model, like decision trees, rule-based systems, SVMs, Bayesian networks, etc.
● attribute/feature: a variable describing a specific aspect of real-world observations, like body weight, color, or whether a certain property is present (yes/no)
● instance: a single observation describing an observed event by assigning a value for each feature used to represent this observation
Some vocabulary II
● training: the phase of analyzing real-world observations in a formalized representation to derive parameters and/or internal structure
● test phase: the phase of model application that determines the reliability of statements (predictions) on instances not used for training
● label: an attribute selected to be predicted
Types of Learning
● depending on the presence of a label, we distinguish between supervised and unsupervised learning
● unsupervised learning: concept learning, frequent itemsets, clustering
● supervised learning: everything with labeled data which allows making a prediction
Data Preprocessing
Before you can start to build a model, in most cases the data are subjected to various preprocessing steps:
● Feature Extraction
● Discretization
● Feature Selection
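As a concrete illustration of the discretization step, equal-width binning maps a continuous attribute onto a small number of discrete values. This is a minimal sketch (the bin count is a free choice, not prescribed by the slides):

```python
# Equal-width discretization: map a continuous feature onto n_bins
# integer-valued bins of equal width over the observed range.
def discretize(values, n_bins=3):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # avoid zero width for constant features
    # clamp the maximum value into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

In practice, bin boundaries can also be chosen by frequency (equal-depth binning) or, in a supervised setting, by class-based measures.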
Feature Extraction / Construction
● conversion of observation records into a formalized, computer-readable representation
● definition of an attribute type
● assignment of appropriate attribute values
● this implies a strong involvement of the analyst
● important: common sense, background knowledge from expert domains
Feature Selection
● remove values from instances, i.e. discard some features of a data set because these are:
- irrelevant
- redundant
- noisy/faulty
● possible benefits:
- improve efficiency and accuracy
- prevent overfitting
- save space
Feature Selection Strategies
● unsupervised: based on domain knowledge, random sampling
● supervised:
- measures that consider the class (filtering): Gini index, information gain, Relief, ...
- use a learning scheme's performance (wrapping):
  ● select the set of attributes which leads to the best performance
  ● forward selection: increase the set of attributes one by one
  ● backward elimination: decrease the set of attributes one by one
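The wrapper-based forward selection above can be sketched in a few lines. This is a minimal illustration, not a specific tool: `evaluate` is a hypothetical callback that scores a feature subset, e.g. by a learning scheme's cross-validated performance.

```python
# Forward selection (wrapping): grow the feature set one attribute at a
# time, always keeping the addition that most improves performance, and
# stop when no candidate improves on the current best score.
def forward_selection(all_features, evaluate):
    selected, best_score = [], float("-inf")
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        # score each candidate added to the current set
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        score, feat = max(scored)
        if score <= best_score:   # no improvement: stop
            break
        selected.append(feat)
        best_score = score
    return selected
```

Backward elimination is the mirror image: start from the full set and drop one attribute per round as long as performance does not degrade.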
Machine Learning and Bioinformatics
● today biology has to span between two extremes:
- statements on the nucleotide level (one level below genes) on the one hand
- statements on the individual/population level on the other hand
● the gain in speed at which sequence data (nucleotide sequences) are generated has clearly outpaced the speed of analysis and knowledge discovery
● current lab technology cannot even fill the gap between sequence and structure
Role of DM/ML
Data Mining helps to:
● structure and compress the data
● filter out mistakes and outliers caused by experimental errors and other noise
● reduce redundancy
● replace wet-lab analyses with predictions
● detect interesting relationships and models, and direct manpower towards points where it is needed
Overview of the Steps in KDD
KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, and Smyth 1996).
Here, data are a set of facts (for example, cases in a database), and pattern is an expression in some language describing a subset of the data or a model applicable to the subset. Hence, in our usage here, extracting a pattern also designates fitting a model to data; finding structure from data; or, in general, making any high-level description of a set of data. The term process implies that KDD comprises many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple iterations. By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers.

The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and potentially useful, that is, lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some postprocessing.

The previous discussion implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possible to define measures of certainty (for example, estimated prediction accuracy on new data) or utility (for example, gain, perhaps in dollars saved because of better predictions or speedup in response time of a system). Notions such as novelty and understandability are much more subjective. In certain contexts, understandability can be estimated by simplicity (for example, the number of bits to describe a pattern). An important notion, called interestingness (for example, see Silberschatz and Tuzhilin [1995] and Piatetsky-Shapiro and Matheus [1994]), is usually taken as an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity. Interestingness functions can be defined explicitly or can be manifested implicitly through an ordering placed by the KDD system on the discovered patterns or models.

Given these notions, we can consider a pattern to be knowledge if it exceeds some interestingness threshold, which is by no means an attempt to define knowledge in the philosophical or even the popular view. As a matter of fact, knowledge in this definition is purely user oriented and domain specific and is determined by whatever functions and thresholds the user chooses.

Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data. Note that the space of
[Figure 1 schematic: Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Patterns → (Interpretation / Evaluation) → Knowledge]
Figure 1. An Overview of the Steps That Compose the KDD Process. Taken from U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, "From Data Mining to Knowledge Discovery in Databases" (1996), AI Magazine, 17, 37-54.
ML Tools employed in Bioinformatics and Co-occurrence of Methods
taken from "The rise and fall of supervised machine learning techniques"
BIOINFORMATICS EDITORIAL Vol. 27 no. 24 2011, pages 3331–3332, doi:10.1093/bioinformatics/btr585
Editorial
The rise and fall of supervised machine learning techniques
Lars Juhl Jensen¹ and Alex Bateman²
¹Department of Disease Systems Biology, The Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, DK-2200 Copenhagen N, Denmark and ²Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
Machine learning is of immense importance in bioinformatics and biomedical science more generally (Larrañaga et al., 2006; Tarca et al., 2007). In particular, supervised machine learning has been used to great effect in numerous bioinformatics prediction methods. Through many years of editing and reviewing manuscripts, we noticed that some supervised machine learning techniques seem to be gaining in popularity while others seemed, at least to our eyes, to be looking 'unfashionable'.

We were motivated to create a league table of machine learning techniques to learn what is hot and what is not in the machine learning field. In this editorial, we only include those that we considered major league and leave analysis of the minor league methods as an exercise for the interested reader. To create our league table, we created a list of supervised machine learning techniques commonly used in bioinformatics and their common synonyms, plural forms and abbreviations. We then searched this list against the PubMed titles and abstracts to identify the number of papers published per year for each machine learning technique. To match as many papers as possible, searches were case insensitive and allowed for variation in hyphenation.
Fig. 1. The growth of supervised machine learning methods in PubMed.
To our surprise, the artificial neural network (ANN) is not only the dominant league leader in 2011 but has been in this position since at least the 1970s (see Fig. 1). However, in recent years the usage of support vector machines (SVMs) grew tremendously, and we predict that SVMs will challenge ANNs for the dominant position in the coming decade. Since 2007 the number of publications using ANNs has decreased by 21%, which we hypothesize may be directly attributed to researchers increasingly using SVMs in place of ANNs. SVMs caught up with and overtook Markov models in 2004 to gain second spot in our machine learning league.

As for the question of 'what is hot?', one can see that Random forests are a rapidly growing method with not a single mention of them before 2003 and now a total of 407 papers published to date.

We were hoping to find techniques that were not so hot and perhaps going out of fashion. The results show that none of the major league methods has gone out of fashion, but we do see moderate decreases in the use of both ANNs and Markov models in the literature.

We were also curious to find out if certain machine learning techniques were used in combination with each other. To investigate this, we looked at what machine learning methods are co-mentioned in articles (see Fig. 2). For all pairs of methods from the Supervised

Fig. 2. Heatmap showing the co-occurrence of machine learning techniques within articles.
L. J. Jensen & A. Bateman, 2011. Bioinformatics, 27(24), 3331-2
Possible Explanations for the Prevalence of ANNs and SVMs
● they are capable of handling a huge number of attributes
● they are quite robust against uninformative features
● they implicitly adjust feature weights during the training phase
● they work sufficiently well
Possible Explanations for the Prevalence of ANNs and SVMs
● you do not need to have an idea about the meaning of an input
● i.e. no background knowledge or understanding is necessary for feature selection or, even stronger, for feature generation
● disadvantage: these methods are "black box" models, so inspecting the model does not really increase your knowledge/understanding
How is Machine Learning Influenced by Underlying Assumptions?
● there are a number of assumptions in the various processing steps
● the performance depends on these assumptions holding
● very often we cannot really check or prove whether this is true
Background distribution
● we assume that the background distribution is uniform
● i.e. the underlying source emits instances with constant probabilities over time
● possible solutions:
- use many features to represent complex scenarios
- use stream-mining algorithms which update parameters
Insufficient Model Complexity
[Figure: scatter plot of data points that a too simple model cannot capture]
Unfair Sampling
● due to "experimental" reasons the sample represents only a special subset of the entities
● especially difficult for lazy learning methods like k-nearest neighbors
● possible solutions:
- remove redundancy
- use stratification
- check variance and identify difficult instances
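Stratification, the second remedy listed above, can be sketched as drawing the same fraction from every class, so the sample preserves the class proportions. A minimal illustration; instances are assumed to be (features, label) pairs:

```python
import random

# Stratified sampling: group instances by label, then draw the same
# fraction from each group so no class is over- or under-represented.
def stratified_sample(instances, fraction, seed=0):
    rng = random.Random(seed)
    by_label = {}
    for inst in instances:
        by_label.setdefault(inst[1], []).append(inst)
    sample = []
    for label, group in by_label.items():
        k = max(1, round(len(group) * fraction))  # at least one per class
        sample.extend(rng.sample(group, k))
    return sample
```

The same grouping is the basis of stratified cross-validation, where each fold keeps the overall class proportions.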
[Figure: scatter plot illustrating an unfairly sampled data set]
Redundancy Reduction – More Reasons
Redundancy Reduction
● the collection of the data is typically governed by a specific research task
● the sampling of the "global" distribution is not fair
● models try to minimize the error over ALL instances
● stay "local" with your predictions (know your limits)
● apply redundancy reduction to make the data a "fair" sample
Redundancy Reduction
● CD-HIT: clusters sequences according to a user-given threshold (CD-HIT: accelerated for clustering the next-generation sequencing data. Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu and Weizhong Li, Bioinformatics (2012) 28:3150-3152, doi:10.1093/bioinformatics/bts565)
● UniqueProt: creates representative, unbiased sets of protein sequences based on HSSP values (Mika & Rost, Nucleic Acids Res. 2003 Jul 1; 31(13):3789-3791, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC169026/)
Suitable Representation
● we assume that the selected feature set can represent the concept to learn
● since we often do not know causal relationships, we might mistake the employed features for the truly causal ones
● e.g. number of storks and births in Germany
● e.g. opening umbrellas and rain
● remedy: use background knowledge, careful interpretation
Representation of the Concept

Weather | Temperature | Wind | playTennis
rainy   | high        | no   | yes
dry     | medium      | yes  | yes
rainy   | medium      | yes  | no
dry     | medium      | no   | no

• with these attributes it is really hard to learn favorable conditions for playing tennis
• we unconsciously assume that these attributes are sufficient to describe the scenario
Representation of the Concept

Weather | Temperature | Wind | Buddy | playTennis
rainy   | high        | no   | yes   | yes
dry     | medium      | yes  | yes   | yes
rainy   | medium      | yes  | no    | no
dry     | medium      | no   | no    | no

• actually there is no proof to show whether important attributes are missing or not
• for this aspect, background knowledge and experience are most important
Performance Estimation
● training a model means discarding some information from the observed instances and keeping the rest
● depending on the learning scheme, instance-specific information is stored too
● instance-specific information leads to overfitting
● overfitting: a prediction model is biased towards the training examples, i.e.:
- better performance on training examples
- worse performance on new instances
More Realistic Estimations
● most optimistic estimation: resubstitution error (determine the performance on the very set used for training)
● if you have a lot of data: determine the error on an independent test set
More Realistic Estimations II
● LOOCV: leave-one-out cross-validation
- one example at a time is held out for testing, the remaining ones are used for training
- n iterations with n instances; the final result is the average
- still quite biased
- use it to check the influence of individual instances
- use it if you have a small number of instances
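The LOOCV procedure above can be sketched as follows; `train` and `predict` are hypothetical callbacks standing in for any learning scheme, and instances are assumed to be (features, label) pairs:

```python
# Leave-one-out cross-validation: n iterations over n instances, each
# time holding out exactly one instance for testing and training on
# the rest; the result is the average (here: accuracy).
def loocv_accuracy(instances, train, predict):
    correct = 0
    for i in range(len(instances)):
        held_out = instances[i]
        training = instances[:i] + instances[i + 1:]
        model = train(training)
        if predict(model, held_out[0]) == held_out[1]:
            correct += 1
    return correct / len(instances)
```

Note the cost: the learning scheme is trained n times, which is why LOOCV is mainly attractive for small data sets.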
More Realistic Estimations III
● n-fold cross-validation, typically n = 10:
- partition the data into n partitions
- use n-1 partitions for training
- use 1 partition for performance assessment
- repeat with a different hold-out partition
- average the performance
● every step where class information is considered has to be included in the loop! (i.e. redo such steps for each partitioning)
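The n-fold procedure above can be sketched as follows; `evaluate` is a hypothetical callback that trains on one set and returns a performance score on the other:

```python
import random

# n-fold cross-validation: shuffle, split into n roughly equal
# partitions, train on n-1 of them, test on the held-out one,
# and average the per-fold performance.
def cross_validate(instances, evaluate, n=10, seed=0):
    data = instances[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::n] for i in range(n)]
    scores = []
    for i in range(n):
        test_fold = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(evaluate(training, test_fold))
    return sum(scores) / n
```

Any class-dependent preprocessing (feature selection, discretization thresholds, etc.) would have to run inside the loop, on the training partitions only, as the slide warns.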
Artificial Neural Networks
● still the most prevalent machine learning scheme in bioinformatics
● typically feed-forward multi-layer perceptrons (https://ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/)
● error back-propagation (http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html)
Artificial Neural Networks
● different activation functions
● consider the number of free parameters with respect to the number of available training instances
● determine the number of epochs
Artificial Neural Networks
● both too many free parameters (edges) and overtraining lead to overfitting
● initialize the weights with random values from the linear region of the activation function
● repeat several times to avoid getting stuck in local minima
● learning the weights AND determining the optimum number of epochs both belong to the training phase (sometimes referred to as training and cross-training, or training and validation)
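Determining the optimum number of epochs on a validation set (the "training and validation" split above) is commonly done by early stopping: stop once the validation error has not improved for a while. A minimal sketch with hypothetical callbacks (`train_one_epoch`, `validation_error`) standing in for the actual network:

```python
# Early stopping: after each training epoch, measure the error on a
# held-out validation set; keep the best epoch seen so far and stop
# once `patience` epochs pass without improvement.
def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=1000, patience=5):
    best_err, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break   # no improvement for `patience` epochs
    return best_epoch, best_err
```

Since the validation set steers the stopping point, it counts as part of training; the final performance estimate still needs a separate test set.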
Use case
● develop a predictor for proteases belonging to a certain fold (3D structure)
● check the PDB for respective entries (structure & function annotation)
● search the database for similar sequences
● sanity check: predict structure elements / check function annotations
● => compile the positive training set
Use case
● compile an appropriate negative training set with:
- same fold but different function
- same function but different folds
- in real-life problems this is a major challenge
● decide about the coding (features) and recode your data set
● train your method
● evaluate your method
● estimate stability/confidence
Overfitting
● an (undesired) performance feature
● detection:
- performance on test instances is significantly worse than on training instances
● reasons:
- too high model complexity (too many parameters)
- too many training epochs (neural networks)
Class Imbalances
● learning schemes tend to minimize the prediction error over all instances
● if the positive class is small, then errors on the positives (false negatives) hardly matter to the learner anymore
● solutions:
- oversample the minority class
- downsample the majority class
- assign weights to the different error types
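The first remedy, oversampling the minority class, can be sketched as replicating minority instances (drawn with replacement) until every class reaches the majority-class size. Instances are assumed to be (features, label) pairs:

```python
import random

# Oversampling: group instances by label, then pad every smaller class
# with randomly drawn copies of its own instances until all classes
# match the size of the largest one.
def oversample(instances, seed=0):
    rng = random.Random(seed)
    by_label = {}
    for inst in instances:
        by_label.setdefault(inst[1], []).append(inst)
    target = max(len(g) for g in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # draw extra copies (with replacement) for the smaller class(es)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced
```

Downsampling is the mirror image (discard majority instances); error weighting keeps all data but penalizes false negatives and false positives differently during training.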
HSSP Curve