TRANSCRIPT
PP1CS SoSe 17
Protein Prediction I for Computer Scientists
Machine Learning & Secondary Structure Prediction
June 22nd/27th, Summer Term 2017, Burkhard Rost & Lothar Richter
Lecture and exercise
● https://www.rostlab.org/teaching/ss17/pp1cs
● Announcements, slides and videos
● Lecture Tuesdays (10:00-11:30 am) and Thursdays (10:00-11:30 am)
● Room MW1801 (Mechanical Engineering)
● Exercise Thursdays 12:30-14:00, Room Hörsaal 3 (MI 00.06.011, Lecture hall 3), and mostly MW2250 on Tuesdays 13-15
● Register for the lecture and exam in TUMonline
Exercise
● Exercise wiki: https://i12r-studfilesrv.informatik.tu-muenchen.de/sose17/pp4cs1/index.php/Main_Page
Exercise – Topics and Schedule

Slot | Thursday | Tuesday  | Topic
1    | May 4th  | May 9th  | Structure of the Exercise / Biological Background
2    | May 11th | May 16th | Biological background
3    | May 18th | May 23rd | Protein structures
4    | Jun 1st  | Jun 13th | Alignments
5    | Jun 8th  | Jun 20th | Resources for Biological Information / Formats
6    | Jun 22nd | Jun 27th | Machine Learning incl. Tricks / Secondary Structure Prediction
7    | Jun 29th | Jul 4th  | Homology Modeling / Prediction of Other Protein Features
8    | Jul 6th  |          | Wrap Up – Questions

EXAM: WED Jul 12th
Ideas
● a machine learning device can generalize from real-world observations into a "formal" model
● each model reflects only a few aspects of reality
● no model can completely represent reality, i.e. a photograph of a dog remains a photograph and not a real dog
● the model should reflect a concept or commonalities and not individual characteristics
Inductive Bias
● every learning scheme discards some aspects of reality to construct a model
● this may differ between different learning schemes
● this might also already happen at the level of feature extraction, i.e. choosing the types and values to represent an observation
● this is not to be mistaken for the predictive bias
Some vocabulary
● learning scheme: a specific learning algorithm producing a model, like decision trees, rule-based systems, SVMs, Bayesian networks, etc.
● attribute/feature: a variable describing a specific aspect of real-world observations, like body weight, color, or whether a certain property is present (yes/no)
● instance: a single observation describing an observed event by assigning a value for each feature used to represent this observation
Some vocabulary II
● training: the phase of analyzing real-world observations in a formalized representation to derive parameters and/or internal structure
● test phase: the phase of model application that determines the reliability of statements (predictions) on instances not used for training
● label: an attribute selected to be predicted
Types of Learning
● depending on the presence of a label, we distinguish between supervised and unsupervised learning
● unsupervised learning: concept learning, frequent itemsets, clustering
● supervised learning: everything with labeled data which allows making a prediction
Data Preprocessing
Before you can start to build a model, in most cases the data are subjected to various preprocessing steps:
● Feature Extraction
● Discretization
● Feature Selection
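As a concrete illustration of the discretization step, equal-width binning maps a continuous attribute onto a small number of discrete values. This is a minimal sketch (the bin count is a free choice, not prescribed by the slides):

```python
# Equal-width discretization: map a continuous feature onto n_bins
# integer-valued bins of equal width over the observed range.
def discretize(values, n_bins=3):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # avoid zero width for constant features
    # clamp the maximum value into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

In practice, bin boundaries can also be chosen by frequency (equal-depth binning) or, in a supervised setting, by class-based measures.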
Feature Extraction / Construction
● conversion of observation records into a formalized, computer-readable representation
● definition of an attribute type
● assignment of appropriate attribute values
● this implies a strong involvement of the analyst
● important: common sense, background knowledge from expert domains
Feature Selection
● remove values from instances, i.e. discard some features of a data set because these are:
- irrelevant
- redundant
- noisy/faulty
● possible benefits:
- improve efficiency and accuracy
- prevent overfitting
- save space
Feature Selection Strategies
● unsupervised: based on domain knowledge, random sampling
● supervised:
- measures that consider the class (filtering): Gini index, information gain, Relief, ...
- use a learning scheme's performance (wrapping):
  ● select the set of attributes which leads to the best performance
  ● forward selection: increase the set of attributes one by one
  ● backward elimination: decrease the set of attributes one by one
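The wrapper-based forward selection above can be sketched in a few lines. This is a minimal illustration, not a specific tool: `evaluate` is a hypothetical callback that scores a feature subset, e.g. by a learning scheme's cross-validated performance.

```python
# Forward selection (wrapping): grow the feature set one attribute at a
# time, always keeping the addition that most improves performance, and
# stop when no candidate improves on the current best score.
def forward_selection(all_features, evaluate):
    selected, best_score = [], float("-inf")
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        # score each candidate added to the current set
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        score, feat = max(scored)
        if score <= best_score:   # no improvement: stop
            break
        selected.append(feat)
        best_score = score
    return selected
```

Backward elimination is the mirror image: start from the full set and drop one attribute per round as long as performance does not degrade.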
Machine Learning and Bioinformatics
● today biology has to span between two extremes:
- statements on the nucleotide level (one level below genes) on the one hand
- statements on the individual/population level on the other hand
● the gain in speed at which sequence data (nucleotide sequences) are generated has clearly outpaced the speed of analysis and knowledge discovery
● current lab technology cannot even fill the gap between sequence and structure
Role of DM/ML
Data Mining helps to:
● structure and compress the data
● filter out mistakes and outliers caused by experimental errors and other noise
● reduce redundancy
● replace wet-lab analyses with predictions
● detect interesting relationships and models, and direct manpower towards points where it is needed
Overview of the Steps in KDD
KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, and Smyth 1996).
Here, data are a set of facts (for example, cases in a database), and pattern is an expression in some language describing a subset of the data or a model applicable to the subset. Hence, in our usage here, extracting a pattern also designates fitting a model to data; finding structure from data; or, in general, making any high-level description of a set of data. The term process implies that KDD comprises many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple iterations. By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers.

The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and potentially useful, that is, lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some postprocessing.

The previous discussion implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possible to define measures of certainty (for example, estimated prediction accuracy on new data) or utility (for example, gain, perhaps in dollars saved because of better predictions or speedup in response time of a system). Notions such as novelty and understandability are much more subjective. In certain contexts, understandability can be estimated by simplicity (for example, the number of bits to describe a pattern). An important notion, called interestingness (for example, see Silberschatz and Tuzhilin [1995] and Piatetsky-Shapiro and Matheus [1994]), is usually taken as an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity. Interestingness functions can be defined explicitly or can be manifested implicitly through an ordering placed by the KDD system on the discovered patterns or models.

Given these notions, we can consider a pattern to be knowledge if it exceeds some interestingness threshold, which is by no means an attempt to define knowledge in the philosophical or even the popular view. As a matter of fact, knowledge in this definition is purely user oriented and domain specific and is determined by whatever functions and thresholds the user chooses.

Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data. Note that the space of
[Figure 1 schematic: Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Patterns → (Interpretation / Evaluation) → Knowledge]
Figure 1. An Overview of the Steps That Compose the KDD Process. Taken from U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, "From Data Mining to Knowledge Discovery in Databases" (1996), AI Magazine, 17, 37-54.
ML Tools employed in Bioinformatics and Co-occurrence of Methods
taken from "The rise and fall of supervised machine learning techniques"
BIOINFORMATICS EDITORIAL Vol. 27 no. 24 2011, pages 3331–3332, doi:10.1093/bioinformatics/btr585
Editorial
The rise and fall of supervised machine learning techniques
Lars Juhl Jensen¹ and Alex Bateman²
¹Department of Disease Systems Biology, The Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, DK-2200 Copenhagen N, Denmark and ²Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
Machine learning is of immense importance in bioinformatics and biomedical science more generally (Larrañaga et al., 2006; Tarca et al., 2007). In particular, supervised machine learning has been used to great effect in numerous bioinformatics prediction methods. Through many years of editing and reviewing manuscripts, we noticed that some supervised machine learning techniques seem to be gaining in popularity while others seemed, at least to our eyes, to be looking 'unfashionable'.

We were motivated to create a league table of machine learning techniques to learn what is hot and what is not in the machine learning field. In this editorial, we only include those that we considered major league and leave analysis of the minor league methods as an exercise for the interested reader. To create our league table, we created a list of supervised machine learning techniques commonly used in bioinformatics and their common synonyms, plural forms and abbreviations. We then searched this list against the PubMed titles and abstracts to identify the number of papers published per year for each machine learning technique. To match as many papers as possible, searches were case insensitive and allowed for variation in hyphenation.
Fig. 1. The growth of supervised machine learning methods in PubMed.
To our surprise, the artificial neural network (ANN) is not only the dominant league leader in 2011 but has been in this position since at least the 1970s (see Fig. 1). However, in recent years the usage of support vector machines (SVMs) grew tremendously, and we predict that SVMs will challenge ANNs for the dominant position in the coming decade. Since 2007 the number of publications using ANNs has decreased by 21%, which we hypothesize may be directly attributed to researchers increasingly using SVMs in place of ANNs. SVMs caught up with and overtook Markov models in 2004 to gain second spot in our machine learning league.

As for the question of 'what is hot?', one can see that Random forests are a rapidly growing method with not a single mention of them before 2003 and now a total of 407 papers published to date.

We were hoping to find techniques that were not so hot and perhaps going out of fashion. The results show that none of the major league methods has gone out of fashion, but we do see moderate decreases in the use of both ANNs and Markov models in the literature.

We were also curious to find out if certain machine learning techniques were used in combination with each other. To investigate this, we looked at what machine learning methods are co-mentioned in articles (see Fig. 2). For all pairs of methods from the Supervised

Fig. 2. Heatmap showing the co-occurrence of machine learning techniques within articles.
L. J. Jensen & A. Bateman, 2011. Bioinformatics, 27(24), 3331-2
Possible Explanations for the Prevalence of ANNs and SVMs
● they are capable of handling a huge number of attributes
● they are quite robust against uninformative features
● they implicitly adjust feature weights during the training phase
● they work sufficiently well
Possible Explanations for the Prevalence of ANNs and SVMs
● you do not need to have an idea about the meaning of an input
● i.e. no background knowledge or understanding is necessary for feature selection or, even stronger, for feature generation
● disadvantage: these methods are "black box" models, so inspecting the model does not really increase your knowledge/understanding
How is Machine Learning Influenced by Underlying Assumptions?
● there are a number of assumptions in the various processing steps
● the performance depends on these assumptions holding
● very often we cannot really check or prove whether this is true
Background distribution
● we assume that the background distribution is uniform
● i.e. the underlying source emits instances with constant probabilities over time
● possible solutions:
- use many features to represent complex scenarios
- use stream-mining algorithms which update parameters
Insufficient Model Complexity
[Figure: scatter plot of data points that a too simple model cannot capture]
Unfair Sampling
● due to "experimental" reasons the sample represents only a special subset of the entities
● especially difficult for lazy learning methods like k-nearest neighbors
● possible solutions:
- remove redundancy
- use stratification
- check variance and identify difficult instances
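Stratification, the second remedy listed above, can be sketched as drawing the same fraction from every class, so the sample preserves the class proportions. A minimal illustration; instances are assumed to be (features, label) pairs:

```python
import random

# Stratified sampling: group instances by label, then draw the same
# fraction from each group so no class is over- or under-represented.
def stratified_sample(instances, fraction, seed=0):
    rng = random.Random(seed)
    by_label = {}
    for inst in instances:
        by_label.setdefault(inst[1], []).append(inst)
    sample = []
    for label, group in by_label.items():
        k = max(1, round(len(group) * fraction))  # at least one per class
        sample.extend(rng.sample(group, k))
    return sample
```

The same grouping is the basis of stratified cross-validation, where each fold keeps the overall class proportions.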
[Figure: scatter plot illustrating an unfairly sampled data set]
Redundancy Reduction – More Reasons
Redundancy Reduction
● the collection of the data is typically governed by a specific research task
● the sampling of the "global" distribution is not fair
● models try to minimize the error over ALL instances
● stay "local" with your predictions (know your limits)
● apply redundancy reduction to make the data a "fair" sample
Redundancy Reduction
● CD-HIT: clusters sequences according to a user-given threshold (CD-HIT: accelerated for clustering the next-generation sequencing data. Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu and Weizhong Li, Bioinformatics (2012) 28:3150-3152, doi:10.1093/bioinformatics/bts565)
● UniqueProt: creates representative, unbiased sets of protein sequences based on HSSP values (Mika & Rost, Nucleic Acids Res. 2003 Jul 1; 31(13):3789-3791, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC169026/)
Suitable Representation
● we assume that the selected feature set can represent the concept to learn
● since we often do not know causal relationships, we might mistake the employed features for the truly causal ones
● e.g. number of storks and births in Germany
● e.g. opening umbrellas and rain
● remedy: use background knowledge, careful interpretation
Representation of the Concept

Weather | Temperature | Wind | playTennis
rainy   | high        | no   | yes
dry     | medium      | yes  | yes
rainy   | medium      | yes  | no
dry     | medium      | no   | no

• with these attributes it is really hard to learn favorable conditions for playing tennis
• we unconsciously assume that these attributes are sufficient to describe the scenario
Representation of the Concept

Weather | Temperature | Wind | Buddy | playTennis
rainy   | high        | no   | yes   | yes
dry     | medium      | yes  | yes   | yes
rainy   | medium      | yes  | no    | no
dry     | medium      | no   | no    | no

• actually there is no proof to show whether important attributes are missing or not
• for this aspect, background knowledge and experience are most important
Performance Estimation
● training a model means discarding some information from the observed instances and keeping the rest
● depending on the learning scheme, instance-specific information is stored too
● instance-specific information leads to overfitting
● overfitting: a prediction model is biased towards the training examples, i.e.:
- better performance on training examples
- worse performance on new instances
More Realistic Estimations
● most optimistic estimation: resubstitution error (determine the performance on the very set used for training)
● if you have a lot of data: determine the error on an independent test set
More Realistic Estimations II
● LOOCV: leave-one-out cross-validation
- one example at a time is held out for testing, the remaining ones are used for training
- n iterations with n instances; the final result is the average
- still quite biased
- use it to check the influence of individual instances
- use it if you have a small number of instances
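The LOOCV procedure above can be sketched as follows; `train` and `predict` are hypothetical callbacks standing in for any learning scheme, and instances are assumed to be (features, label) pairs:

```python
# Leave-one-out cross-validation: n iterations over n instances, each
# time holding out exactly one instance for testing and training on
# the rest; the result is the average (here: accuracy).
def loocv_accuracy(instances, train, predict):
    correct = 0
    for i in range(len(instances)):
        held_out = instances[i]
        training = instances[:i] + instances[i + 1:]
        model = train(training)
        if predict(model, held_out[0]) == held_out[1]:
            correct += 1
    return correct / len(instances)
```

Note the cost: the learning scheme is trained n times, which is why LOOCV is mainly attractive for small data sets.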
More Realistic Estimations III
● n-fold cross-validation, typically n = 10:
- partition the data into n partitions
- use n-1 partitions for training
- use 1 partition for performance assessment
- repeat with a different hold-out partition
- average the performance
● every step where class information is considered has to be included in the loop! (i.e. redo such steps for each partitioning)
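The n-fold procedure above can be sketched as follows; `evaluate` is a hypothetical callback that trains on one set and returns a performance score on the other:

```python
import random

# n-fold cross-validation: shuffle, split into n roughly equal
# partitions, train on n-1 of them, test on the held-out one,
# and average the per-fold performance.
def cross_validate(instances, evaluate, n=10, seed=0):
    data = instances[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::n] for i in range(n)]
    scores = []
    for i in range(n):
        test_fold = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(evaluate(training, test_fold))
    return sum(scores) / n
```

Any class-dependent preprocessing (feature selection, discretization thresholds, etc.) would have to run inside the loop, on the training partitions only, as the slide warns.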
Artificial Neural Networks
● still the most prevalent machine learning scheme in bioinformatics
● typically feed-forward multi-layer perceptrons (https://ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/)
● error back-propagation (http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html)
Artificial Neural Networks
● different activation functions
● consider the number of free parameters with respect to the number of available training instances
● determine the number of epochs
Artificial Neural Networks
● both too many free parameters (edges) and overtraining lead to overfitting
● initialize the weights with random values from the linear region of the activation function
● repeat several times to avoid getting stuck in local minima
● learning the weights AND determining the optimum number of epochs both belong to the training phase (sometimes referred to as training and cross-training, or training and validation)
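Determining the optimum number of epochs on a validation set (the "training and validation" split above) is commonly done by early stopping: stop once the validation error has not improved for a while. A minimal sketch with hypothetical callbacks (`train_one_epoch`, `validation_error`) standing in for the actual network:

```python
# Early stopping: after each training epoch, measure the error on a
# held-out validation set; keep the best epoch seen so far and stop
# once `patience` epochs pass without improvement.
def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=1000, patience=5):
    best_err, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break   # no improvement for `patience` epochs
    return best_epoch, best_err
```

Since the validation set steers the stopping point, it counts as part of training; the final performance estimate still needs a separate test set.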
Use case
● develop a predictor for proteases belonging to a certain fold (3D structure)
● check the PDB for respective entries (structure & function annotation)
● search the database for similar sequences
● sanity check: predict structure elements / check function annotations
● => compile the positive training set
Use case
● compile an appropriate negative training set with:
- same fold but different function
- same function but different folds
- in real-life problems this is a major challenge
● decide about the coding (features) and recode your data set
● train your method
● evaluate your method
● estimate stability/confidence
Overfitting
● an (undesired) performance feature
● detection:
- performance on test instances is significantly worse than on training instances
● reasons:
- too high model complexity (too many parameters)
- too many training epochs (neural networks)
Class Imbalances
● learning schemes tend to minimize the prediction error over all instances
● if the positive class is small, then errors on the positives (false negatives) hardly matter to the learner anymore
● solutions:
- oversample the minority class
- downsample the majority class
- assign weights to the different error types
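The first remedy, oversampling the minority class, can be sketched as replicating minority instances (drawn with replacement) until every class reaches the majority-class size. Instances are assumed to be (features, label) pairs:

```python
import random

# Oversampling: group instances by label, then pad every smaller class
# with randomly drawn copies of its own instances until all classes
# match the size of the largest one.
def oversample(instances, seed=0):
    rng = random.Random(seed)
    by_label = {}
    for inst in instances:
        by_label.setdefault(inst[1], []).append(inst)
    target = max(len(g) for g in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # draw extra copies (with replacement) for the smaller class(es)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced
```

Downsampling is the mirror image (discard majority instances); error weighting keeps all data but penalizes false negatives and false positives differently during training.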
HSSP Curve