Exercise – Topics and Schedule

Slot  Thursday  Tuesday   Topic
1     May 4th   May 9th   Structure of the Exercise / Biological Background
2     May 11th  May 16th  Biological background
3     May 18th  May 23rd  Protein structures
4     Jun 1st   Jun 13th  Alignments
5     Jun 8th   Jun 20th  Resources for Biological Information / Formats
6     Jun 22nd  Jun 27th  Machine Learning incl. Tricks / Secondary Structure Prediction
7     Jun 29th  Jul 4th   Homology Modeling / Prediction of Other Protein Features
8     Jul 6th             Wrap Up – Questions

WED Jul 12th EXAM

●  a machine learning device can generalize from real world observations into a “formal” model

●  each model reflects only a few aspects of reality

●  no model can completely represent reality, e.g. a photograph of a dog remains a photograph and not a real dog

●  the model should reflect a concept or commonalities and not individual characteristics

●  every learning scheme discards some aspects of reality to construct a model

●  this may differ between different learning schemes

●  this might already happen on the level of feature extraction, i.e. when choosing the types and values to represent an observation

●  this is not to be mistaken for the predictive bias

●  learning scheme: a specific learning algorithm producing a model, like decision trees, rule-based systems, SVMs, Bayesian networks, etc.

●  attribute/feature: a variable describing a specific aspect of real world observations, like body weight, color, or whether a certain property was found (yes/no)

●  instance: a single observation describing an observed event by assigning values for each feature used to represent this observation

●  training: phase of analyzing real world observations in a formalized representation to derive parameters and/or internal structure

●  test phase: phase of model application to determine the reliability of statements (predictions) on instances not used for training

●  label: an attribute selected to be predicted

●  depending on the presence of a label we distinguish between supervised and unsupervised learning

●  unsupervised learning: concept learning, frequent item sets, clustering

●  supervised learning: everything with labeled data, which allows making a prediction

●  Feature Extraction

●  Discretization

●  Feature Selection

●  Conversion of observation records into a formalized, computer-readable representation

●  definition of an attribute type

●  assignment of appropriate attribute values

●  this implies a strong involvement of the analyst

●  important: common sense, background knowledge from expert domains

Feature Selection

●  remove values from instances, i.e. discard some features of a data set because these are:
   -  irrelevant
   -  redundant
   -  noisy/faulty

●  possible benefits:
   -  improve efficiency and accuracy
   -  prevent overfitting
   -  save space

Feature Selection Strategies

●  unsupervised: based on domain knowledge, random sampling

●  supervised:
   -  measures that consider the class (filtering): Gini index, information gain, Relief, ...
   -  use a learning scheme’s performance (wrapping):
      ●  select the set of attributes which leads to the best performance
      ●  forward selection: increase the set of attributes one by one
      ●  backward elimination: decrease the set of attributes one by one
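The filtering strategy above can be sketched with information gain as the class-aware measure. This is only a minimal illustration: the attribute names and the four toy instances are hypothetical, and a real filter would typically also handle numeric attributes and missing values.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on one attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: each column is one attribute, `labels` is the class.
columns = {
    "informative": ["a", "a", "b", "b"],
    "irrelevant":  ["x", "y", "x", "y"],
}
labels = ["yes", "yes", "no", "no"]

# Filter step: rank the attributes by information gain, best first.
ranking = sorted(columns, key=lambda name: information_gain(columns[name], labels), reverse=True)
print(ranking)
```

A wrapper approach would replace `information_gain` with the cross-validated performance of a learning scheme on each candidate attribute set.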

Machine Learning and Bioinformatics

●  today biology has to span two extremes:
   -  statements on the nucleotide level (one level below genes) on the one hand
   -  statements on the individual/population level on the other hand

●  the gain in speed of generating sequence data (nucleotide sequences) has clearly outpaced the speed of analysis and knowledge discovery

●  current lab technology cannot even fill the gap between sequence and structure

Role of DM/ML

Data Mining helps to:

●  structure the data and compress the data

●  filter out mistakes and outliers caused by experimental errors and other noise

●  reduce redundancy

●  replace wet lab analyses with predictions

●  detect interesting relationships and models and direct manpower towards points where it is needed

ly understandable patterns in data (Fayyad, Piatetsky-Shapiro, and Smyth 1996).

Here, data are a set of facts (for example, cases in a database), and pattern is an expression in some language describing a subset of the data or a model applicable to the subset. Hence, in our usage here, extracting a pattern also designates fitting a model to data; finding structure from data; or, in general, making any high-level description of a set of data. The term process implies that KDD comprises many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple iterations. By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers.

The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and potentially useful, that is, lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some postprocessing.

The previous discussion implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possible to define measures of certainty (for example, estimated prediction accuracy on new data) or utility (for example, gain, perhaps in dollars saved because of better predictions or speedup in response time of a system). Notions such as novelty and understandability are much more subjective. In certain contexts, understandability can be estimated by simplicity (for example, the number of bits to describe a pattern). An important notion, called interestingness (for example, see Silberschatz and Tuzhilin [1995] and Piatetsky-Shapiro and Matheus [1994]), is usually taken as an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity. Interestingness functions can be defined explicitly or can be manifested implicitly through an ordering placed by the KDD system on the discovered patterns or models.

Given these notions, we can consider a pattern to be knowledge if it exceeds some interestingness threshold, which is by no means an attempt to define knowledge in the philosophical or even the popular view. As a matter of fact, knowledge in this definition is purely user oriented and domain specific and is determined by whatever functions and thresholds the user chooses.

Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data. Note that the space of


[Figure: KDD process diagram; surviving labels: Target Data, Preprocessed Data, Data Mining, Interpretation / Evaluation]

taken from U. Fayyad, G. Piatetsky-Shapiro, P. Smyth "From Data Mining to Knowledge Discovery in Databases" (1996) AI Magazine, 17, 37-54

L. J. Jensen & A. Bateman, 2011. Bioinformatics, 27(24), 3331-2

●  they are capable of handling a huge number of attributes

●  they are quite robust against uninformative features

●  they implicitly adjust feature weights during the training phase

●  they work sufficiently well

●  You do not need to have an idea about the meaning of an input

●  i.e. no background knowledge or understanding is necessary for feature selection or, even more so, for feature generation

●  Disadvantage: These methods are “black box” models, so inspecting the model does not really increase your knowledge/understanding

●  there are a number of assumptions in the various processing steps

●  the performance depends on whether these assumptions hold

●  very often we cannot really check or prove whether this is true

●  we assume that the background distribution is uniform

●  i.e. the underlying source emits instances with constant probabilities over time

●  possible solutions:
   -  use many features to represent complex scenarios
   -  use stream mining algorithms which update parameters



●  due to “experimental” reasons the sample represents only a special subset of the entities

●  especially difficult for lazy learning methods like k-nearest-neighbors

●  possible solutions:
   -  remove redundancy
   -  use stratification
   -  check variance and identify difficult instances
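Stratification, one of the possible solutions listed above, can be sketched as a split that keeps the class proportions in both parts. The function name, the test fraction, and the toy data are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict

def stratified_split(data, test_fraction=0.25, seed=0):
    """Split instances so that each class contributes the same fraction to the test set.

    Each instance is a tuple whose last element is the class label.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for instance in data:
        by_class[instance[-1]].append(instance)
    train, test = [], []
    for items in by_class.values():
        items = items[:]          # copy so the caller's data is not reordered
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# Hypothetical data: 8 instances of class "a" and 4 of class "b".
data = [(i, "a") for i in range(8)] + [(i, "b") for i in range(8, 12)]
train, test = stratified_split(data)
print(Counter(label for _, label in train), Counter(label for _, label in test))
```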



●  the collection of the data is typically governed by a specific research task

●  the sampling of the “global” distribution is not fair

●  models try to minimize the error OVER ALL instances

●  stay “local” with your predictions (know your limits)

●  apply redundancy reduction to make the data a “fair” sample

●  CD-HIT: clusters sequences according to a user-given threshold (CD-HIT: accelerated for clustering the next-generation sequencing data. Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu and Weizhong Li, Bioinformatics (2012) 28:3150-3152, doi:10.1093/bioinformatics/bts565)

●  UniqueProt: creates representative, unbiased sets of protein sequences based on HSSP values (Mika & Rost, Nucleic Acids Res. 2003 Jul 1; 31(13): 3789–3791., https://
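The general idea of threshold-based redundancy reduction can be sketched as a greedy pass that keeps a sequence only if it is not too similar to any representative kept so far. This is a toy illustration, not the actual CD-HIT or UniqueProt procedure: the similarity measure (`difflib.SequenceMatcher.ratio`), the threshold of 0.8, and the example sequences are all assumptions.

```python
from difflib import SequenceMatcher

def reduce_redundancy(sequences, threshold=0.8):
    """Greedily keep sequences whose similarity to every kept one is below the threshold."""
    representatives = []
    # Longer sequences are considered first, so they become the representatives.
    for seq in sorted(sequences, key=len, reverse=True):
        if all(SequenceMatcher(None, seq, rep).ratio() < threshold for rep in representatives):
            representatives.append(seq)
    return representatives

# Hypothetical sequences: the second differs from the first in one position only.
seqs = [
    "MKTAYIAKQRQISFVKSHFSRQ",
    "MKTAYIAKQRQISFVKSHFSRA",
    "GSSGSSGLVPRGSHMASMTGGQ",
]
print(reduce_redundancy(seqs))
```

With these inputs the near-duplicate second sequence is dropped and the other two remain.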

Suitable Representation

●  we assume that the selected feature set can represent the concept to learn

●  since we often do not know causal relationships, we might mistake the employed features for the real causal ones

●  e.g. number of storks and births in Germany

●  e.g. opening umbrellas and rain

●  remedy: use background knowledge, careful interpretation

Weather  Temperature  Wind  playTennis
rainy    high         no    yes
dry      medium       yes   yes
rainy    medium       yes   no
dry      medium       no    no

•  with these attributes it is really hard to learn favorable conditions for playing tennis

•  we unconsciously assume that these attributes are sufficient to describe the scenario

Weather  Temperature  Wind  Buddy  playTennis
rainy    high         no    yes    yes
dry      medium       yes   yes    yes
rainy    medium       yes   no     no
dry      medium       no    no     no

•  actually there is no proof to show whether important attributes are missing or not

•  for this aspect background knowledge and experience are most important
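The effect of the missing attribute can be checked directly on the second table. The sketch below tests, for each attribute on its own, whether every value always co-occurs with the same playTennis label; the helper name `determines_label` is hypothetical.

```python
rows = [
    # Weather, Temperature, Wind, Buddy, playTennis
    ("rainy", "high",   "no",  "yes", "yes"),
    ("dry",   "medium", "yes", "yes", "yes"),
    ("rainy", "medium", "yes", "no",  "no"),
    ("dry",   "medium", "no",  "no",  "no"),
]
names = ["Weather", "Temperature", "Wind", "Buddy"]

def determines_label(column_index):
    """True if every value of the attribute always co-occurs with one label."""
    seen = {}
    for row in rows:
        value, label = row[column_index], row[-1]
        # setdefault stores the first label seen for this value and returns it afterwards
        if seen.setdefault(value, label) != label:
            return False
    return True

for i, name in enumerate(names):
    print(name, determines_label(i))
# Weather False
# Temperature False
# Wind False
# Buddy True
```

Only Buddy separates the two classes here, which is exactly the attribute that the first table does not contain.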

Performance Estimation

●  training a model means discarding some information from the observed instances and keeping some

●  depending on the learning scheme, instance-specific information is stored too

●  instance-specific information leads to overfitting

●  overfitting: a prediction model is biased towards the training examples, i.e.:
   -  better performance on training examples
   -  worse performance on new instances

More Realistic Estimations

●  most optimistic estimation: resubstitution error (determine the performance on the very set that was used for training)

●  if you have a lot of data: determine the error on an independent test set
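Why the resubstitution error is so optimistic can be seen with a learner that stores its training instances. The sketch below uses a 1-nearest-neighbor classifier on hypothetical one-dimensional data with some label noise; all names and values are assumptions for illustration.

```python
def predict_1nn(train, x):
    """Return the label of the training point closest to x (1-nearest-neighbor)."""
    nearest = min(train, key=lambda item: abs(item[0] - x))
    return nearest[1]

def error_rate(train, test):
    """Fraction of test instances whose predicted label is wrong."""
    wrong = sum(predict_1nn(train, x) != y for x, y in test)
    return wrong / len(test)

# Hypothetical data: (feature value, label).
train = [(1.0, "neg"), (2.0, "neg"), (3.0, "pos"), (4.0, "neg"), (5.0, "pos"), (6.0, "pos")]
test = [(1.5, "neg"), (2.9, "neg"), (3.6, "pos"), (5.5, "pos")]

print("resubstitution error:", error_rate(train, train))    # 0.0
print("independent test error:", error_rate(train, test))   # 0.5
```

Every training point is its own nearest neighbor, so the resubstitution error is zero regardless of how noisy the labels are.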

More Realistic Estimations II

●  LOOCV: leave-one-out cross-validation
   -  in each iteration one example is held out for testing, the remaining ones are used for training
   -  n iterations with n instances, the final result is the average
   -  still quite biased
   -  use it to check the influence of individual instances
   -  use it if you have a small number of instances
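A minimal sketch of the leave-one-out loop, again assuming a 1-nearest-neighbor classifier and hypothetical one-dimensional data. Returning the per-instance outcome makes it easy to see which individual instances are difficult.

```python
def predict_1nn(train, x):
    """Return the label of the training point closest to x (1-nearest-neighbor)."""
    return min(train, key=lambda item: abs(item[0] - x))[1]

def loocv(data):
    """Hold out each instance once; return one True/False 'misclassified' flag per instance."""
    flags = []
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]   # all instances except the held-out one
        flags.append(predict_1nn(rest, x) != y)
    return flags

# Hypothetical data: (feature value, label).
data = [(1.0, "neg"), (1.8, "neg"), (3.0, "pos"), (3.9, "neg"), (5.0, "pos"), (5.7, "pos")]

flags = loocv(data)
error = sum(flags) / len(flags)          # average over the n iterations
difficult = [inst for inst, wrong in zip(data, flags) if wrong]
print(error, difficult)
```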

More Realistic Estimations III

●  n-fold cross-validation, typically n=10
   -  partition the data into n partitions
   -  use n-1 partitions for training
   -  use 1 partition for performance assessment
   -  repeat with a different hold-out partition
   -  average the performance

●  every step where class information is considered has to be included in the loop! (done after partitioning)
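The loop described above can be sketched as follows. The point of the example is the position of the class-dependent step: `fit` is called inside the loop and only sees the training partitions. The threshold learner, the function names, and the data are hypothetical.

```python
import random

def k_fold_indices(n_instances, k, seed=0):
    """Shuffle the instance indices and deal them into k disjoint partitions."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(data, k, fit, predict):
    """Average error over k folds; `fit` sees only the training partitions."""
    fold_errors = []
    for held_out in k_fold_indices(len(data), k):
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        test = [data[i] for i in held_out]
        # Any step that uses class information (feature selection, thresholds, ...)
        # belongs here, after partitioning, and must use `train` only.
        model = fit(train)
        wrong = sum(predict(model, x) != y for x, y in test)
        fold_errors.append(wrong / len(test))
    return sum(fold_errors) / k

def fit_threshold(train):
    """Toy learner: threshold halfway between the two class means."""
    neg = [x for x, y in train if y == "neg"]
    pos = [x for x, y in train if y == "pos"]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2

def predict_threshold(threshold, x):
    return "pos" if x > threshold else "neg"

# Hypothetical, well separated data: 10 negatives and 10 positives.
data = [(float(v), "neg") for v in range(10)] + [(float(v), "pos") for v in range(20, 30)]
print(cross_validate(data, 5, fit_threshold, predict_threshold))
```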

Artificial Neural Networks

●  still the most prevalent machine learning scheme in bioinformatics

●  typically Feed-Forward Multi-Layer Perceptrons (https://

●  Error Back-Propagation (http://

Artificial Neural Networks

●  different activation functions

●  consider the number of free parameters with respect to the number of available training instances

●  determine the number of epochs
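To relate the number of free parameters to the number of training instances, the parameters can simply be counted. The sketch assumes fully connected layers with one bias per non-input unit; the layer sizes (260 inputs, 10 hidden units, 3 outputs) are a hypothetical example.

```python
def count_free_parameters(layer_sizes):
    """Weights plus biases of a fully connected feed-forward network.

    Each unit in a layer has one weight per unit in the previous layer and one bias.
    """
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical network: 260 inputs, 10 hidden units, 3 output units.
print(count_free_parameters([260, 10, 3]))   # 261*10 + 11*3 = 2643
```

Comparing this number with the number of available training instances gives a first impression of how easily the network can overfit.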

Artificial Neural Networks

●  both too many free parameters (edges) and overtraining lead to overfitting

●  initialize the weights with random values from the linear region of the activation function

●  repeat several times to avoid getting stuck in local minima

●  learning the weights AND determining the optimum number of epochs belong to the training phase (sometimes referred to as training and cross-training, or training and validation)
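The points above can be sketched with the smallest possible network, a single sigmoid unit: small random initial weights, a validation set to choose the number of epochs, and several restarts. The learning rate, the initialization range of ±0.1, and the data are assumptions. For a single unit the loss has no local minima, so the restarts only illustrate the bookkeeping.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b, data):
    """Mean cross-entropy of the unit's predictions on (x, y) pairs with y in {0, 1}."""
    total = 0.0
    for x, y in data:
        p = min(max(sigmoid(w * x + b), 1e-12), 1 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def train_with_validation(train, valid, max_epochs=50, rate=0.5, seed=0):
    """Train by stochastic gradient descent; remember the epoch with the lowest validation loss."""
    rng = random.Random(seed)
    # Small initial weights keep the unit in the near-linear region of the sigmoid.
    w, b = rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)
    history = []
    best = (float("inf"), 0, w, b)       # (validation loss, epoch, w, b)
    for epoch in range(1, max_epochs + 1):
        for x, y in train:
            error = sigmoid(w * x + b) - y   # gradient of the loss w.r.t. the pre-activation
            w -= rate * error * x
            b -= rate * error
        loss = log_loss(w, b, valid)
        history.append(loss)
        if loss < best[0]:
            best = (loss, epoch, w, b)
    return best, history

# Hypothetical data with one noisy label in the training set.
train = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)]
valid = [(-1.5, 0), (-0.2, 0), (0.3, 1), (1.5, 1)]

# Several restarts with different random initializations; keep the best validation loss.
runs = [train_with_validation(train, valid, seed=s) for s in range(3)]
(best_loss, best_epoch, w, b), _ = min(runs, key=lambda run: run[0][0])
print(best_epoch, round(best_loss, 3))
```

Because the validation set is used to pick the epoch, it is part of training; an unbiased performance estimate needs a further set of instances that was not used for either purpose.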

●  Develop a predictor for proteases belonging to a certain fold (3D structure)

●  Check PDB for respective entries (structure & function annotation)

●  search database for similar sequences

●  sanity check: predict structure elements / check function annotations

●  => compile positive training set

Use case

●  compile an appropriate negative training set with
   -  same fold but different function
   -  same function but different folds
   -  in real life problems: this is a major challenge

●  decide about the coding (features) and recode your data set

●  train your method

●  evaluate your method

●  estimate stability/confidence

●  performance feature (undesired)

●  detection:
   -  performance on test instances is significantly worse than on training instances

●  reasons:
   -  too high model complexity (too many parameters)
   -  too many training epochs (neural networks)

●  learning schemes tend to minimize the prediction error over all instances

●  if the positive class is small, then errors on the positives (false negatives) do not matter anymore

●  solutions:
   -  oversample the minority class
   -  downsample the majority class
   -  assign weights to the different error types
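The three remedies can be sketched in a few lines. The function names and the toy data (2 positives, 8 negatives) are hypothetical, and the weighting scheme shown (inverse class frequency) is just one common choice. Resampling should only be applied to the training partition, never to the instances used for performance assessment.

```python
import random
from collections import Counter

def group_by_class(data):
    groups = {}
    for x, y in data:
        groups.setdefault(y, []).append((x, y))
    return groups

def oversample_minority(data, seed=0):
    """Draw additional copies (with replacement) until every class is as large as the largest."""
    rng = random.Random(seed)
    groups = group_by_class(data)
    target = max(len(items) for items in groups.values())
    balanced = []
    for items in groups.values():
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    return balanced

def downsample_majority(data, seed=0):
    """Keep a random subset of every class that is as large as the smallest class."""
    rng = random.Random(seed)
    groups = group_by_class(data)
    target = min(len(items) for items in groups.values())
    return [inst for items in groups.values() for inst in rng.sample(items, target)]

def class_weights(data):
    """Error weights inversely proportional to the class frequency."""
    counts = Counter(y for _, y in data)
    return {label: len(data) / (len(counts) * n) for label, n in counts.items()}

# Hypothetical imbalanced data: 2 positives, 8 negatives.
data = [(i, "pos") for i in range(2)] + [(i, "neg") for i in range(2, 10)]

print(Counter(y for _, y in oversample_minority(data)))
print(Counter(y for _, y in downsample_majority(data)))
print(class_weights(data))
```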

