
Page 1

ML4Bio Lecture #1: Introduction

February 24th, 2016

Quaid Morris

Page 2

Course goals

•  Practical introduction to ML
   – Having a basic grounding in the terminology and important concepts in ML, to permit self-study,
   – Being able to select the right method for your problems,
   – Being able to use the multitude of ML tools and methods in R,
   – Being able to troubleshoot problems with tools,
   – Having a foundation to learn other tools: Python's scikit-learn, TensorFlow/Torch/Theano

Page 3

How this course works

•  Course website: Google "ML4BIO Alan Moses"
•  Course email: (but email me if you have questions)
•  Four problem sets, 25% of your grade
•  Programming in R (other languages possible*, but unsupported)
•  Two tutorials: linear algebra review (March 1st), intro to R (March 8th), details on website.

Page 4

Outline

•  Overview of ML
•  Overfitting
•  Cross-validation
•  Measuring success

Page 5

Some slides adapted from:

Probabilistic Modelling and Bayesian Inference

Zoubin Ghahramani
Department of Engineering, University of Cambridge, UK
[email protected]
http://learning.eng.cam.ac.uk/zoubin/

MLSS Tübingen Lectures 2013

Page 6

Classification example

[Figure: scatter plot of points labelled "x" and "o" in a 2D feature space with axes x1 and x2; several unlabelled points are marked "?"]

What is the correct label for the ?'s? How certain am I? How does the label depend on x?

e.g. is this cell a T-cell?

Image from Z. Ghahramani

Page 7

Regression example

[Figure: scatter plot of (x, y) data; x runs from 0 to 10, y from −20 to 60]

What is the relationship between x and y? Given a new value of x, what's my best guess at y? What is the range of variability in y for a given x?

e.g. what is the temperature at this time of the year?

Image from Z. Ghahramani

Page 8

Clustering example

Are a given pair of datapoints from the same cluster? How many clusters are there? What are the characteristics of individual clusters? Are there outliers? How certain am I of the answers to the above questions?

Image from Z. Ghahramani

Page 9

Dimensionality reduction example

Image from PMID: 22595000

Do the datapoints lie on a lower-dimensional "manifold"? If so, what is the dimensionality? How far apart are two datapoints, if you can only travel on the manifold?

Page 10

Dimensionality reduction example II

From "monocle": Trapnell et al., Nat Biotech 2014

Page 11

How to do ML

•  Four parts:
   1.  Data D – describes the machine learning problem
   2.  Model – defines the parameters Θ and describes how the data D depends on them
   3.  Objective function E(Θ, D) – scores Θ for a given dataset D
   4.  Optimization method – finds high-scoring values of Θ

Page 12

The Data

Supervised learning (e.g. deep learning, random forests, SVMs):

D = {(x(1), y(1)), (x(2), y(2)), …, (x(N), y(N))}

where the x(n) are inputs (aka features) and the y(n) are targets (regression) or labels (classification).

Unsupervised learning (e.g. clustering, PCA, dimensionality reduction):

D = {x(1), x(2), …, x(N)}
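To make these two forms of D concrete, here is a minimal R sketch (the feature names and values are invented for illustration):

# Supervised data D: input features x(n) paired with labels y(n)
x <- data.frame(feature1 = rnorm(6), feature2 = rnorm(6))  # inputs (aka features)
y <- factor(c("x", "x", "x", "o", "o", "o"))               # labels (classification)

# Unsupervised data D: features only, no targets or labels
d_unsup <- data.frame(feature1 = rnorm(6), feature2 = rnorm(6))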

Page 13

Data: supervised learning

[Figures: the classification scatter plot (axes x1, x2, with "?" points) and the regression scatter plot (x from 0 to 10, y from −20 to 60)]

Classification: y(n) is a categorical value, sometimes called "discrete"
•  If y(n) is either X or O: binary classification
•  If y(n) is, e.g., either X, O, or +: multiclass classification
•  If y(n) is, e.g., either (X, X), (O, X), (X, O) or (O, O): multilabel classification

Regression: y(n) is a continuous value, aka a real number

Page 14

Data: unsupervised learning

Image from PMID: 22595000  

Page 15

The Model

•  A formal description of how the data depends on the parameters.
•  E.g. linear regression. Data: inputs x, and target values y

ŷ = α x + β

where x is the inputs (aka features), α and β are the model parameters, and ŷ is the model's prediction, aka output ("y hat"), which is compared to the target value y.

Goal: set α and β so that ŷ(n) is as close as possible to y(n) for a given x(n), for all n
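In R, this model can be fit with the built-in lm() function. A minimal sketch on simulated data (the true values α = 2, β = 5 are invented):

# Simulated data: y depends linearly on x, plus noise
x <- runif(50, 0, 10)
y <- 2 * x + 5 + rnorm(50, sd = 3)

fit <- lm(y ~ x)         # estimates beta (intercept) and alpha (slope)
coef(fit)                # the fitted parameters
y_hat <- predict(fit)    # the model's predictions, "y hat"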

Page 16

e.g. Classification

[Figure: the x/o scatter plot with a linear classification boundary]

Model (logistic regression):
1.  the "direction" of the classification boundary, given by a vector ω
2.  a scalar value, γ, indicating how quickly confidence changes as we move away from the boundary
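A minimal sketch of fitting such a model in R with the built-in glm() (the two-feature simulated data and names are invented):

# Simulated binary labels that depend on two features
set.seed(1)
d <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
d$label <- factor(ifelse(d$x1 + d$x2 + rnorm(40, sd = 0.5) > 0, "x", "o"))

fit <- glm(label ~ x1 + x2, data = d, family = binomial)
coef(fit)                             # boundary direction and offset
p <- predict(fit, type = "response")  # confidence that each point is an "x"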

Page 17

e.g. Polynomial regression

[Figure: the regression scatter plot (x from 0 to 10, y from −20 to 60) with a polynomial fit]

Model (e.g. fifth-order polynomial):

ŷ = θ5 x^5 + θ4 x^4 + θ3 x^3 + θ2 x^2 + θ1 x + θ0
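The same lm() call handles polynomial regression via poly(). A sketch, assuming invented nonlinear data:

# Fit a fifth-order polynomial
x <- runif(50, 0, 10)
y <- 0.5 * x^3 - 6 * x^2 + 15 * x + rnorm(50, sd = 5)   # made-up nonlinear data

fit5 <- lm(y ~ poly(x, 5, raw = TRUE))   # coefficients theta_0 ... theta_5
coef(fit5)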

Page 18

The Objective function E(Θ, D), or just E(Θ)

•  Depends on both the data D and the parameters Θ,
•  Measures fit between the model's predictions and the data,
•  Often contains a term to penalize "complex models", sometimes known as regularization
•  D is fixed; the goal is to optimize the function with respect to Θ,
•  Also known as: cost function, error function
•  Example: sum of squared errors (SSE):

E(α, β) = Σn (y(n) − ŷ(n))²
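For the linear model above, the SSE objective takes a few lines of R (a sketch; the example numbers are invented):

# Sum of squared errors for the model y_hat = alpha * x + beta
sse <- function(alpha, beta, x, y) {
  y_hat <- alpha * x + beta
  sum((y - y_hat)^2)
}

sse(2, 5, x = c(1, 2, 3), y = c(7.1, 8.9, 11.2))  # score one candidate (alpha, beta)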

Page 19

The Optimization Method

•  E.g. try all values of α, β on a grid, and choose the best ones (brute force)
•  Take the partial derivatives of E with respect to α, β, find the critical points where they are zero, and determine which are minima (analytical)
•  Start at a random point and follow the gradient of the function to a (local) minimum (gradient descent)

E(α, β) = Σn (y(n) − ŷ(n))²
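A sketch of the gradient descent option applied to this objective (the step size and iteration count are arbitrary illustrative choices):

# Gradient descent on E(alpha, beta) = sum((y - (alpha * x + beta))^2)
x <- c(1, 2, 3, 4, 5)
y <- 2 * x + 5                        # made-up data with known alpha = 2, beta = 5
alpha <- 0; beta <- 0; rate <- 0.01   # starting point and step size

for (i in 1:2000) {
  resid <- y - (alpha * x + beta)
  alpha <- alpha + rate * 2 * sum(resid * x)  # step along -dE/d(alpha)
  beta  <- beta  + rate * 2 * sum(resid)      # step along -dE/d(beta)
}
c(alpha = alpha, beta = beta)  # approaches the minimum near (2, 5)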

Page 20

Important questions

1.  Data – Is the data appropriate for my learning task? Do I have enough data? Are my training data representative? Is there a selection bias in my data?
2.  Model – Is my model sufficiently complex to learn the task? Is overfitting a concern? Can I interpret the parameters or the model?
3.  Objective function – Does my objective function score errors appropriately? Is it too sensitive to outliers? Is it properly regularized?
4.  Optimization method – Will my optimization method find good solutions? Does it get stuck in suboptimal solutions (aka local minima)? Am I using a method matched to the objective function? Is it fast enough?

Page 21

Important concepts

•  Training and test sets
•  Uncertainty about classification
•  Overfitting
•  Cross-validation (leave-one-out)

Page 22

Put yourself in the machine's shoes

Which uncharacterized genes are involved in tRNA processing?

[Figure: heat map with rows Gene1–Gene6 and columns Feature1–Feature5. Each feature is an expression level during heat shock; a colour key defines which colour counts as "blue".]

Page 23

Training

[Figure: the gene-by-feature heat map, with rows grouped into Positives and Negatives]

What pattern distinguishes the positives and the negatives?

Page 24

Training

[Figure: the Positives/Negatives heat map]

•  4 blue features
•  features 1, 3, and 5 are blue
•  features 1 and 3 are blue and feature 2 is red
•  features 1 and 3 are blue

Page 25

Training

[Figure: the Positives/Negatives heat map, now labelled "Known genes"]

•  features 1 and 3 are blue and feature 2 is red
•  features 1 and 3 are blue

Page 26

Training

[Figure: the known genes heat map]

•  features 1 and 3 are blue

Page 27

Prediction

Which genes are involved in tRNA processing?

[Figure: heat map of six uncharacterized genes, Gene1–Gene6, labelled "Unknowns"]

Page 28

Prediction

Which genes are involved in tRNA processing?

[Figure: the unknowns heat map, with Feature1 and Feature3 highlighted]

Gene    Features 1 and 3 blue?
Gene1   Yes
Gene2   Yes
Gene3   No
Gene4   Yes
Gene5   No
Gene6   No

Page 29

Prediction

Which genes are involved in tRNA processing?

Gene    Features 1 and 3 blue?   Prediction     Assay
Gene1   Yes                      Involved       +
Gene2   Yes                      Involved       +
Gene3   No                       Not Involved   −
Gene4   Yes                      Involved       +
Gene5   No                       Not Involved   −
Gene6   No                       Not Involved   −

Page 30

Training under sparse annotation

[Figure: the Positives/Negatives heat map with fewer annotated genes]

What pattern distinguishes the positives and the negatives?

•  4 blue features
•  features 1 and 3 are blue

Page 31

Prediction under sparse annotation

Which genes are involved in tRNA processing?

Gene    Features 1 and 3 blue?   Four blue features?
Gene1   Yes                      Yes
Gene2   Yes                      No
Gene3   No                       No
Gene4   Yes                      No
Gene5   No                       Yes
Gene6   No                       No

Page 32

Prediction under sparse annotation

Gene    Features 1 and 3 blue?   Four blue features?   Confidence
Gene1   Yes                      Yes                   1.0
Gene2   Yes                      No                    0.5
Gene3   No                       No                    0
Gene4   Yes                      No                    0.5
Gene5   No                       Yes                   0.5
Gene6   No                       No                    0

Legend: 1.0 = definitely involved · 0.5 = may be involved · 0 = definitely not involved

Page 33

Prediction under sparse annotation

Gene    Features 1 and 3 blue?   Four blue features?   Confidence
Gene1   Yes                      Yes                   1.0
Gene2   Yes                      No                    0.5
Gene3   No                       No                    0
Gene4   Yes                      No                    0.5
Gene5   No                       Yes                   0.5
Gene6   No                       No                    0

Prediction: Gene1, and probably Genes 2, 4, and 5 are involved in tRNA processing.

Page 34

Experimental validation

Gene    Confidence   Label
Gene1   1.0          +
Gene2   0.5          +
Gene3   0            −
Gene4   0.5          +
Gene5   0.5          −
Gene6   0            −

Page 35

Experimental validation

Gene    Confidence   Label
Gene1   1.0          +
Gene2   0.5          +
Gene3   0            −
Gene4   0.5          +
Gene5   0.5          −
Gene6   0            −

One correct "confidence 1" prediction

Page 36

Experimental validation

Gene    Confidence   Label
Gene1   1.0          +
Gene2   0.5          +
Gene3   0            −
Gene4   0.5          +
Gene5   0.5          −
Gene6   0            −

Two out of three "confidence 0.5" predictions correct.

Page 37

Validation results

Gene    Confidence
Gene1   1.0
Gene2   0.5
Gene3   0
Gene4   0.5
Gene5   0.5
Gene6   0

Confidence cutoff   # True Positives   # False Positives
1                   1                  0
0.5                 3                  1
0                   3                  3

Page 38

Noisy features

[Figure: the Positives/Negatives heat map; one cell is flagged as an incorrect measurement that should be blue]

Page 39

Noisy features

[Figure: the Positives/Negatives heat map with the noisy cell]

What distinguishes the positives and the negatives?

Page 40

Noisy features + sparse data = overfitting

[Figure: the sparsely annotated heat map with the noisy cell]

What distinguishes the positives and the negatives?

Page 41

Training

[Figure: the sparsely annotated, noisy heat map]

•  4 blue features

Page 42

Prediction

Gene    Four blue features?   Confidence
Gene1   Yes                   1.0
Gene2   No                    0
Gene3   No                    0
Gene4   No                    0
Gene5   Yes                   1.0
Gene6   No                    0

Prediction: Genes 1 and 5 are involved in tRNA processing.

Page 43

Experimental validation

Gene    Four blue features?   Confidence
Gene1   Yes                   1.0
Gene2   No                    0
Gene3   No                    0
Gene4   No                    0
Gene5   Yes                   1.0
Gene6   No                    0

One incorrect high confidence prediction, i.e., one false positive

Page 44

Experimental validation

Gene    Four blue features?   Confidence
Gene1   Yes                   1.0
Gene2   No                    0
Gene3   No                    0
Gene4   No                    0
Gene5   Yes                   1.0
Gene6   No                    0

Two genes missed completely, i.e., two false negatives

Page 45

Experimental validation

Gene    Four blue features?   Confidence
Gene1   Yes                   1.0
Gene2   No                    0
Gene3   No                    0
Gene4   No                    0
Gene5   Yes                   1.0
Gene6   No                    0

One incorrect high confidence prediction, two genes missed completely

Page 46

Validation results

Gene    Confidence
Gene1   1.0
Gene2   0
Gene3   0
Gene4   0
Gene5   1.0
Gene6   0

Confidence cutoff   # True Positives   # False Positives
1                   1                  1
0                   3                  3

Page 47

What have we learned?

•  Sparse data: many different patterns distinguish positives and negatives.
•  Noisy features: the actual distinguishing pattern may not be observable.
•  Sparse data + noisy features: may detect, and be highly confident in, spurious, incorrect patterns. This is overfitting.

Page 48

Validation

•  Different algorithms assign confidence to their predictions differently
•  Need to:
   1.  Determine the meaning of each algorithm's confidence score.
   2.  Determine what level of confidence is warranted by the data.

Page 49

Cross-validation

Basic idea: hold out part of the data and use it to validate confidence levels

Page 50

Cross-validation

[Figure: the Positives/Negatives heat map of labelled data]

Page 51

Cross-validation

[Figure: the Positives/Negatives heat map with three genes set aside as a hold-out set, with labels +, +, −]

Page 52

Cross-validation: training

[Figure: the labelled heat map with the hold-out genes removed]

Page 53

Cross-validation: training

[Figure: the training heat map]

•  Features 1 and 3 are blue

Page 54

Cross-validation: testing

Features 1 and 3 blue? (hold-out set)
Yes
No
No

Page 55

Cross-validation: testing

Features 1 and 3 blue? (hold-out set)   Confidence
Yes                                     1.0
No                                      0
No                                      0

Page 56

Cross-validation: testing

Features 1 and 3 blue? (hold-out set)   Confidence   Label
Yes                                     1.0          +
No                                      0            +
No                                      0            −

Page 57

Cross-validation: testing

Features 1 and 3 blue? (hold-out set)   Confidence   Label
Yes                                     1.0          +
No                                      0            +
No                                      0            −

Confidence cutoff   # True Positives   # False Positives
1                   1                  0
0                   2                  1

Page 58

N-fold cross validation

Step 1: Randomly reorder the rows of the labelled data
Step 2: Split into N sets (e.g. N = 5)
Step 3: Train N times, using each split, in turn, as the hold-out set

[Figure: labelled data (+/− labels) → permuted data → training splits]
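These three steps take only a few lines of R (a sketch; the data frame and fold count are invented for illustration):

# N-fold cross-validation skeleton
set.seed(42)
d <- data.frame(feature1 = rnorm(20), feature2 = rnorm(20),
                label = factor(rep(c("+", "-"), 10)))
n_folds <- 5

d <- d[sample(nrow(d)), ]                      # Step 1: randomly reorder rows
fold <- rep(1:n_folds, length.out = nrow(d))   # Step 2: split into N sets

for (k in 1:n_folds) {                         # Step 3: train N times
  train   <- d[fold != k, ]                    # training split
  holdout <- d[fold == k, ]                    # hold-out split
  # ... fit the model on `train`, then record its confidences on `holdout`
}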


Page 60

Cross-validation results

Confidence cutoff   # True Positives   # False Positives
1                   3                  0
0.75                3                  1
0.5                 4                  2
0.25                5                  3
0                   5                  5

Page 61

Displaying results: ROC curves

Confidence cutoff   # True Positives   # False Positives
1                   3                  0
0.75                3                  1
0.5                 4                  2
0.25                5                  3
0                   5                  5

[Figure: ROC curve plotting # TP (vertical axis) against # FP (horizontal axis), both running up to 5]
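Plotting the curve from this table is a one-liner in R (a sketch using the counts above):

# ROC curve from the cross-validation results above
fp <- c(0, 1, 2, 3, 5)   # false positives, from strictest to loosest cutoff
tp <- c(3, 3, 4, 5, 5)   # true positives at the same cutoffs

plot(fp, tp, type = "b", xlim = c(0, 5), ylim = c(0, 5),
     xlab = "# False Positives", ylab = "# True Positives", main = "ROC curve")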

Page 62

Making new predictions

Confidence cutoff   # True Positives   # False Positives
1                   3                  0
0.75                3                  1
0.5                 4                  2
0.25                5                  3
0                   5                  5

[Figure: the ROC curve, with an "x" marking the chosen confidence cutoff (operating point) for making new predictions]

Page 63

Figures of merit

Precision: #TP / (#TP + #FP)  (also known as positive predictive value)
Recall: #TP / (#TP + #FN)  (also known as sensitivity)
Specificity: #TN / (#FP + #TN)
Negative predictive value: #TN / (#FN + #TN)
Accuracy: (#TP + #TN) / (#TP + #FP + #TN + #FN)

Confusion matrix:

               Actual T   Actual F
Predicted T    TP         FP
Predicted F    FN         TN
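A small R helper that computes these figures of merit from the four confusion-matrix counts (a sketch; the example call uses the cross-validation results at cutoff 0.5, where 5 positives and 5 negatives give TP = 4, FP = 2, FN = 1, TN = 3):

# Figures of merit from confusion-matrix counts
figures_of_merit <- function(tp, fp, fn, tn) {
  c(precision   = tp / (tp + fp),              # positive predictive value
    recall      = tp / (tp + fn),              # sensitivity
    specificity = tn / (fp + tn),
    npv         = tn / (fn + tn),              # negative predictive value
    accuracy    = (tp + tn) / (tp + fp + tn + fn))
}

figures_of_merit(tp = 4, fp = 2, fn = 1, tn = 3)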

Page 64

Area under the ROC curve

[Figure: ROC curve plotting Sensitivity against 1 − Specificity, both axes from 0 to 1]

Area Under the ROC Curve (AUC) = average proportion of negatives with confidence levels less than a random positive

Quick facts:
-  0 < AUC < 1
-  AUC of random classifier = 0.5
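The definition above translates directly into R (a sketch; the example reuses the six validation genes from the earlier slides, with ties counting as half):

# AUC: probability that a random positive gets a higher confidence
# than a random negative (ties count as half)
auc <- function(conf, label) {
  pos <- conf[label == "+"]
  neg <- conf[label == "-"]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

auc(conf  = c(1.0, 0.5, 0, 0.5, 0.5, 0),
    label = c("+", "+", "-", "+", "-", "-"))   # 8/9, about 0.89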