
Deep Learning for Vision

Xiaoming Liu

Slides adapted from Adam Coates


What do we want ML to do?

•  Given an image, predict complex high-level patterns:

[Figure: example tasks: object recognition ("Cat"), detection, and segmentation. Martin et al., 2001]


How is ML done?

•  Machine learning often uses a common pipeline with hand-designed feature extraction.
•  The final ML algorithm learns to make decisions starting from the higher-level representation.
•  Sometimes there are layers of increasingly high-level abstractions.
   –  Constructed using prior knowledge about the problem domain.

[Diagram: Feature Extraction → Machine Learning Algorithm → "Cat"?, with Prior Knowledge / Experience feeding the feature extraction stage.]


“Deep Learning”

•  Deep Learning:
   –  Train multiple layers of features/abstractions from data.
   –  Try to discover a representation that makes decisions easy.

[Diagram: Low-level Features → Mid-level Features → High-level Features → Classifier → "Cat"?; the representation becomes more abstract at each layer.]

Deep Learning: train layers of features so that the classifier works well.


“Deep Learning”

•  Why do we want "deep learning"?
   –  Some decisions require many stages of processing.
      •  It is easy to invent cases where a "deep" model is compact but a shallow model is very large / inefficient.
   –  We already, intuitively, hand-engineer "layers" of representation.
      •  Let's replace this with something automated!
   –  Algorithms scale well with data and computing power.
      •  In practice, one of the most consistently successful ways to get good results in ML.
   –  Can try to take advantage of unlabeled data to learn representations before the task.


Have we been here before?

•  Yes.
   –  Basic ideas are common to past ML and neural networks research.
      •  Supervised learning is straightforward.
      •  Standard ML development strategies are still relevant.
      •  Some knowledge carries over from problem domains.

•  No.
   –  Faster computers; more data.
   –  Better optimizers; better initialization schemes.
      •  The "unsupervised pre-training" trick [Hinton et al., 2006; Bengio et al., 2006].
   –  Lots of empirical evidence about what works.
      •  Made useful by the ability to "mix and match" components. [See, e.g., Jarrett et al., ICCV 2009]


Real impact

•  DL systems are high performers in many tasks over many domains.

Image recognition [e.g., Krizhevsky et al., 2012]
Speech recognition [e.g., Heigold et al., 2013]
NLP [e.g., Socher et al., ICML 2011; Collobert & Weston, ICML 2008]

[Slide credit: Honglak Lee]


Outline

•  ML refresher / crash course
   –  Logistic regression
   –  Optimization
   –  Features
•  Supervised deep learning
   –  Neural network models
   –  Back-propagation
   –  Training procedures
•  Supervised DL for images
   –  Neural network architectures for images
   –  Application to Image-Net
•  References / Resources

MACHINE LEARNING REFRESHER

Crash Course


Supervised Learning

•  Given labeled training examples: {(x(i), y(i)), i = 1, …, m}.

•  For instance: x(i) = vector of pixel intensities; y(i) = object class ID.

•  Goal: find f(x) to predict y from x on the training data.
   –  Hopefully, the learned predictor also works on "test" data.

[Figure: an image as a vector of pixel intensities (255, 98, 93, 87, …) → f(x) → y = 1 ("Cat").]


Logistic Regression

•  Simple binary classification algorithm.
   –  Start with a function of the form:

         f(x) = σ(θᵀx) = 1 / (1 + exp(−θᵀx))

   –  Interpretation: f(x) is the probability that y = 1.
      •  The sigmoid "nonlinearity" σ squashes the linear function to [0, 1].
   –  Find the choice of θ that minimizes the objective (the negative log-likelihood):

         J(θ) = −Σᵢ [ y(i) log f(x(i)) + (1 − y(i)) log(1 − f(x(i))) ]

[Figure: the sigmoid function, rising from 0 to 1.]


Optimization

•  How do we tune θ to minimize J(θ)?
•  One algorithm: gradient descent.
   –  Compute the gradient ∇θ J(θ).
   –  Follow the gradient "downhill": θ ← θ − α ∇θ J(θ), for a step size α.
•  Stochastic Gradient Descent (SGD): take each step using the gradient from only a small batch of examples.
   –  Scales to larger datasets. [Bottou & LeCun, 2005]

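To make the refresher concrete, here is a minimal NumPy sketch of logistic regression trained with (stochastic) gradient descent. The function and variable names are ours, not from the slides; the objective is the negative log-likelihood J(θ) above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X):
    # f(x) = sigmoid(theta^T x), applied to each row of X
    return sigmoid(X @ theta)

def loss(theta, X, y):
    # Negative log-likelihood J(theta) for binary labels y in {0, 1}
    f = predict(theta, X)
    return -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))

def gradient(theta, X, y):
    # For the sigmoid + log-loss pair, grad J(theta) = X^T (f - y)
    return X.T @ (predict(theta, X) - y)

def sgd_step(theta, X_batch, y_batch, alpha=0.1):
    # One SGD step: follow the gradient "downhill" using a small batch
    return theta - alpha * gradient(theta, X_batch, y_batch)
```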

Is this enough?

•  The loss is convex → we always find the minimum.
•  Works for simple problems:
   –  Classify digits as 0 or 1 using pixel intensity.
   –  Certain pixels are highly informative, e.g., the center pixel.
•  Fails for even slightly harder problems.
   –  Is this a coffee mug?


Why is vision so hard?

[Figure: a "Coffee Mug" image represented as raw pixel intensities.]

Pixel intensity is a very poor representation.


[Figure: training images plotted by the intensities of two pixels; + = Coffee Mug, − = Not Coffee Mug.]

Why is vision so hard?

[Figure: the same two-pixel plot with a new test image at pixel intensities [72, 160]; nearby training examples include both + and −.]


Why is vision so hard?

[Figure: in pixel-intensity space, + (Coffee Mug) and − (Not Coffee Mug) examples are heavily intermixed, so a learning algorithm cannot reliably answer "Is this a Coffee Mug?" from raw pixels.]


Features

[Figure: the same examples re-plotted by two features, "cylinder?" and "handle?"; the + and − examples become cleanly separable, and the learning algorithm can now answer "Is this a Coffee Mug?".]


Features

•  Features are usually hard-wired transformations built into the system.
   –  Formally, a function that maps the raw input to a "higher level" representation.
   –  Completely static, so just substitute φ(x) for x and do logistic regression as before.

Where do we get good features?


Features

•  Huge investment has been devoted to building application-specific feature representations.
   –  Find higher-level patterns so that the final decision is easy to learn with an ML algorithm.

Examples: Object Bank [Li et al., 2010]; Super-pixels [Gould et al., 2008; Ren & Malik, 2003]; SIFT [Lowe, 1999]; Spin Images [Johnson & Hebert, 1999]

SUPERVISED DEEP LEARNING

Extension to neural networks


Basic idea

•  We saw how to do supervised learning when the "features" φ(x) are fixed.
   –  Let's extend to the case where the features are given by tunable functions with their own parameters.

[Diagram: the inputs to the classifier are "features", one for each row of a weight matrix W; the outer part of the function is the same as logistic regression.]


Basic idea

•  To do supervised learning for two-class classification, minimize the same objective J(θ) as before.

•  Same as logistic regression, but now f(x) has multiple stages ("layers", "modules"):

[Diagram: x → intermediate representation ("features") → prediction for y.]


Neural network

•  This model is a sigmoid "neural network":

[Diagram: each "neuron" applies the sigmoid to a weighted sum of its inputs; computation flows from input to output ("forward prop").]
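As a concrete (illustrative, not from the slides) sketch, forward propagation for a one-hidden-layer sigmoid network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b, w, c):
    # Hidden layer: tunable "features" phi(x); one feature per row of W.
    h = sigmoid(W @ x + b)
    # Output unit: ordinary logistic regression on the learned features.
    f = sigmoid(w @ h + c)
    return f, h
```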


Neural network

•  Can stack up several layers: the network must learn multiple stages of internal "representation".


Back-propagation

•  Minimize the same objective J(θ) as before.

•  To minimize J(θ) we need gradients with respect to all of the parameters.
   –  Then use the gradient descent algorithm as before.

•  The formula for the gradient of the output layer's parameters can be found by hand (same as before); but what about the inner weight matrices W?


The Chain Rule

•  Suppose we have a module y = f(x; W) inside the network.

•  If we know the gradient of the loss J with respect to the module's output y, the chain rule gives the gradient with respect to its input:

      ∂J/∂x = (∂y/∂x)ᵀ ∂J/∂y        (∂y/∂x is the Jacobian matrix)

   Similarly for W:

      ∂J/∂W = (∂y/∂W)ᵀ ∂J/∂y

•  Given the gradient with respect to the output, we can build a new "module" that finds the gradient with respect to the inputs.


The Chain Rule

•  Easy to build a toolkit of known rules for computing gradients, given the gradient with respect to each module's output.
   –  Automated differentiation! E.g., Theano [Bergstra et al., 2010].

[Table: common functions with their gradients w.r.t. inputs and w.r.t. parameters.]


Back-propagation

•  Can re-apply the chain rule to get gradients for all intermediate values and parameters.

"Backward" modules for each forward stage.
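A sketch of back-propagation for the one-hidden-layer network above, re-applying the chain rule one stage at a time. It assumes the log-loss objective and uses σ'(z) = σ(z)(1 − σ(z)); names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward(x, y, W, b, w, c):
    # Forward pass, keeping intermediate values.
    h = sigmoid(W @ x + b)        # hidden activations
    f = sigmoid(w @ h + c)        # predicted P(y = 1 | x)

    # Backward pass: each stage turns the gradient w.r.t. its output
    # into gradients w.r.t. its inputs and its parameters.
    df = f - y                    # dJ/d(output pre-activation) for log-loss + sigmoid
    dw = df * h                   # dJ/dw
    dc = df                       # dJ/dc
    dh = df * w                   # dJ/dh: chain rule through the output unit
    dz = dh * h * (1.0 - h)       # through the hidden sigmoid: sigma' = h(1 - h)
    dW = np.outer(dz, x)          # dJ/dW
    db = dz                       # dJ/db
    return dW, db, dw, dc
```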


Example

•  Given the gradient of the loss with respect to the network's output, compute the gradient with respect to an inner weight matrix W, using several items from our table.


Training Procedure

•  Collect labeled training data.
   –  For SGD: randomly shuffle after each epoch!
•  For each batch of examples:
   –  Compute the gradient w.r.t. all parameters in the network.
   –  Make a small update to the parameters.
•  Repeat until convergence. (A sketch of the loop follows.)
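A hedged sketch of that loop (the `sgd_step` from the refresher, or any other update rule, can serve as `step`; batch size and epoch count are placeholders):

```python
import numpy as np

def train(X, y, params, step, alpha=0.1, batch_size=64, epochs=10):
    # Generic SGD loop: shuffle every epoch, update on small batches.
    m = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(m)   # randomly shuffle after each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            params = step(params, X[idx], y[idx], alpha)
    return params
```

In practice "repeat until convergence" means monitoring the training or validation loss rather than fixing `epochs` in advance.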


Training Procedure

•  Historically, this has not worked so easily.
   –  Non-convex: local minima; convergence criteria.
   –  Optimization becomes difficult with many stages.
      •  The "vanishing gradient problem".
   –  Hard to diagnose and debug malfunctions.
•  Many things turn out to matter:
   –  Choice of nonlinearities.
   –  Initialization of parameters.
   –  Optimizer parameters: step size, schedule.


Nonlinearities

•  The choice of functions inside the network matters.
   –  The sigmoid function turns out to be difficult.
   –  Some other choices are often used (defined in the sketch below):

[Plots: tanh(z), saturating at −1 and 1; ReLU(z) = max{0, z}; abs(z).]

"Rectified Linear Unit" → increasingly popular. [Nair & Hinton, 2010]
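For reference, the three nonlinearities in NumPy (function names are ours):

```python
import numpy as np

def tanh(z):
    return np.tanh(z)              # saturates at -1 and 1

def relu(z):
    return np.maximum(0.0, z)      # "Rectified Linear Unit": max{0, z}

def abs_unit(z):
    return np.abs(z)               # absolute-value rectification
```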


Initialization

•  Usually small random values.
   –  Try to choose them so that the typical input to a neuron avoids the saturating / non-differentiable regions.
   –  Has to be random, otherwise every neuron will be equal!
   –  Occasionally inspect units for saturation / blowup.
   –  Larger values may give faster convergence, but worse models!

•  Initialization schemes for particular units (a sketch follows):
   –  tanh units: Unif[−r, r]; sigmoid units: Unif[−4r, 4r]. See [Glorot et al., AISTATS 2010].

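A sketch of the cited scheme, assuming the Glorot & Bengio (2010) choice r = sqrt(6 / (fan_in + fan_out)):

```python
import numpy as np

def init_weights(fan_in, fan_out, unit="tanh"):
    # Small random values chosen so typical pre-activations avoid the
    # saturating regions of the nonlinearity [Glorot et al., AISTATS 2010].
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if unit == "sigmoid":
        r *= 4.0                   # sigmoid units use Unif[-4r, 4r]
    return np.random.uniform(-r, r, size=(fan_out, fan_in))
```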

Optimization: Step sizes

•  Choose the SGD step size carefully.
   –  A factor of ~2 can make a difference.
•  Strategies:
   –  Brute force: try many; pick the one with the best result.
   –  Racing: pick the size with the best error on validation data after T steps.
      •  Not always accurate if T is too small.
•  Step size schedules (sketched below):
   –  Fixed: same step size throughout.
   –  Step
   –  Multi-step
   –  Inverse

Bengio, 2012: "Practical Recommendations for Gradient-Based Training of Deep Architectures"
Hinton, 2010: "A Practical Guide to Training Restricted Boltzmann Machines"
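The slide gives only the schedule names; here is a sketch following common (Caffe-style) definitions, with all constants as placeholders:

```python
def step_size(alpha0, t, schedule="fixed", gamma=0.1, step=10000, power=0.75):
    # alpha0: initial step size; t: iteration counter.
    if schedule == "fixed":        # same step size throughout
        return alpha0
    if schedule == "step":         # multiply by gamma every `step` iterations
        return alpha0 * gamma ** (t // step)
    if schedule == "multistep":    # multiply by gamma at hand-picked milestones
        milestones = [60000, 80000]
        return alpha0 * gamma ** sum(t >= m for m in milestones)
    if schedule == "inverse":      # alpha0 / (1 + gamma * t)^power
        return alpha0 / (1.0 + gamma * t) ** power
    raise ValueError(schedule)
```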


Optimization: Momentum

•  Keep a "smooth" estimate of the gradient from several steps of SGD:

      v ← µv − α ∇θ J(θ);    θ ← θ + v

•  A little bit like second-order information.
   –  High-curvature directions cancel out.
   –  Low-curvature directions "add up" and accelerate.

Bengio, 2012: "Practical Recommendations for Gradient-Based Training of Deep Architectures"
Hinton, 2010: "A Practical Guide to Training Restricted Boltzmann Machines"
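A sketch of the classical momentum update (standard form; the coefficient µ and step size are placeholders):

```python
def momentum_step(theta, velocity, grad, alpha=0.01, mu=0.9):
    # Running, exponentially weighted estimate of the gradient:
    # high-curvature directions cancel across steps, low-curvature
    # directions accumulate and accelerate.
    velocity = mu * velocity - alpha * grad
    return theta + velocity, velocity
```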


Other factors

•  A "weight decay" penalty can help.
   –  Add a small penalty on the squared weight magnitude (a sketch follows).

•  For modest datasets, L-BFGS or second-order methods are easier to use than SGD.
   –  See, e.g., Martens & Sutskever, ICML 2011.
   –  Can be crudely extended to the mini-batch case if batches are large. [Le et al., ICML 2011]
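Weight decay as a one-line gradient modification, sketched under the convention that the penalty (λ/2)·‖θ‖² is added to the loss:

```python
def grad_with_weight_decay(grad, theta, lam=1e-4):
    # The squared-magnitude penalty contributes lam * theta to the
    # gradient, gently shrinking the weights at every step.
    return grad + lam * theta
```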

SUPERVISED DL FOR VISION

Application


Working with images

•  Major factors:
   –  Choose the functional form of the network to roughly match the computations we need to represent.
      •  E.g., "selective" features and "invariant" features.
   –  Try to exploit knowledge of images to accelerate training or improve performance.

•  Generally, try to avoid wiring detailed visual knowledge into the system; prefer to learn it.


Local connectivity

•  Neural network view of a single neuron, fully connected to the input image: an extremely large number of connections.
   →  More parameters to train.
   →  Higher computational expense.
   →  Turns out not to be helpful in practice.


Local connectivity

•  Reduce parameters with local connections.
   –  The weight vector is a spatially localized "filter".


Local connectivity

•  Sometimes think of neurons as viewing small adjacent windows.
   –  Specify connectivity by the size ("receptive field" size) and spacing ("step" or "stride") of the windows.
      •  Typical RF size = 3 to 20 pixels.
      •  Typical step size = 1 pixel, up to the RF size.


Local connectivity

•  The spatial organization of the filters means the output features can also be organized like an image.
   –  The X, Y dimensions correspond to the X, Y position of the neuron's window.
   –  "Channels" are different features extracted from the same spatial location. (Also called "feature maps", or just "maps".)

[Figure: 1-dimensional example; a 1D input maps to outputs laid out by X spatial location and "channel" (map) index.]


Local connectivity

•  We can treat the output of a layer like an image and re-use the same tricks.

[Figure: the same 1-dimensional construction applied to the previous layer's output maps.]


Weight-Tying

•  Even with local connections, we may still have too many weights.
   –  Trick: constrain some weights to be equal if we know that some parts of the input should learn the same kinds of features.
   –  Images tend to be "stationary": different patches tend to have similar low-level structure.
      •  So constrain the weights used at different spatial positions to be equal.


Weight-Tying

•  Before, we could have neurons with different weights at different locations. We can reduce parameters by constraining them to be equal.

[Figure: 1-dimensional example with tied weights; the same filter slides across all X spatial locations within each map.]

•  This is sometimes called a "convolutional" network: each unique filter is spatially convolved with the input to produce the responses for its map (a 1-D sketch follows). [LeCun et al., 1989; LeCun et al., 2004]
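A 1-dimensional sketch matching the figures above: local receptive fields, a stride, and tied weights together amount to convolving each filter with the input (shapes and names are illustrative):

```python
import numpy as np

def conv1d_layer(x, filters, stride=1):
    # x: 1-D input signal.
    # filters: shape (num_maps, rf_size); one spatially localized
    #          filter per output channel ("map").
    num_maps, rf = filters.shape
    starts = list(range(0, len(x) - rf + 1, stride))
    out = np.empty((num_maps, len(starts)))
    for j, p in enumerate(starts):
        window = x[p:p + rf]          # local "receptive field" at position p
        out[:, j] = filters @ window  # tied weights: same filters everywhere
    return out                        # shape: (num_maps, num_positions)
```

For example, filters of shape (4, 5) applied to a length-20 input with stride 1 give a (4, 16) output: 4 maps, one response per window position.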


Pooling

•  Functional layers designed to represent invariant features.
•  Usually locally connected, with specific nonlinearities.
   –  Combined with convolution, this corresponds to hard-wired translation invariance.
•  Usually fix the weights to a local box or Gaussian filter.
   –  Easy to represent max-, average-, or 2-norm pooling (sketched below).

[Scherer et al., ICANN 2010] [Boureau et al., ICML 2010]
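A matching 1-D pooling sketch (max, average, or 2-norm over small windows; names are ours):

```python
import numpy as np

def pool1d(feature_map, size=2, stride=2, kind="max"):
    # Pool small neighborhoods of a feature map to build locally
    # translation-invariant features.
    starts = range(0, len(feature_map) - size + 1, stride)
    windows = [feature_map[p:p + size] for p in starts]
    if kind == "max":
        return np.array([w.max() for w in windows])
    if kind == "average":
        return np.array([w.mean() for w in windows])
    if kind == "l2":
        return np.array([np.sqrt(np.sum(w ** 2)) for w in windows])
    raise ValueError(kind)
```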


Batch Normalization

•  Reduces internal covariate shift during training.
•  Normalizes spatially over all samples in each batch (a sketch in code follows the equations).

[Figure: a batch of m feature maps, Sample 1 … Sample m.]

For a batch B of m samples with W × H spatial positions per map:

   µ_B ← (1 / (W·H·m)) Σ_{w,h,i} x_{w,h,i}

   σ²_B ← (1 / (W·H·m)) Σ_{w,h,i} (x_{w,h,i} − µ_B)²

   x̂_{w,h,i} ← (x_{w,h,i} − µ_B) / √(σ²_B + ε)

   y_{w,h,i} ← γ·x̂_{w,h,i} + β ≡ BN_{γ,β}(x_{w,h,i})
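A per-channel sketch of these equations in NumPy (training-time statistics only; the running averages used at inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: one channel over a batch, shape (m, H, W).
    # Normalize spatially over all samples in the batch, then apply
    # the learned scale gamma and shift beta.
    mu = x.mean()                          # mean over batch and spatial dims
    var = x.var()                          # variance over batch and spatial dims
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # BN_{gamma,beta}(x)
```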


Application: Image-Net

•  System from Krizhevsky et al., NIPS 2012:
   –  Convolutional neural network.
   –  Max-pooling.
   –  Rectified linear units (ReLU).
   –  Local response normalization.
   –  Local connectivity.


Application: Image-Net

•  Top result in LSVRC 2012: ~85% top-5 accuracy.

What's an Agaric!?


More applications

•  Segmentation: predict the classes of pixels / super-pixels.

•  Detection: combine classifiers with a sliding-window architecture.
   –  Economical when used with convolutional nets.

•  Robotic grasping. [Lenz et al., RSS 2013]

[Figures: Ciresan et al., NIPS 2012; Farabet et al., ICML 2012; Pierre Sermanet (2010): http://www.youtube.com/watch?v=f9CuzqI1SkE]

Go

AlphaGo beat Lee Sedol, one of the world's top players, 4-1 in March 2016 (win-win-win-loss-win). Games are available at https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol


How does AlphaGo work?

(For all the details, see their paper: www.nature.com/nature/journal/v529/n7587/pdf/nature16961.pdf)

•  Combines two techniques:
   –  Monte Carlo Tree Search (MCTS)
      •  Historical advantages of MCTS over minimax-search approaches:
         –  Does not require an evaluation function.
         –  Typically works better with large branching factors.
      •  Improved Go programs by ~10 kyu around 2006.
   –  Deep Learning
      •  "Value networks" to evaluate board positions, and
      •  "Policy networks" to select moves.


AlphaGo hardware

•  Evaluating policy and value networks requires several orders of magnitude more computation than traditional search heuristics.

•  AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs and computes policy and value networks in parallel on GPUs.

•  The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs.

•  (They also implemented a distributed version.)


Speech Recognition

A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition", arXiv:1412.5567v2, 2014.

Input: acoustic features (spectrograms)
Output: characters (and space), plus a null symbol (~)
No phoneme or lexicon models (no OOV problem)


Resources

Tutorials:

Stanford Deep Learning tutorial: http://ufldl.stanford.edu/wiki

Deep Learning tutorials list: http://deeplearning.net/tutorials

IPAM DL/UFL Summer School: http://www.ipam.ucla.edu/programs/gss2012/

ICML 2012 Representation Learning Tutorial: http://www.iro.umontreal.ca/~bengioy/talks/deep-learning-tutorial-2012.html


References

http://www.stanford.edu/~acoates/bmvc2013refs.pdf

Overviews:

Yoshua Bengio, "Practical Recommendations for Gradient-Based Training of Deep Architectures"

Yoshua Bengio & Yann LeCun, "Scaling Learning Algorithms towards AI"

Yoshua Bengio, Aaron Courville & Pascal Vincent, "Representation Learning: A Review and New Perspectives"

Software:

Theano GPU library: http://deeplearning.net/software/theano
SPAMS toolkit: http://spams-devel.gforge.inria.fr/
