
Deep Learning: An Overview

Bradley J Erickson, MD PhD

Mayo Clinic, Rochester

Medical Imaging Informatics and Teleradiology Conference, 1:30-2:05 pm, June 17, 2016

Disclosures

• Relationships with commercial interests:

– Board of OneMedNet

– Board of VoiceIT

What is “Machine Learning”?

• It is a part of Artificial Intelligence

• Finds patterns in data

– Patterns that relate examples to known labels (supervised)

– Patterns that group or separate examples without labels (unsupervised)

• (Other types of artificial intelligence include rule-based systems)

Machine Learning Classes

• Supervised: ANN, SVM, Random Forest, Bayes, DNN

• Unsupervised: Clustering, Adaptive Resonance

• Reinforced

Machine Learning History

• Artificial Neural Networks (ANN)

– Starting point of machine learning

– Early versions didn’t work well

• Other Machine Learning Methods

– Naïve Bayes

– Support Vector Machine (SVM)

– Random Forest Classifier (RFC)

Artificial Neural Network/Perceptron

[Figure, built up over several slides: a perceptron with an input layer (T1 Pre, T1 Post, T2 voxel values, e.g., 45, 322, 128), a hidden layer of f(Σ) nodes (example sums 34, 57, 418, -68, 312), and an output layer labeled Tumor and Brain (example outputs 1 and 0).]

How ANNs Learn

• Propagation

– Multiply each prior-layer node value by its weight and sum the results

– Apply an activation function, e.g., threshold the sum

• Weight Update (see the sketch below)

– Compute error = actual output – expected output

– Weight gradient = error × input value

– New weight = old weight – learning rate × gradient
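A minimal NumPy sketch of these two steps (illustrative only; the layer sizes, input values, and learning rate are made up, and a single threshold layer stands in for a full network):

    import numpy as np

    def forward(x, w, b):
        # Propagation: multiply the prior-layer values by the weights, sum,
        # then apply an activation function (here a simple threshold).
        return (w @ x + b > 0).astype(float)

    rng = np.random.default_rng(0)
    w = rng.normal(size=(2, 3))             # 3 inputs -> 2 outputs (arbitrary sizes)
    b = np.zeros(2)

    x = np.array([45.0, 322.0, 128.0])      # example input values
    expected = np.array([1.0, 0.0])         # desired output (e.g., tumor vs. brain)
    learning_rate = 0.01

    actual = forward(x, w, b)
    error = actual - expected               # error = actual output - expected output
    gradient = np.outer(error, x)           # weight gradient = error * input value
    w = w - learning_rate * gradient        # new weight = old weight - learning rate * gradient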

Learning = Optimization Problem

• Learning depends on:

– ‘Correct’ gradient directions

– ‘Correct’ gradient multiplier (learning rate)

[Figure: a loss surface with a global minimum, a local minimum, and a flat region with a small gradient.]

Support Vector Machines

• Maps input data to new ‘space’

• Creates hyperplane that separates classes in that space

[Figure: f(x) maps the input data into a new space where a hyperplane separates the two classes.]
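A toy scikit-learn sketch of this idea (the data points are made up; an RBF kernel plays the role of the mapping into a new space):

    import numpy as np
    from sklearn.svm import SVC

    # Four made-up 2D points from two classes.
    X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
    y = np.array([0, 0, 1, 1])

    # The RBF kernel implicitly maps the inputs into a higher-dimensional space;
    # the SVM finds a maximum-margin hyperplane separating the classes there.
    clf = SVC(kernel="rbf").fit(X, y)
    print(clf.predict([[0.1, 0.0], [1.0, 0.9]]))   # classify two new points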

Deep Learning: Why the Hype?

Performance in ImageNet Challenge

Team / Software                      Year   Error Rate
XRCE (not Deep Learning)             2011   25.8%
SuperVision (AlexNet)                2012   16.4%
Clarifai                             2013   11.7%
GoogLeNet (Inception)                2014   6.66%
Andrej Karpathy (human comparison)   2014   5.1%
BN-Inception (arXiv)                 2015   4.9%
Inception-v3 (arXiv)                 2015   3.46%

What is “Deep Learning”?

• “Deep” because it uses many layers

– ANNs typically had 3 or fewer layers

– DNNs have 15+ layers

Types of DNNs

• Convolutional Neural Network (CNN)

– Early layers take small ‘windows’ of the image as input

– Each window is multiplied by a ‘kernel’ to get an output value

– Known as a convolution (see the sketch below)

[Worked example, built up over several slides: a 3×3 window of pixel values (22, 14, 18, 21, 13, 27, ...) is multiplied element-wise by a 3×3 kernel, the products are summed (and here divided by 9), and the result becomes one output pixel; sliding the window across the image fills in the convolved output.]
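A minimal NumPy sketch of a 2D convolution as described above (the pixel values and the averaging kernel are illustrative, not the slide’s exact numbers):

    import numpy as np

    def convolve2d(image, kernel):
        # Slide a window the size of the kernel across the image; at each position
        # multiply the window element-wise by the kernel and sum the products to
        # produce one output pixel.
        kh, kw = kernel.shape
        ih, iw = image.shape
        out = np.zeros((ih - kh + 1, iw - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.array([[22, 14, 18, 44],
                      [21, 13, 27, 64],
                      [55, 32,  0, 28],
                      [89, 32, 15, 31]], dtype=float)
    kernel = np.ones((3, 3)) / 9.0          # 3x3 averaging kernel (divide the sum by 9)
    print(convolve2d(image, kernel))        # 2x2 output for a 4x4 image and 3x3 kernel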

Why the Excitement Now? Advances That Addressed Problems

• Many layers -> Overfitting

– Implement sparsity in weights: Dropout (see the sketch below)
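A minimal NumPy sketch of the Dropout idea (the drop probability and activations are arbitrary; real frameworks apply this only during training):

    import numpy as np

    def dropout(activations, drop_prob=0.5, rng=None):
        # Randomly zero a fraction of the activations so the network cannot rely on
        # any single node, which reduces overfitting; scale the survivors so the
        # expected value is unchanged.
        rng = rng or np.random.default_rng(0)
        mask = rng.random(activations.shape) >= drop_prob
        return activations * mask / (1.0 - drop_prob)

    print(dropout(np.array([0.2, 1.3, -0.7, 0.9, 2.1])))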

Why the Excitement Now? Advances That Addressed Problems

• Many layers -> Vanishing Gradients

– Dropout partially addresses this

– Can use ‘pre-trained’ weights for the early layers and freeze them, so only the later layers’ weights are trained to learn higher-level features

Typical CNNs

[Figure: a typical CNN alternates convolution and pooling layers and ends with fully connected layers.]

Typical CNNs

Andrej Karpathy: http://karpathy.github.io/2015/10/25/selfie/

Why the Excitement Now? Batch Normalization

• What should be the initial set of weights connecting nodes?

– All the same = no useful gradients

– Random, but over what range of values?

• BatchNorm:

– After each convolutional layer

– Subtract the mean / divide by the standard deviation (see the sketch below)

• Simple but effective
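A minimal sketch of the normalization step (omitting BatchNorm’s learnable scale and shift; the activation values below are made up):

    import numpy as np

    def batch_norm(activations, eps=1e-5):
        # For each channel, subtract the mean over the batch and divide by the
        # standard deviation over the batch.
        mean = activations.mean(axis=0, keepdims=True)
        std = activations.std(axis=0, keepdims=True)
        return (activations - mean) / (std + eps)

    # A batch of 4 examples with 3 channels whose raw scales differ widely.
    acts = np.array([[ 1.0, 200.0, -5.0],
                     [ 2.0, 180.0, -7.0],
                     [ 0.5, 220.0, -6.0],
                     [ 1.5, 190.0, -4.0]])
    print(batch_norm(acts))                 # every channel now has mean ~0, std ~1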

Why the Excitement Now? Residual Networks

*Targ, ICLR 2016

• A residual (skip) connection determines whether and how data is passed straight through from layer to layer (see the sketch below).

• Makes construction of very deep networks reliable
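A schematic sketch of a residual (skip) connection (the “layer” here is a trivial stand-in for any convolutional block):

    import numpy as np

    def residual_block(x, layer):
        # The block's output is the layer's output plus the unchanged input, so data
        # can pass straight through when the layer has little to add; this keeps
        # gradients flowing even in very deep networks.
        return layer(x) + x

    x = np.array([1.0, -2.0, 3.0])
    layer = lambda v: 0.1 * v               # trivial stand-in layer
    print(residual_block(x, layer))         # -> [ 1.1 -2.2  3.3]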

Why the Excitement Now?

• Deep Neural Network Theory

• Exponential Compute Power Growth

Moore’s Law

Computing performance doubles approximately every 18 months

Exponentials In Real Life

• If you put 1 drop of water into a football stadium, and then double the number of drops each minute:

– At 5 minutes, you will have 32 drops

– At 45 minutes, the water will cover the field about 1 inch deep

– At 55 minutes, the stadium will be full

• It is not natural for humans to grasp exponential growth

Deep Learning Works Well on GPUs

• Naturally parallel

• Lower precision (single-precision floating point) can actually be an advantage

• Cards are now being built with no video output, optimized for Deep Learning (e.g., the P100)

GPUs are Beating Moore’s Law

[Chart: relative compute performance on a log scale (10 to 1,000,000), roughly 2000-2020, with CPUs tracking Moore’s Law and GPUs, TPUs, and FPGAs pulling ahead.]

Deep Learning Myths

• “You Need Millions of Exams to Train and Use Deep Learning Methods”


Ways To Avoid Need For Large Data Sets

• Data Augmentation

– Essentially, creating variants of the data that are different enough to be learnable

– But similar enough that the teaching point is kept

– Mirror / Flip / Rotate / Contrast / Crop (see the sketch below)
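A minimal NumPy sketch of these augmentations applied to a 2D image array (the specific transforms and parameters are illustrative):

    import numpy as np

    def augment(image, rng=None):
        # Each variant changes the image enough to serve as a new training example
        # while keeping the underlying teaching point (the anatomy/label) intact.
        rng = rng or np.random.default_rng(0)
        variants = [
            np.fliplr(image),                       # mirror left-right
            np.flipud(image),                       # flip top-bottom
            np.rot90(image),                        # rotate 90 degrees
            np.clip(image * 1.2, 0, None),          # simple contrast change
            image[1:-1, 1:-1],                      # crop away the border
        ]
        return variants[rng.integers(len(variants))]

    image = np.arange(16, dtype=float).reshape(4, 4)
    print(augment(image))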

Ways To Avoid Need For Large Data Sets

• Data Augmentation

• Transfer Learning (see the sketch below)

[Figure, built up over several slides: a CNN (Image → stacked Conv and Max Pool layers → Fully Connected layers → SoftMax) is first trained on a large corpus like ImageNet; the early layers are then frozen (“Freeze These Layers”) and only the final layers are trained on the new task (“Train this”).]
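A hedged Keras sketch of this freeze-and-retrain idea (assumes TensorFlow/Keras is installed; the input size, layer widths, and two-class head are illustrative, not the talk’s actual model):

    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG16

    # Load a network already trained on a large corpus (ImageNet), without its
    # original classification head.
    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False                  # freeze the pretrained early layers

    # Add new, trainable layers for the target task (e.g., tumor vs. normal).
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    # model.fit(...) then trains only the new layers on the small medical dataset.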

Take Home Point

• Deep Learning Learns Features and Connections vs Just Connections

[Diagram: traditional machine learning pairs hand-crafted feature extraction with a classifier; deep learning learns the feature extractor together with the classifier.]

Examples of CNN in Medical Imaging: Body Part

*Roth, arXiv 2016

Examples of CNN in Medical Imaging: Segmentation

Moeskops, IEEE-TMI, 2016

Mayo: AutoEncoder for Segmentation

• Dataset

– Trained on BraTS 2015

– FLAIR enhancing signal

• Preprocessing

– N4 bias correction

– Nyúl intensity normalization

• Autoencoders trained on 110,000 ROIs (size = 12)

• Time: 1 hour for 155 slices (a DNN would take days or weeks)

Korfiatis, Submitted

What is an AutoEncoder?

Korfiatis, Submitted
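A minimal NumPy sketch of the autoencoder idea (a 12×12 ROI compressed by an encoder and reconstructed by a decoder; the weights here are random and untrained, purely to show the shape of the computation):

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_code = 12 * 12, 16              # flatten a 12x12 ROI; compress to 16 values
    W_enc = rng.normal(scale=0.1, size=(n_code, n_in))
    W_dec = rng.normal(scale=0.1, size=(n_in, n_code))

    def reconstruct(x):
        code = np.tanh(W_enc @ x)           # encoder: compress the ROI to a short code
        return W_dec @ code                 # decoder: rebuild the ROI from the code

    x = rng.normal(size=n_in)               # one flattened ROI
    loss = np.mean((reconstruct(x) - x) ** 2)
    print(loss)                             # training adjusts W_enc/W_dec to minimize this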

Dice = 0.92 over the BraTS dataset

Korfiatis, Submitted

Machine Learning & Radiomics

• Computers find textures reflecting genomics: 1p19q

• 85 Subjects with FISH results, computed multiple textures, SVM

Method        # Features   Sens   Spec   F-score   Accuracy
SVM           10           0.91   0.87   0.93      0.91
Abstract      10           0.95   0.93   0.96      0.95
Naïve Bayes   12           0.95   0.77   0.92      0.89

Erickson, Proc ASNR, 2016

Machine Learning & Radiomics

• 155 Subjects, GBM, MGMT Methylation

• Compute textures (T2 was best) -> SVM

Korfiatis, Med Phys, 2016

Deep Learning: MGMT Methylation

• Same set of patients, using VGGNet with transfer learning: Az = 0.86

• The autoencoder gives nearly as good performance and trains about 10× faster

• Now testing DeepMedic and RNN

Korfiatis, unpublished

The Pace of Change

Will Computers Replace Radiologists?

• Deep Learning will likely be able to create reports for diagnostic images in the future.

– 5 years: Mammo & CXR

– 10 years: CT Head, Chest, Abd, Pelvis, MR head, knee, shoulder, US: liver, thyroid, carotids

– 15-20 years: most diagnostic imaging

• Will likely ‘see’ more than we do today

• Will allow radiologists to focus on patient interaction and invasive procedures

How Might Medicine Best Embrace Deep Learning?

• Algorithms for Machine Learning are rapidly improving; CNNs are not the only game in town

• Hardware for Machine Learning is REALLY rapidly improving

• The amount of change in 20 years will be unbelievable

How Might Medicine Best Embrace Deep Learning?

• Medicine needs to remain flexible about hardware and software

• The VALUE is in the data and metadata

• Physicians are OBLIGATED to make sure the data are properly handled.

– Improper interpretation of data will lead to bad implementations and poor patient care

– Non-cooperation is also counter-productive
