GESTURE RECOGNITION WITH 3D CNNS

Pavlo Molchanov, Xiaodong Yang, Shalini Gupta, Kihwan Kim, Stephen Tyree, Jan Kautz

April 4-7, 2016 | Silicon Valley

4/6/2016


AGENDA

Motivation

Problem statement

Selecting the best classifier

Online gesture detection and classification

Demos

MOTIVATION

GESTURE IS A NATURAL FORM OF COMMUNICATION

photo.elsoar.com

SAFE INTERFACES

@ bmw.com

NEED FOR VIDEO RELAY SERVICES

@ http://relayservice.gov.au/

GAMING

@ leapmotion

PROBLEM STATEMENT

Single commodity sensor:

• Gesture recognition

• Skeleton tracking

• Gaze estimation

• Head tracking

No special devices

Example sensors: Kinect v1, SoftKinetic

We don’t: Hand model fitting and tracking*

We do: Understanding gesture concepts

• Thumb up

• Wave hand

Classifier: ??????

*http://www.virtualrealityreviewer.com/leap-motion-enters-vr-new-software-product-accessory-preview-what%C2%B9s-next/

SELECTING THE BEST CLASSIFIER

VIVA CHALLENGE 2015 organized by UCLA

19 classes, 8 subjects

Driver and passenger

RGB + Depth from Microsoft Kinect

885 gestures in total

Gesture examples:

• Slide 2 fingers left

• Zoom out

• Rotate CCW

3D Convolutional Neural Network

[Figure: RGB + Depth input → 4 × (3D convolution and max-pooling) → ReLU → ReLU → Softmax → Prediction]
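The pipeline in the figure can be sketched in plain NumPy. This is a minimal illustration and not the authors' network: the 4-channel RGB + depth layout, the kernel and pooling sizes, the channel counts, the ReLU after each convolution, and the random weights are all assumptions made for the example. Only the overall structure (four stages of 3D convolution and max-pooling, two ReLU layers, a softmax) and the 19 output classes come from the slides.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv3d(volume, kernels):
    """Valid 3D cross-correlation over (time, height, width).

    volume:  (C_in, T, H, W) input clip
    kernels: (C_out, C_in, kT, kH, kW) filter bank
    returns: (C_out, T-kT+1, H-kH+1, W-kW+1)
    """
    kT, kH, kW = kernels.shape[2:]
    # windows has shape (C_in, T', H', W', kT, kH, kW)
    windows = sliding_window_view(volume, (kT, kH, kW), axis=(1, 2, 3))
    return np.einsum("cthwijk,ocijk->othw", windows, kernels)


def max_pool3d(x, pool):
    """Non-overlapping max-pooling; ragged borders are dropped."""
    pT, pH, pW = pool
    C, T, H, W = x.shape
    T, H, W = T // pT * pT, H // pH * pH, W // pW * pW
    x = x[:, :T, :H, :W].reshape(C, T // pT, pT, H // pH, pH, W // pW, pW)
    return x.max(axis=(2, 4, 6))


def forward(clip, params):
    """Four conv+pool stages, two ReLU layers, then a softmax."""
    x = clip
    for kernels, pool in params["stages"]:
        x = max_pool3d(np.maximum(conv3d(x, kernels), 0.0), pool)
    x = x.reshape(-1)
    for weights in params["dense"]:
        x = np.maximum(weights @ x, 0.0)
    logits = params["out"] @ x
    e = np.exp(logits - logits.max())
    return e / e.sum()


rng = np.random.default_rng(0)
channels = [4, 4, 8, 8, 16]                       # hypothetical channel counts
pools = [(2, 2, 2), (1, 2, 2), (1, 2, 2), (1, 1, 1)]  # hypothetical pooling sizes
params = {
    "stages": [
        (rng.normal(0, 0.1, (channels[i + 1], channels[i], 3, 3, 3)), pools[i])
        for i in range(4)
    ],
    "dense": [rng.normal(0, 0.1, (32, 16)), rng.normal(0, 0.1, (32, 32))],
    "out": rng.normal(0, 0.1, (19, 32)),          # 19 gesture classes (VIVA)
}
clip = rng.random((4, 16, 40, 40))                # (RGB + depth, frames, height, width)
probs = forward(clip, params)
```

The key difference from a 2D CNN is that each kernel also spans the time axis, so motion across neighbouring frames is captured directly by the filters.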

SEGMENTED GESTURE CLASSIFICATION

Training

[Figure: RGB + Depth → 3D CNN → error → back propagation update]

First result

Classification accuracy, higher is better

                HON4D [1]    HOG [2]    3D-CNN
Testing set     58.7%        64.5%      48.3%
Training set    -            -          99.9%

[1] Oreifej and Liu. HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences, CVPR, 2013.

[2] Ohn-Bar and Trivedi, IEEE Trans. on Intelligent Transportation Systems, 2014.

Recent success in deep learning benefited from large data

ImageNet: 1.5 M examples

VIVA: 885 examples

Training with data augmentation

[Figure: RGB + Depth → data augmentation → 3D CNN → error → back propagation update]

Data augmentation

Spatial geometric transformations

Temporal augmentation

Generating new training data

[Figure: original and augmented examples shown side by side, including a flip]
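The augmentations named above can be sketched as follows. This is a minimal sketch, assuming clips are stored as (channels, frames, height, width) arrays; the function names, the shift and time-scale ranges, and the nearest-frame resampling are illustrative choices, not the settings used in the talk.

```python
import numpy as np


def flip_horizontal(clip):
    """Mirror every frame left to right."""
    return clip[:, :, :, ::-1]


def shift(clip, dy, dx):
    """Translate every frame by (dy, dx) pixels, filling uncovered pixels with zeros."""
    out = np.zeros_like(clip)
    H, W = clip.shape[2:]
    dst_y, src_y = slice(max(dy, 0), H + min(dy, 0)), slice(max(-dy, 0), H + min(-dy, 0))
    dst_x, src_x = slice(max(dx, 0), W + min(dx, 0)), slice(max(-dx, 0), W + min(-dx, 0))
    out[:, :, dst_y, dst_x] = clip[:, :, src_y, src_x]
    return out


def resample_time(clip, scale, offset=0.0):
    """Speed up, slow down, or shift the gesture in time (nearest-frame sampling)."""
    T = clip.shape[1]
    centre = (T - 1) / 2
    t = (np.arange(T) - centre) * scale + centre + offset
    idx = np.clip(np.rint(t), 0, T - 1).astype(int)
    return clip[:, idx]


def augment(clip, rng):
    """Draw one random augmented copy of a clip (all ranges are hypothetical)."""
    out = shift(clip, int(rng.integers(-4, 5)), int(rng.integers(-4, 5)))
    out = resample_time(out, scale=rng.uniform(0.8, 1.2), offset=rng.uniform(-2, 2))
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    return out


rng = np.random.default_rng(0)
clip = rng.random((4, 16, 32, 32))
augmented = [augment(clip, rng) for _ in range(8)]
```

Note that mirroring can change the meaning of a directional gesture (a flipped "slide left" looks like "slide right"), so labels may need to be remapped when a flip is applied.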

VIVA: 885 examples

Augmented VIVA: 0.3 M examples

Official challenge results

Classification accuracy (%), higher is better

Method                                      Accuracy
Harris-3.5D                                 36.4
HOG3D                                       44.6
Dense Trajectories                          54
HON4D                                       58.7
HOG+HOG2                                    64.5
NVIDIA (3D-CNN), no data augmentation       48.3
NVIDIA (3D-CNN), with data augmentation     77.5

Speed

FPS, higher is better

Method                FPS
Harris-3.5D           0.2
HOG3D                 3
Dense Trajectories    18
HON4D                 25
HOG+HOG2              50
NVIDIA (3D-CNN)       110 on CPU; GPU +250; cuDNNv4 +400

SEGMENTED GESTURE CLASSIFICATION

[Figure: timeline marking the start and the end of the gesture; the classification decision comes after the gesture ends]

Decision after gesture ends introduces latency

ONLINE GESTURE DETECTION AND CLASSIFICATION

ONLINE GESTURE CLASSIFICATION

[Figure: timeline marking the start and the end of the gesture; the classification decision comes before the gesture ends]

A decision before the gesture ends improves feedback and user experience

R3DCNN

[Figure: Video server → clips of 8 frames → 3D CNN (local motion descriptor) → RNN (global motion descriptor) → softmax, repeated for each clip]

• Forward recurrence only

• Detection and classification

• 109M parameters

• Connectionist Temporal Classification (CTC) for training only
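The "forward recurrence only" idea can be illustrated with a small NumPy loop: each 8-frame clip is turned into a local descriptor, a recurrent state accumulates the clips seen so far, and a softmax is emitted after every clip, so a decision is available before the gesture ends. This is a sketch under stated assumptions: the linear-plus-ReLU stand-in for the 3D CNN, the tanh recurrence, all sizes except the 8-frame clip length, and the random weights are hypothetical.

```python
import numpy as np

CLIP_LEN = 8                          # from the slide: 8 frames per clip
FEAT, HIDDEN, CLASSES = 32, 16, 5     # hypothetical sizes
FRAME_H = FRAME_W = 8                 # hypothetical frame size

rng = np.random.default_rng(0)
W_feat = rng.normal(0, 0.1, (FEAT, CLIP_LEN * FRAME_H * FRAME_W))  # stand-in for the 3D CNN
W_in = rng.normal(0, 0.1, (HIDDEN, FEAT))
W_rec = rng.normal(0, 0.1, (HIDDEN, HIDDEN))
W_out = rng.normal(0, 0.1, (CLASSES, HIDDEN))


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def run_online(frames):
    """Return one class distribution per 8-frame clip, using past clips only."""
    h = np.zeros(HIDDEN)
    outputs = []
    for start in range(0, len(frames) - CLIP_LEN + 1, CLIP_LEN):
        clip = frames[start:start + CLIP_LEN]
        local = np.maximum(W_feat @ clip.reshape(-1), 0.0)  # local motion descriptor
        h = np.tanh(W_in @ local + W_rec @ h)               # global motion descriptor
        outputs.append(softmax(W_out @ h))
    return np.array(outputs)


frames = rng.random((40, FRAME_H, FRAME_W))
per_clip_probs = run_online(frames)   # shape (5, CLASSES): one prediction per clip
```

Because the state only moves forward in time, the prediction for a clip never depends on later frames, which is what makes the model usable online.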

Training loss function

Labeling dynamic gestures is difficult

Labeling per frame is ambiguous

Input:

Labels:

Loss function: Per frame negative log likelihood
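For reference, a per-frame negative log likelihood can be written in a few lines. This is a generic sketch, not the authors' training code; averaging over frames (rather than summing) is an assumption. It makes the labeling problem visible: the loss needs one label for every frame.

```python
import numpy as np


def per_frame_nll(probs, labels):
    """Mean negative log likelihood of the labeled class at each frame.

    probs:  (T, K) predicted class probabilities per frame
    labels: (T,) integer class label per frame
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())


uniform = np.full((6, 4), 0.25)
loss = per_frame_nll(uniform, [0, 0, 1, 1, 2, 3])  # equals log(4) for uniform predictions
```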

Sequence based training is the solution

Input:

Sequence: nothing – slide right – nothing – slide left – nothing

Loss function: Connectionist Temporal Classification (CTC) by A. Graves et al.
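The CTC loss sums the probability of every frame-level path that collapses to the target sequence, so only the order of gestures has to be labeled, not their frame boundaries. Below is a minimal NumPy sketch of the standard CTC forward recursion. Treating the "nothing" class as the CTC blank (index 0) is an assumption made for this example, and the probabilities are computed in the linear domain, which is only suitable for short sequences.

```python
import numpy as np


def ctc_neg_log_likelihood(probs, target, blank=0):
    """CTC loss for one sequence.

    probs:  (T, K) per-step class probabilities
    target: list of gesture labels in order, without blanks
    """
    probs = np.asarray(probs, dtype=float)
    T = probs.shape[0]
    # Interleave blanks: [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]
            if s >= 1:
                total += alpha[t - 1, s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, ext[s]]

    likelihood = alpha[-1, -1] + (alpha[-1, -2] if S > 1 else 0.0)
    return float(-np.log(likelihood))


# Classes: 0 = nothing (blank), 1 = slide right, 2 = slide left
rng = np.random.default_rng(0)
probs = rng.random((6, 3))
probs /= probs.sum(axis=1, keepdims=True)
loss = ctc_neg_log_likelihood(probs, [1, 2])  # target: slide right, then slide left
```

At test time no CTC decoding is needed in this setup: the per-clip softmax outputs are used directly, which matches the "CTC for training only" note on the R3DCNN slide.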

Italian sign language recognition

ChaLearn 2014 challenge

RGBD videos of 20 Italian sign language gestures

13K gestures

20 subjects

Classification accuracy (%)

Method           Accuracy
Pigou et al.*    97.2
3D-CNN           97.4
3D-CNN CTC       98.2

Improvement in accuracy: 35%

By seeing only 41% of gesture

No pre- or post-processing

*L. Pigou et al. Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video
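The slide does not say how the 35% improvement is computed. One plausible reading, which is an assumption here and not confirmed by the slides, is the relative reduction in error rate between Pigou et al. (97.2%) and the 3D-CNN with CTC (98.2%), which comes out close to that figure:

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Relative reduction in error rate (%), given two accuracies in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


# Accuracies taken from the chart above.
reduction = relative_error_reduction(97.2, 98.2)  # roughly 35.7
```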

Car interfaces

In-house database

Media player, navigation, phone

20 subjects, 25 gestures

More information at CVPR2016

Results on the in-house database:

Method            Value
HOG+HOG2          37
Two stream CNN    66
SNV               71
iDT               73
C3D               79
Ours              84
Human             88

Suitability of hardware for inference:

Latency is critical

[Figure: GPU vs. CPU comparison for image classification and for video classification]

NVIDIA TX1 - for embedded solutions

Credit-card-sized GPU in your pocket

Our R3DCNN uses only 30% of the GPU

Scalability

CONTRIBUTIONS

Data augmentation helps deep learning a lot

R3DCNNs are the best for sign language and gesture recognition

CTC helps a lot with video sequence learning

Scalable enough to run on NVIDIA TX1

[Figure: CTC, Deep Learning, Data Augmentation]


THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join