Computer Science

Accurate and Efficient Gesture Spotting via Pruning and Subgesture Reasoning

Jonathan Alon, Vassilis Athitsos, and Stan Sclaroff

Computer Science Department, Boston University

Gesture Recognition Applications

Human-Computer Interaction
Sign Language Analysis
Video Annotation
UAV Guidance

Command spotting to control:
• Computer Applications [Lee&Kim 99, Zhu et al 02]
• TV and Video games [Freeman et al 96, 99]
• Robots [Triesch 97]

Classification of Gesture Recognition Problems

Isolated: easier.

Continuous: harder; requires spotting and recognition.

Gesture Spotting Problem

Given a vocabulary of gestures:

Locate the start and end frame of a gesture within a long video stream (and recognize the gesture).

[Figure: example video stream containing non-gesture frames, a “2” gesture, and a “5” gesture; frames 334, 403, 733, and 836 are shown.]

Overview

Objective: Propose an efficient and accurate gesture spotting and recognition system that enables the most natural human-computer interaction.

Approach:
1. Pruning method that views pruning as a classification (learning) problem.
2. Subgesture reasoning process that models the fact that a gesture may resemble a part of a longer gesture.

Experiments:
• Order of magnitude speedup
• 18% improvement in accuracy

Gesture Spotting Framework

Indirect approach: spotting is intertwined with recognition.

[Diagram: Video Stream → Hand Detection + Feature Extraction → Feature Vector → Temporal Matching (uses Gesture Models + Pruning Classifiers) → Matching Costs → Spotting (uses Spotting Rules + Subgesture table) → Gesture id, start and end frames]

Temporal Matching: Continuous Dynamic Programming (CDP) [Oka 98]

Spotting: [Morguet&Lang 98, Lee&Kim 99]

Hand Detection and Feature Extraction

Hand Detection: based on color and motion

Feature: (x,y) hand centroid

[Figure panels: Input Frame; Skin Likelihood; Frame Differencing; “Hand Likelihood”; Detected Hand]

Temporal Matching: Continuous Dynamic Programming (CDP)

[Figure: dynamic programming table matching the model (time i) against the input (time j); cell (i,j) holds the local cost d(i,j).]

Local Cost: d(i,j)=L2(Mi,Qj)


Cumulative Cost: D(i,j)=d(i,j)+min{D(i-1,j), D(i,j-1), D(i-1,j-1)}
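The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `l2` and `cdp_end_costs` are made up for this example, and the boundary condition D(0,j) = 0 (a match may start at any input frame) is an assumption that the slides do not state explicitly.

```python
import math


def l2(p, q):
    """Local cost d(i,j): Euclidean distance between two (x, y) features."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def cdp_end_costs(model, stream):
    """Return the end-point cost D(m, j) for every input frame j.

    Assumption (not in the slides): a virtual row D(0, j) = 0 lets a
    warping path start at any input frame.
    """
    m = len(model)
    prev = [0.0] + [math.inf] * m          # column j-1 (empty before the first frame)
    end_costs = []
    for q in stream:
        cur = [0.0] + [math.inf] * m       # column j, row 0 is the virtual start row
        for i in range(1, m + 1):
            # min{D(i-1,j), D(i,j-1), D(i-1,j-1)}
            best = min(cur[i - 1], prev[i], prev[i - 1])
            cur[i] = l2(model[i - 1], q) + best
        end_costs.append(cur[m])
        prev = cur
    return end_costs


model = [(0, 0), (1, 0), (2, 0)]
stream = [(5, 5), (0, 0), (1, 0), (2, 0), (9, 9)]
costs = cdp_end_costs(model, stream)
```

In this toy stream the model appears exactly in frames 1 to 3, so `costs` reaches its minimum (zero) at frame index 3. Only two columns are kept in memory at a time, which is what makes the computation suitable for a continuous stream.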

[Figure: the warping path W ends at cell (m,j); its cumulative cost is D(m,j). Two candidate end frames j1 and j2 are compared, with D(m,j2) < D(m,j1).]

Spotting: Detection of candidate gesture end point

[Plot: matching cost Dg(mg,j) against time j, with a detection threshold.]

Why Pruning?

Search time for the best matching model increases linearly with the number of gesture models. This can be too expensive for:
• Systems with large gesture vocabularies
• Real-time applications

Efficient search methods [Gao et al 00]:
• Fast match, N-best search, A*, …
• Beam search:
  • maintains promising hypotheses that have low matching costs within a “beam width” from the matching cost of the current best hypothesis.
  • requires ad hoc setting of the “beam width”.
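For comparison with the learned pruning introduced next, the beam search rule described above can be written as a one-line filter. This is a generic sketch; the function name `beam_prune` and the hypothesis labels are illustrative, and the value of `beam_width` is exactly the hand-set parameter the slide refers to.

```python
def beam_prune(costs, beam_width):
    """Keep hypotheses whose cost is within beam_width of the best cost."""
    best = min(costs.values())
    return {h: c for h, c in costs.items() if c <= best + beam_width}


hypotheses = {"0": 12.0, "2": 3.5, "9": 5.0}
survivors = beam_prune(hypotheses, beam_width=2.0)
```

With a beam width of 2.0 the hypotheses for “2” and “9” survive and “0” is dropped; a different width would give a different set, which is why the setting is ad hoc.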

Pruning: Novel Viewpoint

Pruning is a classification problem, so we can use any classifier, e.g.,
• based on cumulative cost.
• based on observation cost.
• based on transition cost.

Classifiers can be learned from training data, instead of manually specifying “beam width”.

Pruning is decoupled from recognition.

Pruning: Motivating Example

If input feature j is too far from model feature i (d(i,j) > τi), then all paths going through cell (i,j) should be pruned.

For example, the start point of digit “5” is far from the start point of digit “2” both in terms of position and direction.

How to Prune?

Classifier learning objective: maximize pruning (white cell area) subject to minimizing the expectation of pruning the optimal path (red).

[Figure: DP table matching the input (digit “6”) against the model (digit “6”).]

Legend:
• White: pruned cells, $d(M_i, Q_j) > \tau_i$ (pix)
• Black: visited cells
• Red: optimal path

Learning to prune: example classifier

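1. Match every positive gesture example $M^p$ with model $M$.

2. For every model feature $M_i$ record all features $M^p_j$ that match it (using DTW).

3. Let $\tau_i = \max_j d(M_i, M^p_j)$, where the maximum is taken over all recorded matches. The pruning classifier for model feature $M_i$ is:

$$C_i(Q_j) = \begin{cases} +1 & \text{if } d(M_i, Q_j) \le \tau_i \\ -1 & \text{if } d(M_i, Q_j) > \tau_i \end{cases}$$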
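The three learning steps above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: the helper names (`dtw_alignment`, `learn_thresholds`, `prune_classifier`) are hypothetical, features are assumed to be (x, y) points compared with the Euclidean distance, and the alignment uses plain DTW over whole sequences.

```python
import math


def l2(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def dtw_alignment(model, example):
    """Return the optimal DTW warping path as (model index, example index) pairs."""
    m, n = len(model), len(example)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = l2(model[i - 1], example[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    path, i, j = [], m, n
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        # Step back to the cheapest of the three predecessors.
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    return path[::-1]


def learn_thresholds(model, examples):
    """tau_i = largest distance between M_i and any training feature matched to it."""
    tau = [0.0] * len(model)
    for example in examples:
        for i, j in dtw_alignment(model, example):
            tau[i] = max(tau[i], l2(model[i], example[j]))
    return tau


def prune_classifier(model, tau, i, q):
    """C_i(Q_j): +1 keeps cell (i, j), -1 prunes it."""
    return 1 if l2(model[i], q) <= tau[i] else -1


model = [(0, 0), (10, 0)]
examples = [[(0, 1), (10, 2)], [(0, 3), (10, 1)]]
tau = learn_thresholds(model, examples)
```

Here the two training examples give thresholds of 3.0 and 2.0 for the two model features. Because each threshold is the maximum over the training matches, no optimal path seen in training would have been pruned; how well that generalizes depends on the training set.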

CDP with Pruning (CDPP)

• Sparse vector representation: black cells are stored in memory.

[Figure sequence: column Qj of the DP table is filled in from M1 to Mm, next to column Qj-1. The pruning classifier of each cell is evaluated in turn: C1(Qj) = +1 and C2(Qj) = +1, so cells 1 and 2 are kept; C3(Qj) = -1, so cell 3 is pruned and the algorithm skips to the next cell that has a neighbor.]
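One way to read the column update pictured above is sketched below. It is a simplified illustration under several assumptions: the sparse column is a Python dict keyed by the 1-based model index (the slides use a preallocated sparse vector), a virtual start row with cost 0 is assumed, and the names `cdpp_column`, `tau`, and `dist` are made up for this example. For brevity the loop tests every cell; an implementation following the slide would jump directly to the next cell that has a stored neighbor.

```python
import math


def l2(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def cdpp_column(model, tau, prev, q, dist=l2):
    """Compute the sparse DP column for input feature q.

    prev and the result map model index i (1-based) to D(i, j);
    pruned cells are simply absent.
    """
    cur = {}
    for i in range(1, len(model) + 1):
        # Surviving predecessors among D(i-1,j), D(i,j-1), D(i-1,j-1).
        neighbors = [0.0] if i == 1 else []   # assumed virtual start row
        neighbors += [c[k] for c, k in ((cur, i - 1), (prev, i), (prev, i - 1))
                      if k in c]
        if not neighbors:
            continue                          # no neighbor: cell is unreachable
        d = dist(model[i - 1], q)
        if d > tau[i - 1]:
            continue                          # C_i(Q_j) = -1: prune the cell
        cur[i] = d + min(neighbors)
    return cur


model = [(0, 0), (1, 0), (2, 0)]
tau = [1.0, 1.0, 1.0]
col1 = cdpp_column(model, tau, {}, (0, 0))
col2 = cdpp_column(model, tau, col1, (1, 0))
col3 = cdpp_column(model, tau, col2, (2, 0))
```

In this toy run the first column stores only cells 1 and 2, and the third column stores cells 2 and 3 with the complete-model cost D(3, j) = 0. An input feature far from every model feature yields an empty column.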

Spotting

Spotting Rules

INPUT
• Matching costs in current frame j
• Current candidate gesture list
  • matching cost
  • duration
Optional:
• Frame index of last detected gesture
• Response time

OUTPUT
• Detected gesture and gesture endpoint
OR
• New candidate gesture list

Nested Gestures

Which gesture to recognize? 5 or 8? 7 or 3? 1 or 9?

[Plot: matching cost against time j for the nested gestures “5” and “8”.]

Solution: subgesture table

Subgesture    Supergestures
0             9
1             4, 7, 9
4             2, 5, 6, 8, 9
5             8
7             2, 3, 9

• If a gesture is firing and at least one of its supergestures is firing, then wait; otherwise, recognize it.
• If a gesture is firing and it has no supergestures, then recognize it.
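The two rules and the subgesture table above translate directly into a small lookup. The sketch below is only an illustration: the dictionary copies the table from the slide, while the names `SUPERGESTURES` and `decide` and the string return values are invented for this example.

```python
# Subgesture table from the slide: subgesture -> set of its supergestures.
SUPERGESTURES = {
    "0": {"9"},
    "1": {"4", "7", "9"},
    "4": {"2", "5", "6", "8", "9"},
    "5": {"8"},
    "7": {"2", "3", "9"},
}


def decide(gesture, firing):
    """Return 'wait' or 'recognize' for a gesture that is currently firing."""
    if gesture not in firing:
        raise ValueError("gesture is not firing")
    supergestures = SUPERGESTURES.get(gesture, set())
    # Wait only if some supergesture of this gesture is firing as well.
    return "wait" if supergestures & firing else "recognize"
```

For example, `decide("5", {"5", "8"})` returns `"wait"` because “8” is a supergesture of “5”, while `decide("8", {"5", "8"})` returns `"recognize"` because “8” has no supergestures in the table.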

Spotting Algorithm (1)

Update Candidate Gesture List:

1. Find all firing models.

2. Conduct subgesture competitions among firing models.

3. Find the best firing model.

4. For every candidate, perform overlapping and subgesture tests with respect to the best firing model.

5. Remove a candidate if it fails any test.

6. Add the best firing model if it passes all tests.

Spotting Algorithm (2)

Spot a candidate gesture if any of the following holds:

1. all of its active supergesture models started after the candidate's end frame j*.

2. all current active paths started after the candidate's end frame j*.

3. a specified number of frames have elapsed since the candidate's end frame j*.
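The three spotting conditions above can be combined into a single predicate. This is a hedged sketch: the function name `should_spot` and its parameters are hypothetical, start frames are assumed to be available for each active supergesture model and active path, and `max_delay` stands for the "specified number of frames".

```python
def should_spot(end_frame, supergesture_starts, active_path_starts,
                current_frame, max_delay):
    """Return True if the candidate ending at end_frame (j*) should be spotted."""
    # Rule 1: every active supergesture model started after j*.
    rule1 = all(start > end_frame for start in supergesture_starts)
    # Rule 2: every currently active path started after j*.
    rule2 = all(start > end_frame for start in active_path_starts)
    # Rule 3: enough frames have elapsed since j*.
    rule3 = current_frame - end_frame >= max_delay
    return rule1 or rule2 or rule3
```

Note that `all` over an empty list is true, so in this sketch a candidate with no active supergesture models is spotted immediately; whether the original system treats that case the same way is an assumption.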

Experiments

• Models: 2 users * 10 digits * 3 examples per digit.
• Test: 2 users * 3 long sequences * 10 digits.
• Sequence length:
  • input: 1000-1500 frames.
  • digit: 30-90 frames.

[Figure: Example Sequence]

Results

Accuracy:

Method            CDP      CDPP     CDPPS
Detection Rate    78.3%    85.0%    96.7%
False Matches     13       9        2

CDP = Continuous Dynamic Programming, CDPP = CDP with Pruning, CDPPS = CDP with Pruning and Subgesture Reasoning.

Speedup: CDPP is 10 times faster than CDP.
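The accuracy figures above can be cross-checked against the size of the test set. The sketch below assumes that the detection rate is the fraction of the 2 * 3 * 10 = 60 test gestures that were detected; the counts 47, 51, and 58 are not given in the slides and are inferred here only because they reproduce the reported percentages.

```python
n_test = 2 * 3 * 10  # users * long sequences * digits per sequence

# Assumed counts of detected gestures (inferred, not reported in the slides).
detected = {"CDP": 47, "CDPP": 51, "CDPPS": 58}

rates = {method: round(100 * count / n_test, 1)
         for method, count in detected.items()}

gain_pruning = round(rates["CDPP"] - rates["CDP"], 1)       # pruning alone
gain_subgesture = round(rates["CDPPS"] - rates["CDPP"], 1)  # subgesture reasoning
gain_total = round(rates["CDPPS"] - rates["CDP"], 1)        # both together
```

Under that assumption the gains are 6.7, 11.7, and 18.4 percentage points, which is consistent with the rounded figures of 7%, 12%, and 18% quoted elsewhere in the talk.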

Conclusions

• Pruning is a classification problem.
  • CDPP is an order of magnitude faster than CDP, and 7% more accurate than CDP.
• Reasoning about nested gestures improves recognition accuracy.
  • CDPPS improves accuracy by an additional 12%.
• Both pruning and subgesture reasoning can be applied to other dynamic models (e.g., HMMs).

Thank you

Ongoing Work

• Learn:
  1. Pruning classifiers using cross-validation.
  2. Subgesture table.
  3. Gesture verifiers.
• Compare pruning method to Beam Search.
• Handle multiple candidate hand hypotheses.
• Apply methods to automatic sign language transcription.

Towards Automatic Annotation of American Sign Language

Additional challenges:
• Users not cooperative: fast gesture speeds; variation between users.
• Significant variation in hand shape and appearance.
• Different types of gestures: finger spelling, one vs. two handed.

Gesture Spotting: Related Work

Direct approach [Kang et al 04, Kahol et al 04]

Spotting precedes recognition.

1. Compute low-level motion parameters, such as velocity, acceleration, trajectory curvature.

2. Look for abrupt changes (zero-crossings) in those parameters to find candidate gesture boundaries.

Indirect approach [Morguet&Lang 98, Lee&Kim 99]

Spotting is intertwined with recognition.

1. Compute input-to-model matching costs.

2. Look for low cost to detect candidate gesture end point. (The gesture start point can be found by backtracking the optimal dynamic programming path.)
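As a rough illustration of step 2 of the direct approach above, the sketch below looks for zero-crossings of the acceleration, taken here as the frame-to-frame change in hand speed. This is one possible reading only: the cited systems are not described in enough detail in the slides to reproduce them, and the function name `speed_minima` is hypothetical.

```python
import math


def speed_minima(points):
    """Return indices into the speed sequence where acceleration crosses zero upward.

    speed[k] is the distance travelled between points k and k+1; a crossing
    from negative to non-negative acceleration marks a local speed minimum,
    taken here as a candidate gesture boundary.
    """
    speed = [math.hypot(x1 - x0, y1 - y0)
             for (x0, y0), (x1, y1) in zip(points, points[1:])]
    accel = [b - a for a, b in zip(speed, speed[1:])]
    return [k + 1 for k in range(len(accel) - 1)
            if accel[k] < 0 <= accel[k + 1]]


trajectory = [(0, 0), (3, 0), (5, 0), (6, 0), (8, 0), (11, 0)]
boundaries = speed_minima(trajectory)
```

In this toy trajectory the hand slows down and then speeds up again, so a single candidate boundary is reported at the slowest step.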

Approach: Continuous Dynamic Programming (CDP) [Oka 98]

[Figure: the input is matched against the models for “0”, “2”, and “9”; cell (i,j) of a DP table is computed from its neighbors, which involve model features Mi-1, Mi and input features Qj-1, Qj.]

$$D(i,j) = d(i,j) + \min\{D(i-1,j),\ D(i,j-1),\ D(i-1,j-1)\}$$

• d(i,j): distance between model feature Mi and input feature Qj.
• D(i,j): cumulative distance between model M(1:i) and input subsequence Q(j’:j).
• Continuous and Monotonic Warping Path

Approach: Continuous Dynamic Programming (CDP)

[Figure: DP tables for models “0”, “2”, and “9” (model time i against input time j) with an accept threshold; the optimal warping path (continuous and monotonic) gives the gesture start and end.]

Conclusions

• CDPP is an order of magnitude faster than CDP, and 7% more accurate than CDP.
• CDPPS improves accuracy by an additional 12%.
• Both pruning and subgesture reasoning can be applied to Hidden Markov Models (HMMs).

Future Work:
• Learn:
  1. Subgesture table.
  2. Gesture Transition Classifiers and Subsequence Classifiers.
  3. Gesture Verifiers.
• Apply methods to spot signs in American Sign Language (ASL) sequences (e.g., utterances, stories, and dialogs).

Gesture Types (Channels)

Head Gesture, Body Gesture, Hand Gesture

Gesture Spotting: Related Work

Indirect approach [Morguet&Lang 98, Lee&Kim 99]

Spotting is intertwined with recognition.

0. Detect hands and extract features.

1. Compute input-to-model matching costs.

2. Look for low cost to detect candidate gesture end point.

Gesture End Point Detection and Gesture Recognition

The algorithm is invoked for every input frame j, and consists of two steps:

1. Update the current list of candidate gesture models.

2. Apply a set of spotting rules to decide whether or not a gesture was spotted and, if so, which gesture model it matches.

End Point Detection Definitions

Paths
• Complete Path W(M1:m, Qj’:j): a legal warping path matching the input subsequence Qj’:j with the complete model M1:m.
• Partial Path W(M1:i, Qj’:j): a legal warping path matching the input subsequence Qj’:j with part of the model M1:m.
• Active Path: a partial path that has not been pruned.

End Point Detection Definitions

Models
• Active Model g: a model that has a complete path ending at the current input frame j.
• Firing Model g: an active model with a cost below the detection acceptance threshold.
• Subgesture Relationship: a gesture g1 is a subgesture of gesture g2 if it is properly contained in g2. In this case, g2 is a supergesture of g1.

Spotting Rules (1)

Zhu et al 02 (Spotting rules)

Based on Baudel&Beaudouin-Lafon’s Interaction Model.
1. A moving hand appears in the sequence.
2. The moving hand is the dominant moving object.
3. The movement of the hand follows a three-stage process: preparation, stroke, and retraction [Kendon].
4. The duration of the stroke T is bounded, T1≤T≤T2, for a given sampling rate.

Spotting Rules (3)

Lee&Kim 99 (End-point detection):

Gesture Spotting: Applications

Command spotting for controlling:
• Computer Applications [Lee&Kim 99, Zhu et al 02]
• TV and Video games [Freeman et al 96, 99]
• Robots [Triesch 97]

Sign Language Analysis [Starner&Pentland 95, Vogler&Metaxas 99, …]

[Cui&Weng 96, Yang&Ahuja 99, Bowden et al 04, folks at ucf]

Implementation Details

We use a circular buffer of fixed length (e.g., 150 frames) to implement the sliding window concept.

We use a sparse vector representation that enables fast individual element access (compared to fast matrix vector operations as in Matlab).

We sacrifice memory in favor of efficiency and to avoid fragmentation by preallocating memory for the sparse vectors to their maximum capacity (model length).
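The sliding-window idea above can be illustrated with a fixed-length circular buffer of DP columns. The sketch is an assumption-laden example rather than the authors' data structure: the class name `CircularBuffer`, its methods, and the choice to index columns by absolute frame number are all made up for illustration; only the default capacity of 150 frames comes from the slide.

```python
class CircularBuffer:
    """Fixed-length window over the most recent DP columns."""

    def __init__(self, capacity=150):
        self.capacity = capacity
        self.slots = [None] * capacity   # allocated once, reused forever
        self.count = 0                   # number of frames pushed so far

    def push(self, column):
        """Store the column of the next frame, overwriting the oldest one."""
        self.slots[self.count % self.capacity] = column
        self.count += 1

    def get(self, frame):
        """Return the column of absolute frame index `frame` (0-based)."""
        oldest = max(0, self.count - self.capacity)
        if not oldest <= frame < self.count:
            raise IndexError("frame is outside the sliding window")
        return self.slots[frame % self.capacity]


window = CircularBuffer(capacity=3)
for j in range(5):
    window.push("column %d" % j)
```

After five pushes into a window of three, frames 2 to 4 are still retrievable and older frames raise `IndexError`. Because the slots are allocated once, memory use stays constant no matter how long the video stream runs.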

Sparse Vector Representation

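[Figure: sparse vector for input frame j, consisting of an index array indj and a list listj. The stored entries are D1j, D2j, and D5j; the remaining entries are nil.]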

CDPP in a picture

[Figure: the classifier Ci(Qj) is evaluated using the sparse vector of the previous frame (indj-1, listj-1), whose unused entries are nil.]

Example Spotting Rules

Morguet&Lang 98 (Peak finding rules):

1. Cost must be a local minimum inside an interval centered at the current frame.

2. Cost must be smaller than a model-dependent threshold.

3. Cost must be lowest compared to all other model costs.

4. Cost must have a minimum temporal distance to the last valid peak found.
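The four peak-finding rules above can be checked with a short predicate. This is a sketch of the rules as listed on the slide, not the implementation of Morguet&Lang: the function name `is_valid_peak`, the data layout `costs[g][t]`, and the parameter names are assumptions made for this example.

```python
def is_valid_peak(costs, g, t, half_window, thresholds, last_valid, min_gap):
    """Check the four rules for model g at frame t.

    costs[g][t] is the matching cost of model g at frame t, thresholds[g]
    is the model-dependent threshold, and last_valid is the frame of the
    last accepted peak (or None if there is none yet).
    """
    c = costs[g][t]
    lo = max(0, t - half_window)
    hi = min(len(costs[g]), t + half_window + 1)
    rule1 = c == min(costs[g][lo:hi])                        # local minimum
    rule2 = c < thresholds[g]                                # below threshold
    rule3 = all(c <= other[t] for other in costs.values())   # best model
    rule4 = last_valid is None or t - last_valid >= min_gap  # temporal distance
    return rule1 and rule2 and rule3 and rule4


costs = {"2": [9, 5, 2, 6, 9], "5": [9, 8, 7, 8, 9]}
thresholds = {"2": 3, "5": 3}
```

With these toy costs, model “2” has a valid peak at frame 2 when no peak has been accepted before. Since rule 1 looks at frames after t, a decision about frame t can only be made `half_window` frames later.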

Temporal Matching: Continuous Dynamic Programming

[Figure sequence: DP table with model features M1, …, Mm (time i) against the input (time j), where Mi=(xi,yi) and Qj=(xj,yj).]

Local Cost: d(i,j), the distance between Mi and Qj

Cumulative Cost: D(i,j)=d(i,j)+min{D(i-1,j), D(i,j-1), D(i-1,j-1)}

Warping Path: W(i,j)=((1,j’),…,(i,j))

Cumulative Cost D(m,j) is used for spotting and recognition.

Pruning: Common Practice

Beam Search [Jelinek 97, Gao et al 00]
• Idea: only maintain promising hypotheses that have low cumulative costs within a “beam width” from the cumulative cost of the current best hypothesis.
• Works well in practice, but requires ad hoc setting of the beam width parameter.

Objective

Propose an accurate and efficient gesture spotting and recognition system that enables the most natural human-computer interaction.

Overview

Introduction
• Classification of Gesture Recognition Problems
• Gesture Spotting: Problem Definition
• Objective
• Applications

Approach
• Related Work: Continuous Dynamic Programming
• Pruning as a classification problem
• Subgesture reasoning

Experiments
• Order of magnitude speedup
• 18% improvement in accuracy

Pruning: Approach

[Figure: three views of the DP table for model “6” against input time j.]
• Likely Cells. Black: likely, d(i,j) ≤ τi. White: unlikely and pruned, d(i,j) > τi.
• Visited Cells. Black: visited. White: pruned.
• Visited Likely Cells. Black: likely and visited. White: unlikely and pruned. Gray: likely and pruned.

84% pruned cells, or a 6.25 times speedup.
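The two numbers quoted for this example are linked by simple arithmetic, assuming that run time is proportional to the number of cells that are actually visited (an assumption; the slide only states the two figures):

$$\text{speedup} \approx \frac{\text{all cells}}{\text{visited cells}} = \frac{1}{1 - 0.84} = \frac{1}{0.16} = 6.25$$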


Spotting: Detection of candidate gesture end point

[Figure: DP tables for models “0”, “2”, and “9”; each model's matching cost Dg(mg,j) is compared with a detection threshold.]


How to Prune?


[Figure: DP table for the model (digit “6”) with a fixed threshold. Legend: White: pruned cells, $d(M_i, Q_j) > 100$ pix; Black: visited cells; Red: optimal path.]

Answer: learn classifiers (thresholds $\tau_i$): maximize pruning subject to minimizing the expectation of pruning the optimal path.

[Figure: the same table with learned thresholds. Legend: White: pruned cells, $d(M_i, Q_j) > \tau_i$ pix; Black: visited cells; Red: optimal path.]



Demo
