accurate and efficient gesture spotting via pruning and subgesture reasoning
DESCRIPTION
Accurate and Efficient Gesture Spotting via Pruning and Subgesture Reasoning. Jonathan Alon, Vassilis Athitsos, and Stan Sclaroff, Computer Science Department, Boston University. Gesture Recognition Applications: Human Computer Interaction, Sign Language Analysis, Video Annotation.
Computer Science
Accurate and Efficient Gesture Spotting via Pruning and Subgesture Reasoning
Jonathan Alon, Vassilis Athitsos, and Stan Sclaroff
Computer Science Department, Boston University
Computer Science
Gesture Recognition Applications
Human Computer Interaction
Sign Language Analysis
Video Annotation
Command spotting to control:
• Computer Applications [Lee&Kim 99, Zhu et al 02]
• TV and Video games [Freeman et al 96, 99]
• Robots [Triesch 97]
UAV Guidance
Computer Science
Classification of Gesture Recognition Problems
Isolated: easier
Continuous: harder (requires Spotting and Recognition)
Computer Science
Gesture Spotting Problem
Given a vocabulary of gestures: locate the start and end frames of a gesture within a long video stream (and recognize the gesture).
[Figure: frames 334, 403, 733, and 836 of an example video stream, with segments labeled non-gesture, gesture “2”, and gesture “5”]
Computer Science
Overview
Objective
Propose an efficient and accurate gesture spotting and recognition system that enables more natural human-computer interaction.
Approach
1. Pruning method that views pruning as a classification (learning) problem.
2. Subgesture reasoning process that models the fact that a gesture may resemble a part of a longer gesture.
Experiments
• Order of magnitude speedup
• 18% improvement in accuracy
Computer Science
Gesture Spotting Framework
Indirect approach: spotting is intertwined with recognition:
Temporal Matching: Continuous Dynamic Programming (CDP) [Oka 98]
Spotting [Morguet&Lang 98, Lee&Kim 99]
[Diagram: Video Stream → Hand Detection + Feature Extraction → Feature Vector → Temporal Matching (using Gesture Models + Pruning Classifiers) → Matching Costs → Spotting (using Spotting Rules + Subgesture table) → Gesture id, start and end frames]
Computer Science
Hand Detection and Feature Extraction
Hand Detection: based on color and motion
Feature: (x,y) hand centroid
[Figure: Input Frame → Skin Likelihood and Frame Differencing → “Hand Likelihood” → Detected Hand]
Computer Science
Temporal Matching: Continuous Dynamic Programming (CDP)
[Figure: dynamic programming table; horizontal axis: input, time j; vertical axis: model, time i]
Computer Science
Temporal Matching: Continuous Dynamic Programming (CDP)
[Figure: dynamic programming table (input time j vs. model time i) with cell d(i,j) highlighted]
Local Cost: d(i,j)=L2(Mi,Qj)
Computer Science
Temporal Matching: Continuous Dynamic Programming (CDP)
[Figure: dynamic programming table (input time j vs. model time i) with cell D(i,j) highlighted]
Cumulative Cost: D(i,j)=d(i,j)+min{D(i-1,j), D(i,j-1), D(i-1,j-1)}
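The recurrence above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names are hypothetical, and the boundary conditions (a virtual row D(0,j)=0 so that a match may start at any input frame, and infinite cost elsewhere before the first frame) are an assumption that the slides do not spell out.

```python
import math

def l2(p, q):
    """Euclidean distance between two (x, y) features."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def cdp_end_costs(model, stream):
    """Return D(m, j) for every input frame j (hypothetical helper).

    model:  list of (x, y) model features M1..Mm
    stream: list of (x, y) input features Q1..Qn
    """
    m = len(model)
    # prev[i] holds D(i, j-1). Assumption: row 0 is a virtual start row
    # with cost 0, so a gesture may begin at any input frame.
    prev = [0.0] + [math.inf] * m
    end_costs = []
    for q in stream:
        cur = [0.0] + [math.inf] * m
        for i in range(1, m + 1):
            d = l2(model[i - 1], q)  # local cost d(i, j)
            # D(i,j) = d(i,j) + min{D(i-1,j), D(i,j-1), D(i-1,j-1)}
            cur[i] = d + min(cur[i - 1], prev[i], prev[i - 1])
        end_costs.append(cur[m])
        prev = cur
    return end_costs

if __name__ == "__main__":
    model = [(0, 0), (1, 0), (2, 0)]
    stream = [(5, 5), (0, 0), (1, 0), (2, 0), (5, 5)]
    print(cdp_end_costs(model, stream))
```

Only two columns are kept in memory at a time, which is enough to produce the end-of-model cost D(m,j) used for spotting; recovering the start frame would additionally require storing back-pointers.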
Computer Science
Temporal Matching: Continuous Dynamic Programming (CDP)
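[Figure: dynamic programming table (input time j vs. model time i) with a warping path W ending at cell (m,j); its cumulative cost is D(m,j)]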
Computer Science
Temporal Matching: Continuous Dynamic Programming (CDP)
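[Figure: dynamic programming table (input time j vs. model time i) with two complete paths ending at frames j1 and j2, with costs D(m,j1) and D(m,j2)]
D(m,j2) < D(m,j1)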
Computer Science
Spotting: Detection of candidate gesture end point
[Figure: matching cost Dg(mg,j) plotted against time j, with a horizontal detection threshold]
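As a rough illustration of the thresholding step, the sketch below returns the gesture models whose end-of-model matching cost falls below a detection threshold at the current frame. The function name, the dictionary layout, and the use of one threshold per model are assumptions made for this example.

```python
def firing_models(end_costs, thresholds):
    """Return the models whose cost Dg(mg, j) is below their threshold.

    end_costs:  {gesture label: Dg(mg, j)} for the current frame j
    thresholds: {gesture label: detection threshold} (assumed per model)
    """
    return sorted(g for g, cost in end_costs.items() if cost < thresholds[g])

if __name__ == "__main__":
    costs_at_j = {"2": 3.0, "5": 12.0}
    thresholds = {"2": 5.0, "5": 5.0}
    print(firing_models(costs_at_j, thresholds))
```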
Computer Science
Why Pruning?
Search time for the best matching model increases linearly with the number of gesture models. This can be too expensive for:
• Systems with large gesture vocabularies
• Real time applications
Efficient search methods [Gao et al 00]:
• Fast match, N-best search, A*, …
• Beam search
  - maintains promising hypotheses that have low matching costs within a “beam width” from the matching cost of the current best hypothesis.
  - requires ad hoc setting of “beam width”.
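For contrast with the learned pruning that follows, beam pruning as described above can be sketched as follows. The function name and the dictionary of hypothesis costs are hypothetical; the point is only that `beam_width` is a hand-set parameter.

```python
def beam_prune(costs, beam_width):
    """Keep hypotheses whose cost is within beam_width of the best cost.

    costs: {hypothesis id: matching cost}; beam_width is set by hand.
    """
    best = min(costs.values())
    return {h: c for h, c in costs.items() if c <= best + beam_width}

if __name__ == "__main__":
    print(beam_prune({"a": 1.0, "b": 2.5, "c": 9.0}, beam_width=2.0))
```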
Computer Science
Pruning: Novel Viewpoint
Pruning is a classification problem, so we can use any classifier, e.g., one based on cumulative cost, on observation cost, or on transition cost.
Classifiers can be learned from training data, instead of manually specifying “beam width”.
Pruning is decoupled from recognition.
Computer Science
Pruning: Motivating Example
If input feature j is too far from model feature i (d(i,j) > τi) then all paths going through cell (i,j) should be pruned.
For example, the start point of digit “5” is far from the start point of digit “2” both in terms of position and direction.
Computer Science
How to Prune?
Classifier learning objective: maximize pruning (white cell area) while minimizing the expected pruning of the optimal path (red).
[Figure: dynamic programming table, input (digit “6”) vs. model (digit “6”)]
Legend: White: pruned cells, d(Mi,Qj) > τi pix; Black: visited cells; Red: optimal path
Computer Science
Learning to prune: example classifier
1. Match every positive gesture example Mp with model M.
2. For every model feature Mi record all features Mp_j that match it (using DTW).
3. Let τi = max_j d(Mi, Mp_j), with the maximum taken over the recorded features Mp_j. The pruning classifier for model feature Mi is:
Ci(Qj) = +1 if d(Mi,Qj) ≤ τi
Ci(Qj) = −1 if d(Mi,Qj) > τi
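The three learning steps can be sketched as below. This is a minimal illustration under stated assumptions: a plain DTW with the same three-neighbor step pattern as the CDP recurrence is used for the alignment, features are (x, y) points compared with the L2 distance, and all function names are hypothetical.

```python
import math

def l2(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def dtw_path(model, example):
    """Align a whole example with the whole model; return matched (i, j) pairs."""
    m, n = len(model), len(example)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = l2(model[i - 1], example[j - 1])
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from the end cell to recover the optimal warping path.
    path, i, j = [], m, n
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    path.reverse()
    return path

def learn_thresholds(model, examples):
    """tau[i] = largest distance between Mi and any example feature matched to it."""
    tau = [0.0] * len(model)
    for example in examples:
        for i, j in dtw_path(model, example):
            tau[i] = max(tau[i], l2(model[i], example[j]))
    return tau

def prune_classifier(tau_i, distance):
    """Ci(Qj): +1 keeps the cell, -1 prunes it."""
    return 1 if distance <= tau_i else -1

if __name__ == "__main__":
    model = [(0, 0), (1, 0), (2, 0)]
    examples = [[(0, 1), (1, 0), (2, 2)], [(0, 0), (1, 0.5), (2, 0)]]
    print(learn_thresholds(model, examples))
```

Taking the maximum makes every training example pass its own model's classifiers; a looser or tighter threshold would trade pruning against the risk of cutting the optimal path.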
Computer Science
CDP with Pruning (CDPP)
Qj-1 Qj
M1
Mm
• Sparse vector representation: black cells are stored in memory
Computer Science
CDP with Pruning (CDPP)
C1(Qj) = ?
Qj-1 Qj
M1
Mm
Computer Science
CDP with Pruning (CDPP)
C1(Qj) = +1
Qj-1 Qj
M1
Mm
Computer Science
CDP with Pruning (CDPP)
Qj-1 Qj
M1
Mm
C2(Qj) = ?
Computer Science
CDP with Pruning (CDPP)
Qj-1 Qj
M1
Mm
C2(Qj) = +1
Computer Science
CDP with Pruning (CDPP)
Qj-1 Qj
M1
Mm
C3(Qj) = ?
Computer Science
CDP with Pruning (CDPP)
Qj-1 Qj
M1
Mm
C3(Qj) = -1
Skip to next cell that has a neighbor
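The walk-through above (classify each cell, keep it on +1, prune it on −1, and skip cells with no stored neighbor) can be sketched as follows. This is an illustrative reading of the slides, not the authors' code: a Python dict stands in for the sparse vector, the virtual start row with cost 0 is an assumption, and for simplicity the loop visits every model index and merely skips the work for cells without a stored neighbor.

```python
import math

def l2(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def cdpp_end_costs(model, tau, stream):
    """CDP with pruning: return D(m, j) per frame (inf if the cell was pruned).

    tau[i] is the learned distance threshold for model feature i.
    """
    m = len(model)
    prev = {0: 0.0}  # sparse column j-1: model index -> cumulative cost
    end_costs = []
    for q in stream:
        cur = {0: 0.0}  # assumption: virtual start row, cost 0
        for i in range(1, m + 1):
            # Costs of the stored neighbors D(i-1,j), D(i,j-1), D(i-1,j-1).
            neighbors = [col[k] for col, k in
                         ((cur, i - 1), (prev, i), (prev, i - 1)) if k in col]
            if not neighbors:
                continue  # no stored neighbor: nothing can reach this cell
            d = l2(model[i - 1], q)
            if d > tau[i - 1]:
                continue  # classifier output -1: prune the cell
            cur[i] = d + min(neighbors)
        end_costs.append(cur.get(m, math.inf))
        prev = cur
    return end_costs

if __name__ == "__main__":
    model = [(0, 0), (1, 0), (2, 0)]
    stream = [(5, 5), (0, 0), (1, 0), (2, 0), (5, 5)]
    print(cdpp_end_costs(model, [0.5, 0.5, 0.5], stream))
```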
Computer Science
Spotting
INPUT
• Matching costs in current frame j
• Current candidate gesture list (matching cost, duration)
Optional:
• Frame index of last detected gesture
• Response time
Spotting Rules
OUTPUT
• Detected gesture and gesture endpoint
OR
• New candidate gesture list
Computer Science
Nested Gestures
Which gesture to recognize? 5 or 8? 7 or 3? 1 or 9?
Computer Science
Nested Gestures
[Figure: matching cost plotted against time j]
Which gesture to recognize? 5 or 8?
Computer Science
Nested Gestures
Which gesture to recognize? 5 or 8?
Solution: subgesture table
Subgesture   Supergestures
0            9
1            4, 7, 9
4            2, 5, 6, 8, 9
5            8
7            2, 3, 9
• If a gesture is firing and at least one of its supergestures is firing, then wait; otherwise, recognize it.
• If a gesture is firing and it has no supergestures, then recognize it.
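The subgesture table and the two rules above translate directly into a lookup. The sketch below is illustrative only; the function name and the "idle" return value for a gesture that is not firing are additions made for this example.

```python
# Subgesture table from the slide: subgesture -> set of supergestures.
SUPERGESTURES = {
    0: {9},
    1: {4, 7, 9},
    4: {2, 5, 6, 8, 9},
    5: {8},
    7: {2, 3, 9},
}

def decide(gesture, firing):
    """Apply the subgesture rules to one gesture given the set of firing models."""
    if gesture not in firing:
        return "idle"  # hypothetical label: the rules only cover firing gestures
    supers = SUPERGESTURES.get(gesture, set())
    if supers & firing:
        return "wait"  # a supergesture is also firing
    return "recognize"  # no supergestures, or none of them is firing

if __name__ == "__main__":
    print(decide(5, {5, 8}))  # "5" must wait while "8" is firing
    print(decide(5, {5}))
    print(decide(3, {3}))     # "3" has no supergestures in the table
```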
Computer Science
Spotting Algorithm (1)
Update Candidate Gesture List:
1. Find all firing models.
2. Conduct subgesture competitions among firing models.
3. Find the best firing model.
4. For every candidate, perform overlapping and subgesture tests with respect to the best firing model.
5. Remove a candidate if it fails any test.
6. Add the best firing model if it passes all tests.
Computer Science
Spotting Algorithm (2)
Spot a candidate gesture if any of the following holds:
1. all of its active supergesture models started after the candidate's end frame j*.
2. all current active paths started after the candidate's end frame j*.
3. a specified number of frames have elapsed since the candidate's end frame j*.
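The three conditions can be written as one predicate. This is a sketch with hypothetical parameter names; it assumes the start frames of the relevant active paths are available, and it treats an empty list as satisfying "all started after j*", which the slides do not state explicitly.

```python
def should_spot(end_frame, frame, super_starts, path_starts, max_wait):
    """Decide whether to spot a candidate whose end frame is j* = end_frame.

    frame:        current input frame j
    super_starts: start frames of the candidate's active supergesture paths
    path_starts:  start frames of all currently active paths
    max_wait:     number of frames after j* at which the candidate is spotted anyway
    """
    rule1 = all(s > end_frame for s in super_starts)
    rule2 = all(s > end_frame for s in path_starts)
    rule3 = frame - end_frame >= max_wait
    return rule1 or rule2 or rule3

if __name__ == "__main__":
    print(should_spot(100, 105, [101], [90, 101], 30))
    print(should_spot(100, 105, [95], [95, 90], 30))
    print(should_spot(100, 140, [95], [95], 30))
```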
Computer Science
Experiments
Models: 2 users * 10 digits * 3 examples per digit.
Test: 2 users * 3 long sequences * 10 digits.
Sequence length: input: 1000-1500 frames; digit: 30-90 frames.
Example Sequence
Computer Science
Results
Accuracy:
Method           CDP     CDPP    CDPPS
Detection Rate   78.3%   85.0%   96.7%
False Matches    13      9       2
CDP = Continuous Dynamic Programming, CDPP = CDP with Pruning, CDPPS = CDP with Pruning and Subgesture Reasoning
Speedup: CDPP is 10 times faster than CDP.
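As a sanity check on these figures, the short script below converts the detection rates into gesture counts and computes the gains between methods. It assumes the test set holds 2 * 3 * 10 = 60 gesture instances (each digit once per test sequence), which is an inference from the Experiments slide rather than something the slides state.

```python
total = 2 * 3 * 10  # users * sequences * digits; assumed number of test gestures
rates = {"CDP": 78.3, "CDPP": 85.0, "CDPPS": 96.7}

# Nearest whole number of detected gestures for each reported rate.
counts = {name: round(rate / 100 * total) for name, rate in rates.items()}
for name, n in counts.items():
    print(f"{name}: {n}/{total} = {100 * n / total:.1f}%")

# Gains in percentage points.
gain_pruning = round(rates["CDPP"] - rates["CDP"], 1)
gain_subgesture = round(rates["CDPPS"] - rates["CDPP"], 1)
gain_total = round(rates["CDPPS"] - rates["CDP"], 1)
print(gain_pruning, gain_subgesture, gain_total)
```

Under that assumption the three rates correspond to whole counts of detected gestures, and the point gains line up with the rounded 7%, 12%, and 18% improvements quoted elsewhere in the talk.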
Computer Science
Conclusions
Pruning is a classification problem.
CDPP is an order of magnitude faster than CDP, and about 7 percentage points more accurate (detection rate 85.0% vs. 78.3%).
Reasoning about nested gestures improves recognition accuracy.
CDPPS improves the detection rate by roughly 12 additional points (to 96.7%).
Both pruning and subgesture reasoning can be applied to other dynamic models (e.g., HMMs).
Computer Science
Thank you
Computer Science
Ongoing Work
Learn:
1. Pruning classifiers using cross-validation.
2. Subgesture table.
3. Gesture verifiers.
Compare pruning method to Beam Search.
Handle multiple candidate hand hypotheses.
Apply methods to automatic sign language transcription.
Computer Science
Towards Automatic Annotation of American Sign Language
Additional challenges:
• Users not cooperative: fast gesture speeds; variation between users.
• Significant variation in hand shape and appearance.
• Different types of gestures: finger spelling, one vs. two handed.
Computer Science
Gesture Spotting: Related Work
Direct approach [Kang et al. 04, Kahol et al. 04]
Spotting precedes recognition.
1. Compute low-level motion parameters, such as velocity, acceleration, trajectory curvature.
2. Look for abrupt changes (zero-crossings) in those parameters to find candidate gesture boundaries.
Indirect approach [Morguet&Lang 98, Lee&Kim 99]
Spotting is intertwined with recognition.
1. Compute input-to-model matching costs.
2. Look for a low cost to detect a candidate gesture end point. (The gesture start point can be found by backtracking the optimal dynamic programming path.)
Computer Science
Approach: Continuous Dynamic Programming (CDP) [Oka 98]
[Figure: the input is matched against the models “0”, “2”, and “9”; each cell (i,j), for model feature Mi and input feature Qj, is reached from the cells at Mi-1 and Qj-1]
D(i,j) = d(i,j) + min{D(i-1,j), D(i,j-1), D(i-1,j-1)}
d(i,j): distance between model feature Mi and input feature Qj.
D(i,j): cumulative distance between model M(1:i) and input subsequence Q(j’:j).
Continuous and Monotonic Warping Path
Computer Science
Approach: Continuous Dynamic Programming (CDP)
D(i,j) = d(i,j) + min{D(i-1,j), D(i,j-1), D(i-1,j-1)}
d(i,j): distance between model feature Mi and input feature Qj.
D(i,j): cumulative distance between model M(1:i) and input subsequence Q(j’:j).
[Figure: dynamic programming tables for models “0”, “2”, and “9” (model time i vs. input time j) with an accept threshold; the optimal warping path (continuous & monotonic) runs from the gesture start to the gesture end]
Computer Science
Conclusions
CDPP is an order of magnitude faster than CDP, and 7% more accurate than CDP.
CDPPS improves accuracy by an additional 12%.
Both pruning and subgesture reasoning can be applied to Hidden Markov Models (HMMs).
Future Work:
Learn:
1. Subgesture table.
2. Gesture Transition Classifiers and Subsequence Classifiers.
3. Gesture Verifiers.
Apply methods to spot signs in American Sign Language (ASL) sequences (e.g., utterances, stories, and dialogs).
Computer Science
Gesture Types (Channels)
Head Gesture
Body Gesture
Hand Gesture
Computer Science
Gesture Spotting: Related Work
Indirect approach [Morguet&Lang 98, Lee&Kim 99]
Spotting is intertwined with recognition.
0. Detect hands and extract features.
1. Compute input-to-model matching costs.
2. Look for low cost to detect candidate gesture end point.
Computer Science
Pruning: Motivation
Detection and Tracking: Where (in the image is the gesture performed)?
Spotting: When (does the gesture start and end)?
Recognition: What (gesture)?
Search complexity can be high! | Where | * | When | * | What |
Computer Science
Gesture End Point Detection and Gesture Recognition
The algorithm is invoked for every input frame j, and consists of two steps:
1. Update the current list of candidate gesture models.
2. Apply a set of spotting rules to decide whether or not a gesture was spotted and, if so, which gesture model it matches.
Computer Science
End Point Detection Definitions
Paths
Complete Path W(M1:m, Qj’:j): a legal warping path matching the input subsequence Qj’:j with the complete model M1:m.
Partial Path W(M1:i, Qj’:j): a legal warping path matching the input subsequence Qj’:j with part of the model M1:m.
Active Path: a partial path that has not been pruned.
Computer Science
End Point Detection Definitions
Models
Active Model g: a model that has a complete path ending at the current input frame j.
Firing Model g: an active model with a cost below the detection acceptance threshold.
Subgesture Relationship: a gesture g1 is a subgesture of gesture g2 if it is properly contained in g2. In this case, g2 is a supergesture of g1.
Computer Science
Spotting Rules (1)
Zhu et al. 02 (Spotting rules)
Based on Baudel & Beaudouin-Lafon’s Interaction Model.
1. A moving hand appears in the sequence.
2. The moving hand is the dominant moving object.
3. The movement of the hand follows a three-stage process: preparation, stroke, and retraction [Kendon].
4. The duration of the stroke T is bounded, T1≤T≤T2, for a given sampling rate.
Computer Science
Spotting Rules (3)
Lee&Kim 99 (End-point detection):
Computer Science
Gesture Spotting: Applications
Command spotting for controlling:
• Computer Applications [Lee&Kim 99, Zhu et al 02]
• TV and Video games [Freeman et al 96, 99]
• Robots [Triesch 97]
Sign Language Analysis [Starner&Pentland 95, Vogler&Metaxas 99,…]
[Cui&Weng 96, Yang&Ahuja 99, Bowden et al 04, folks at ucf]
Computer Science
Implementation Details
We use a circular buffer of fixed length (e.g., 150 frames) to implement the sliding window concept.
We use a sparse vector representation that enables fast individual element access (compared to fast matrix vector operations as in Matlab).
We sacrifice memory in favor of efficiency and no fragmentation by preallocating memory for the sparse vectors to their max. capacity (model length).
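One way to picture these three points together is the sketch below: a ring of preallocated sparse columns, one per frame in the sliding window. The class and method names are hypothetical, and the layout (an index array with None for unstored cells plus compact arrays of stored entries) is a guess at the structure described here, not the authors' code.

```python
class SparseColumn:
    """Sparse vector of cumulative costs for one input frame."""

    def __init__(self, capacity):
        # Everything is preallocated to the model length (max. capacity).
        self.ind = [None] * capacity   # model index -> slot, None if not stored
        self.rows = [0] * capacity     # slot -> model index
        self.costs = [0.0] * capacity  # slot -> cumulative cost
        self.size = 0

    def clear(self):
        for slot in range(self.size):
            self.ind[self.rows[slot]] = None
        self.size = 0

    def set(self, i, cost):
        slot = self.ind[i]
        if slot is None:  # first write to this cell: take the next free slot
            slot = self.size
            self.size += 1
            self.ind[i] = slot
            self.rows[slot] = i
        self.costs[slot] = cost

    def get(self, i):
        slot = self.ind[i]
        return None if slot is None else self.costs[slot]


class ColumnRing:
    """Circular buffer of columns implementing a fixed-length sliding window."""

    def __init__(self, window, capacity):
        self.window = window
        self.cols = [SparseColumn(capacity) for _ in range(window)]

    def start_frame(self, j):
        """Reuse (and clear) the column that frame j overwrites."""
        col = self.cols[j % self.window]
        col.clear()
        return col


if __name__ == "__main__":
    ring = ColumnRing(window=150, capacity=40)
    col = ring.start_frame(0)
    col.set(4, 12.5)
    print(col.get(4), col.get(5))
```

Element access is a constant-time array lookup, and no memory is allocated after construction, at the price of reserving full-length arrays for every column in the window.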
Computer Science
Sparse Vector Representation
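[Figure: sparse vector for input frame j: an index array indj, with nil entries for cells that are not stored, pointing into a compact list listj of stored cumulative costs (e.g., D1j, D2j, D5j)]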
Computer Science
CDPP in a picture
Ci(Qj) = ?
[Figure: sparse vector for input frame j-1: index array indj-1, with nil entries for cells that are not stored, and list listj-1]
Computer Science
Example Spotting Rules
Morguet&Lang 98 (Peak finding rules):
1. Cost must be a local minimum inside interval centered at current frame.
2. Cost must be smaller than a model-dependent threshold.
3. Cost must be lowest compared to all other model costs.
4. Cost must have a minimum temporal distance to the last valid found.
Computer Science
Temporal Matching: Continuous Dynamic Programming
[Figure: dynamic programming table; input features Qj=(xj,yj) along time j, model features Mi=(xi,yi), from M1 to Mm, along time i; cell d(i,j) highlighted]
Local Cost: d(i,j)=L2(Mi,Qj)
Computer Science
Temporal Matching: Continuous Dynamic Programming
[Figure: dynamic programming table; input features Qj=(xj,yj) along time j, model features Mi=(xi,yi), from M1 to Mm, along time i; cell D(i,j) highlighted]
Cumulative Cost: D(i,j)=d(i,j)+min{D(i-1,j), D(i,j-1), D(i-1,j-1)}
Computer Science
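Temporal Matching: Continuous Dynamic Programming
[Figure: dynamic programming table; input features Qj=(xj,yj) along time j, model features Mi=(xi,yi), from M1 to Mm, along time i; warping path W(i,j) starting at input feature Qj’]
Warping Path: W(i,j)=((1,j’),…,(i,j))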
Computer Science
Temporal Matching: Continuous Dynamic Programming
[Figure: dynamic programming table; input features Qj=(xj,yj) along time j, model features Mi=(xi,yi), from M1 to Mm, along time i; cell D(m,j) highlighted]
Cumulative Cost D(m,j) is used for spotting and recognition
Computer Science
Pruning: Common Practice
Beam-Search [Jelinek 97, Gao et al 00]
Idea: only maintain promising hypotheses that have low cum. costs within a “beam width” from the cum. cost of the current best hypothesis.
Works well in practice, but requires ad hoc setting of the beam width parameter.
Computer Science
Objective
Propose an accurate and efficient gesture spotting and recognition system that enables more natural human-computer interaction.
Computer Science
Overview
Introduction
• Classification of Gesture Recognition Problems
• Gesture Spotting: Problem Definition
• Objective
• Applications
Approach
• Related Work: Continuous Dynamic Programming
• Pruning as a classification problem
• Subgesture reasoning
Experiments
• Order of magnitude speedup
• 18% improvement in accuracy
Computer Science
Pruning: Approach
[Figure: three dynamic programming tables for model “6” over input time j]
Likely Cells: Black: likely, d(i,j) ≤ τi; White: unlikely & pruned, d(i,j) > τi
Visited Likely Cells: Black: likely & visited; White: unlikely & pruned; Gray: likely & pruned
Visited Cells: Black: visited; White: pruned
84% pruned cells, or 6.25 speedup
Computer Science
Spotting: Detection of candidate gesture end point
[Figure: DP tables for models “0”, “2”, and “9”; the matching cost Dg(mg,j) of each model is compared against a detection threshold]
Computer Science
How to Prune?
[Figure sequence: dynamic programming table for input (digit “6”) vs. model (digit “6”), repeated as a fixed pruning threshold is tightened step by step]
Legend: White: pruned cells, d(Mi,Qj) > 100, 40, 30, 20, 15, and 10 pix on successive slides; Black: visited cells; Red: optimal path
How to Prune?
Answer: learn classifiers (thresholds τi): maximize pruning while minimizing the expected pruning of the optimal path.
[Figure: dynamic programming table, input (digit “6”) vs. model (digit “6”)]
Legend: White: pruned cells, d(Mi,Qj) > τi pix; Black: visited cells; Red: optimal path
Computer Science
Demo