Beyond Controllers - Human Segmentation, Pose, and Depth Estimation as Game Input Mechanisms
TRANSCRIPT
-
8/13/2019 Beyond Controllers - Human Segmentation, Pose, And Depth Estimation as Game Input Mechanisms
1/166
Beyond Controllers
Human Segmentation, Pose, and Depth Estimation as Game Input Mechanisms
Glenn Sheasby
Thesis submitted in partial fulfilment of the requirements of the award of
Doctor of Philosophy
Oxford Brookes University
in collaboration with Sony Computer Entertainment Europe
December 2012
Abstract
Over the past few years, video game developers have begun moving away from the traditional methods of user input through physical hardware interactions, such as controllers or joysticks, and towards acquiring input via an optical interface, such as an infrared depth camera (e.g. the Microsoft Kinect) or a standard RGB camera (e.g. the PlayStation Eye). Computer vision techniques form the backbone of both input devices, and in this thesis, the latter method of input will be the main focus.
In this thesis, the problem of human understanding is considered, combining segmentation and pose estimation. While focussing on these tasks, we examine the stringent challenges associated with the implementation of these techniques in games, noting particularly the speed required for any algorithm to be usable in computer games. We also keep in mind the desire to retain information wherever possible: algorithms which put segmentation and pose estimation into a pipeline, where the results of one task are used to help solve the other, are prone to discarding potentially useful information at an early stage. By sharing information between the two problems and depth estimation, we show that the results of each individual problem can be improved.
We adapt Wang and Koller's dual decomposition technique to take stereo information into account, and tackle the problems of stereo, segmentation, and human pose estimation simultaneously. In order to evaluate this approach, we introduce a novel, large dataset featuring nearly 9,000 frames of fully annotated humans in stereo.
Our approach is extended by the addition of a robust stereo prior for segmentation, which improves information sharing between the stereo correspondence and human segmentation parts of the framework. This produces an improvement in segmentation results. Finally, we increase the speed of our framework by a factor of 20, using a highly efficient filter-based mean field inference approach. The results of this approach compare favourably to the state of the art in segmentation and pose estimation, improving on the best results in these tasks by 6.5% and 7% respectively.
Acknowledgements
Okay... now what?
(Mike Slackenerny, PhD comic #844)
It is finished. Although the PhD thesis is a beast that must be tamed in solitude, I don't believe it's something that can be done entirely alone, and there are many people to whom I owe a debt of gratitude.
My supervisor, Phil Torr, made it possible for me to get started in the first place, and gave me immeasurable help along the way. While we're talking about how I came to be doing a PhD, I should also thank my old boss, Andrew Stoddart, who recommended that I apply, and the recession for costing me the software job I was doing after leaving student life for the first time. I guess my escape velocity wasn't high enough, moving only two miles from my first alma mater. I'm about 770 miles away now, so that should be enough!
My colleagues at Brookes also helped immensely, from those who helped me settle in: David Jarzebowski, Jon Rihan, Chris Russell, Lubor Ladicky, Karteek Alahari, Sam Hare, Greg Rogez, and Paul Sturgess; to those who saw me off at the end of it: Paul Sturgess, Sunando Sengupta, Michael Sapienza, Ziming Zhang, Kyle Zheng, and Ming-Ming Cheng. Special thanks are due to Morten Lindegaard, who proof-read large chunks of this thesis, and to my co-authors: Julien Valentin, Vibhav Vineet, Jonathan Warrell, and my second supervisor, Nigel Crook.
Financial support from the EPSRC partnership with Sony is gratefully acknowledged,
and weekly meetings and regular feedback from Diarmid Campbell helped to guide and
focus my research. Furthermore, Amir Saffari and the rest of the crew at SCEE London
Studio provided a dataset, as well as feedback from a professional perspective.
I'd also like to thank my examiners, Teo de Campos, Mark Bishop, and David Duce,
for taking the time to read my thesis, and for providing useful feedback and engaging
discussion during the viva.
While struggling through my PhD years, I was kept sane in Oxford by a variety of groups, including the prayer group at St. Mary Magdalene's, Brookes Ultimate Frisbee, and of course, the Oxford University Bridge Club, where I spent many Monday evenings exercising my mind (and liver), and where I met my wonderful fiancée, the future Dr. Mrs. Dr. Sheasby, Aleksandra: all our successes are shared, but this one I owe to you alone. You believed in me even when I did not believe in myself, and for that I will be grateful to you for the rest of my life.
Lastly, but most importantly, I'd like to thank my parents, for raising me, for supporting me in all of my endeavours, and for teaching me to question everything.
Contents
List of Figures 7
List of Tables 9
List of Algorithms 11
1 Introduction 13
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 Vision in Computer Games: A Brief History 19
2.1 Motion Sensors: Nintendo Wii . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.1 Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 RGB Cameras: EyeToy and Playstation Eye . . . . . . . . . . . . . . . . 22
2.2.1 Early Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2 Antigrav . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.3 Eye of Judgment . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.4 EyePet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.5 Wonderbook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Depth Sensors: Microsoft Kinect . . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 Technical Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.2 Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.3 Vision Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.4 Overall Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 State of the Art in Selected Vision Algorithms 33
3.1 Inference on Graphs: Energy Minimisation . . . . . . . . . . . . . . . . . 34
3.1.1 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . 34
3.1.2 Submodular Terms . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.3 The st-Mincut Problem . . . . . . . . . . . . . . . . . . . . . . 36
3.1.4 Application to Image Segmentation . . . . . . . . . . . . . . . 38
3.2 Inference on Trees: Belief Propagation . . . . . . . . . . . . . . . . . . . 40
3.2.1 Message Passing . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.2 Belief Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Human Pose Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1 Pictorial Structures . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.2 Flexible Mixtures of Parts . . . . . . . . . . . . . . . . . . . . . . 47
3.3.3 Unifying Segmentation and Pose Estimation . . . . . . . . . . . . 50
3.4 Stereo Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.1 Humans in Stereo . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4 3D Human Pose Estimation in a Stereo Pair of Images 57
4.1 Joint Inference via Dual Decomposition . . . . . . . . . . . . . . . . . . . 58
4.1.1 Introduction to Dual Decomposition . . . . . . . . . . . . . . . . 59
4.1.2 Related Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Humans in Two Views (H2view) Dataset . . . . . . . . . . . . . . . . . . 67
4.2.1 Evaluation Metrics Used . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 Inference Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.1 Segmentation Term . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.2 Pose Estimation Term . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3.3 Stereo Term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.4 Joint Estimation of Pose and Segmentation . . . . . . . . . . . . . 81
4.3.5 Joint Estimation of Segmentation and Stereo . . . . . . . . . . . . 82
4.4 Dual Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.4.1 Binarisation of Energy Functions . . . . . . . . . . . . . . . . . . 84
4.4.2 Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4.3 Solving Sub-Problem L1 . . . . . . . . . . . . . . . . . . . . . 88
4.4.4 Solving Sub-Problem L2 . . . . . . . . . . . . . . . . . . . . . 88
4.4.5 Solving Sub-Problem L3 . . . . . . . . . . . . . . . . . . . . . 89
4.5 Weight Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.5.1 Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.6.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.6.2 Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5 A Robust Stereo Prior for Human Segmentation 103
5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.1.1 Range Move Formulation . . . . . . . . . . . . . . . . . . . . . 106
5.2 Flood Fill Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.3 Application: Human Segmentation . . . . . . . . . . . . . . . . . . . 112
5.3.1 Original Formulation . . . . . . . . . . . . . . . . . . . . . . . 112
5.3.2 Stereo Term fD . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.3.3 Segmentation Terms fS and fSD . . . . . . . . . . . . . . . . . 114
5.3.4 Pose Estimation Terms fP and fPS . . . . . . . . . . . . . . . 115
5.3.5 Energy Minimisation . . . . . . . . . . . . . . . . . . . . . . . 116
5.3.6 Modifications to D Vector . . . . . . . . . . . . . . . . . . . . 118
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.4.1 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.4.2 Pose Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.4.3 Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6 An Efficient Mean Field Based Method for Joint Estimation of Human
Pose, Segmentation, and Depth 125
6.1 Mean Field Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.1.1 Introduction to Mean-Field Inference . . . . . . . . . . . . . . . . 128
6.1.2 Simple Illustration . . . . . . . . . . . . . . . . . . . . . . . . 130
6.1.3 Performance Comparison: Mean Field vs Graph Cuts . . . . . 131
6.2 Model Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2.1 Joint Energy Function . . . . . . . . . . . . . . . . . . . . . . . . 132
6.3 Inference in the Joint Model . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.4.1 Segmentation Performance . . . . . . . . . . . . . . . . . . . . . . 137
6.4.2 Pose Estimation Performance . . . . . . . . . . . . . . . . . . . . 137
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7 Conclusions and Future Work 143
7.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . 143
7.2 Directions for Future Research . . . . . . . . . . . . . . . . . . . . . . . . 144
Bibliography 147
List of Figures
2.1 Duck Hunt screenshot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Wii Sensor Bar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Putting action from Wii Sports . . . . . . . . . . . . . . . . . . . . . 21
2.4 EyeToy and Playstation Eye cameras . . . . . . . . . . . . . . . . . . 22
2.5 EyeToy: Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Antigrav . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7 Eye of Judgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.8 EyePet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.9 Wonderbook design . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.10 A Wonderbook scene . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.11 Kinect games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.12 Furniture removal guidelines in Kinect instruction manual . . . . . . . . 30
3.1 Image for our toy example. . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Segmentation results on the toy image . . . . . . . . . . . . . . . . . . . 40
3.3 Skeleton Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4 Part models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5 Yang-Ramanan skeleton model . . . . . . . . . . . . . . . . . . . . . . . 49
3.6 Stereo example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1 Subgradient example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Dual functions versus the cost variable. . . . . . . . . . . . . . . . . . . 66
4.3 Values of the dual function g() . . . . . . . . . . . . . . . . . . . . . 66
4.4 Accuracy of Part Proposals . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5 CRF Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6 Foreground weightings on a cluttered image from the Parse dataset . . . 76
4.7 Results using just fS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.8 Part selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.9 Limb recovery due to J1 term . . . . . . . . . . . . . . . . . . . . . 83
4.10 Master-slave update process . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.11 Decision tree: parameter optimisation . . . . . . . . . . . . . . . . . . . . 94
4.12 Sample stereo and segmentation results . . . . . . . . . . . . . . . . 97
4.13 Segmentation results on H2view . . . . . . . . . . . . . . . . . . . . 98
4.14 Results from H2view dataset . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.1 Flood fill example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.2 Three successive range expansion iterations . . . . . . . . . . . . . . . . . 109
5.3 The new master-slave update process . . . . . . . . . . . . . . . . . . . . 117
5.4 Segmentation results on H2View . . . . . . . . . . . . . . . . . . . . . . . 119
5.5 Comparison of segmentation results on H2View . . . . . . . . . . . . . . 120
5.6 Failure cases of segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.1 Segmentation of the Tree image . . . . . . . . . . . . . . . . . . . . 127
6.2 Basic 6-part skeleton model . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.3 Segmentation results on H2view compared to other methods . . . . . . . 138
6.4 Further segmentation results . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.5 Qualitative results on H2View dataset . . . . . . . . . . . . . . . . . . . 141
List of Tables
4.1 Table of Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Evaluation of fS only . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 Evaluation of fS combined with fPS . . . . . . . . . . . . . . . . . . 82
4.4 List of weights learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.5 Evaluation of segmentation performance . . . . . . . . . . . . . . . . . . 96
4.6 Dual Decomposition results . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.1 Segmentation results on the H2View dataset . . . . . . . . . . . . . . . . 121
5.2 Results (given in % PCP) on the H2view test sequence. . . . . . . . . . . 122
6.1 Evaluation of mean field on the MSRC-21 dataset . . . . . . . . . . . . . 131
6.2 Quantitative segmentation results on the H2View dataset . . . . . . . . . 137
6.3 Pose estimation results on the H2View dataset . . . . . . . . . . . . . . . 140
List of Algorithms
4.1 Parameter optimisation algorithm for dual decomposition framework . . 93
5.1 Generic flood fill algorithm for an image I of size W × H . . . . . . . . 110
5.2 doLinearFill: perform a linear fill from the seed point (sx, sy) . . . . . 111
6.1 Naïve mean field algorithm for fully connected CRFs . . . . . . . . . . . 130
Chapter 1
Introduction
Over the past several years, a wide range of commercial applications of computer vision have begun to emerge, such as face detection in cameras, augmented reality (AR) in shop displays, and the automatic construction of image panoramas. Another key application of computer vision that has become popular recently is computer games, with commercial products such as Sony's Playstation Eye and Microsoft's Kinect selling millions of units [98].
In creating these products, video game developers have been able to partially expand
the demographic of players. They have done this by moving away from the traditional
controller pad method of user input, and enabling the player to control the game using
other objects, such as books or AR markers, and even their own bodies. Some of the
most popular games that are either partially or completely driven using human motion
include sports games such as Wii Sports and Kinect Sports, and party games such as
the EyeToy: Play series. More recent games, such as EyePet and Wonderbook: Book of
Spells, combine motion information with object detection. A more thorough description
of these games can be found in Chapter 2.
Three main computer vision techniques are used to obtain input instructions for these
games: motion detection, object detection, and human pose estimation. The first of these,
motion detection, involves detecting changes in image intensity across several frames; in
video games, motion detection is used in particular areas of the screen as the player
attempts to complete tasks. Secondly, object detection involves determining the presence, position and orientation of particular objects in the frame. The object can be a simple shape (e.g. a quadrilateral) or a complex articulated object, such as a cat. In certain video games, the detection of AR markers is used to add computer graphics to an image of the player's surroundings. Finally, the goal of human pose estimation is to determine the position and orientation of each of a person's body parts. Using images obtained via an infrared depth sensor, Kinect games can track human poses over several frames, in order to detect actions [110].
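Of these three techniques, motion detection is the simplest to sketch: a region of the screen is watched, and an action is triggered when enough pixels change between consecutive frames. The following toy example illustrates the idea with simple frame differencing; the function name, thresholds, and synthetic frames are assumptions for illustration, not any shipped game's code.

```python
import numpy as np

def motion_in_region(prev_frame, curr_frame, region, threshold=25, min_fraction=0.02):
    """Report whether enough pixels changed inside a screen region.

    prev_frame, curr_frame: greyscale images as 2-D uint8 arrays.
    region: (top, left, bottom, right) bounds of the area to watch.
    threshold: per-pixel intensity change treated as 'motion'.
    min_fraction: fraction of changed pixels needed to trigger.
    """
    top, left, bottom, right = region
    a = prev_frame[top:bottom, left:right].astype(np.int16)
    b = curr_frame[top:bottom, left:right].astype(np.int16)
    changed = np.abs(b - a) > threshold
    return changed.mean() >= min_fraction

# A static scene triggers nothing; a bright patch appearing does.
prev = np.zeros((120, 160), dtype=np.uint8)
curr = prev.copy()
curr[40:60, 50:70] = 200          # simulated hand movement
print(motion_in_region(prev, curr, (30, 40, 70, 80)))  # True
print(motion_in_region(prev, prev, (30, 40, 70, 80)))  # False
```

In a real game, such a test would run per frame over each interactive screen area, which is why this approach is cheap enough for the hardware of the era.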
Theoretically, an image contains a lot more information than a controller can supply. However, the player can only provide information via a relatively limited set of actions, either with their own body, or using some kind of peripheral object which can be recognised.
The main aim of this thesis is to explore and expand the applicability of human
pose estimation to video games. After an analysis of the techniques that have already
been used, and of the current state of these techniques in research, our main application
will be presented. Using a stereo pair of cameras, we will develop a system that unifies
human segmentation, pose estimation, and depth estimation, solving the three tasks
simultaneously. In order to evaluate this system, we will present a large dataset containing
stereo images of humans in indoor environments where video games might be played.
1.1 Contributions
In summary, the principal contributions of this thesis are as follows:
- A system for the simultaneous segmentation and pose estimation of humans, as well as depth estimation of the entire scene. This system is further developed by the introduction of a stereo-based prior; the speed of the system is subsequently improved by applying a state-of-the-art approximate inference technique.
- The introduction of a novel, 9,000-image dataset of humans in two views.
Throughout the thesis, the pronoun "we" is used instead of "I". This is done to follow scientific convention; the contents of this thesis are the work of the author. Where others have contributed towards the work, their collaborations will be attributed in a short section at the end of each chapter.
1.2 Outline of the Thesis
Chapter 2 contains a description of some of the various attempts that games developers have made to provide alternatives to controllers, and the impact that these games have had on the video games community. Starting with the accelerometer- and infrared-detection-based solutions provided by the Nintendo Wii, we observe the increasing integration of vision techniques, a trend demonstrated by the methods used by Sony's EyeToy and PlayStation Eye-based games over the past several years. Finally, we consider the impact that depth information can have in enabling the software to determine the pose of the player's body, as shown by the Microsoft Kinect.
Following on from that, Chapter 3 contains an appraisal of related work in computer
vision that might be applied in computer games. We consider the different approaches
commonly used to solve the problems of segmentation and human pose estimation, and
give an overview of some of the approaches that have been used to provide 3D information
given a pair of images from a stereo camera.
Chapter 4 describes a novel framework for the simultaneous depth estimation of a
scene, and segmentation and pose estimation of the humans within that scene. Using a
stereo pair of images as input provides us with the ability to compute the distance of each
pixel from the camera; additionally, we can use standard approaches to find the pixels
occupied by the human, and predict its pose. In order to share information between
these three approaches, we employ a dual decomposition framework [62,127]. Finally, to
evaluate the results obtained by our method, we introduce a new dataset, called Humans
in Two Views, which contains almost 9,000 stereo pairs of images of humans.
In Chapter 5, we extend this approach to improve the quality of information shared
between the segmentation and depth estimation parts of the algorithm. Observing that the human occupies a continuous region of the camera's field of view, we infer that the distance of human pixels from the camera will vary only in certain ways, without sharp boundaries (we say that the depth is smooth). Therefore, starting from pixels that we are very confident lie within the human, we can extract a reliable initial segmentation from the depth map, significantly improving the overall segmentation results.
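This seed-and-grow idea can be sketched as a depth-aware flood fill: starting from confident human pixels, a neighbour is added only while the depth changes smoothly. The sketch below is an illustrative toy, not the prior developed in Chapter 5; the tolerance value and the tiny depth map are assumptions.

```python
from collections import deque

def depth_flood_fill(depth, seeds, tolerance=0.3):
    """Grow a segmentation from confident 'human' seed pixels, adding a
    4-connected neighbour whenever its depth differs from the current
    pixel's by less than `tolerance` (the smoothness assumption).

    depth: 2-D list of depth values (metres); seeds: list of (row, col).
    """
    rows, cols = len(depth), len(depth[0])
    mask = [[False] * cols for _ in range(rows)]
    queue = deque(seeds)
    for r, c in seeds:
        mask[r][c] = True
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not mask[nr][nc] \
                    and abs(depth[nr][nc] - depth[r][c]) < tolerance:
                mask[nr][nc] = True
                queue.append((nr, nc))
    return mask

# A person at ~2 m in front of a wall at ~4 m: the fill stops at the
# sharp depth boundary and never reaches the background.
depth = [[4.0, 4.0, 4.0, 4.0],
         [4.0, 2.0, 2.1, 4.0],
         [4.0, 2.1, 2.2, 4.0]]
mask = depth_flood_fill(depth, [(1, 1)])
print(mask[1][2], mask[0][0])  # True False
```

Because the fill only ever crosses small depth steps, the sharp boundary between person and background acts as a natural barrier, which is what makes the resulting segmentation prior reliable.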
The drawback of the dual decomposition-based approach, however, is that it is much too slow to be used in computer games. In Chapter 6, we adapt our framework in order to apply an approximate, but very fast, inference approach based on mean field [64]. Our use of this new inference approach enables us to improve the information sharing between the three parts of the framework, providing an improvement in accuracy, as well as an order-of-magnitude speed improvement.
While the mean-field inference approach is much quicker than the dual decomposition-based approach, its speed (close to 1 fps) is still not fast enough for real-time applications such as computer games. In Chapter 7, the thesis concludes with some suggestions for how to further improve the speed, as well as some other promising possible directions for future research. The concluding chapter also contains a summary of the work presented and the contributions made.
1.3 Publications
Several chapters of this thesis first appeared as conference publications, as follows:
- G. Sheasby, J. Warrell, Y. Zhang, N. Crook, and P.H.S. Torr. Simultaneous human segmentation, depth and pose estimation via dual decomposition. In British Machine Vision Conference, Student Workshop, 2012. (Chapter 4, [108])
- G. Sheasby, J. Valentin, N. Crook, and P.H.S. Torr. A robust stereo prior for human segmentation. In Asian Conference on Computer Vision (ACCV), 2012. (Chapter 5, [107])
The contributions of co-authors are acknowledged in the corresponding chapters. The first paper [108] received the best student paper award at the BMVC workshop. Additionally, some sections of Chapter 6 form part of a paper that is currently under submission at a major computer vision conference.
Chapter 2
Vision in Computer Games: A Brief History
Figure 2.1: A screenshot [84] from Duck Hunt, an early example of a game that used
sensing technology.
The purpose of a game controller is to convey the user's intentions to the game. A wide variety of input methods, for instance a mouse and keyboard, a handheld controller, or a joystick, have been employed for this purpose. Video games using some sort of sensing technology (instead of, or in addition to, those listed above) have been available for several decades. In 1984, Nintendo released a light gun, which detects light emitted by CRT monitors; this release was made popular by the game Duck Hunt for the Nintendo
Figure 2.2: The sensor bar, which emits infrared light that is detected by Wii Remotes. The picture [81] was taken with a camera sensitive to infrared light; the LEDs are not visible to the human eye.
Entertainment System (NES), in which the player aimed the gun at ducks that appeared on the screen (Figure 2.1). When the trigger is pulled, the screen is turned black for one frame, and then the target area is turned white in the next frame. If the gun is pointed at the correct place, it detects this change in intensity, and registers a hit.
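This flash-frame scheme reduces to a single comparison: during the all-black frame the gun's photodiode should read dark, and during the white-target frame it should read bright only if the gun is aimed at the target. The sketch below is hypothetical; the readings and threshold are invented for illustration, not taken from the NES hardware.

```python
def light_gun_hit(black_frame_reading, target_frame_reading, jump_threshold=100):
    """Toy Duck Hunt hit test: compare the photodiode reading during the
    all-black frame with the reading during the frame in which only the
    target area is lit white. A large jump means the gun was on target."""
    return target_frame_reading - black_frame_reading > jump_threshold

# Aimed at the duck: the black frame reads near 0, the white target reads high.
print(light_gun_hit(5, 240))   # True  -> hit registered
# Aimed elsewhere: both frames read dark.
print(light_gun_hit(5, 10))    # False -> miss
```

The black frame also guards against false positives: if the sensor already reads bright before the target is drawn (e.g. the gun is pointed at a lamp), the jump between frames stays small and no hit is registered.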
Over the past few years, technological developments have made it easier for video
game developers to incorporate sensing devices to augment, or in some cases replace, the
traditional controller pad method of user input. These devices include motion sensors,
RGB cameras, and depth sensors. The following sections give a brief summary of the
applications of each in turn.
2.1 Motion Sensors: Nintendo Wii
The Wii is a seventh-generation games console that was released by Nintendo in late 2006. Unlike previous consoles, the unique selling point of the Wii was a new form of player interaction, rather than greater power or graphics capability. This new form of interaction was the Wii Remote, a wireless controller with motion sensing capabilities. The controller contains an accelerometer, enabling it to sense acceleration in three dimensions, and an infrared sensor, which is used to determine where the remote is pointing [49].
Unlike light guns, which sense light from CRT screens, the remote detects light from the console's sensor bar, which features ten infrared LEDs (Figure 2.2). The light from each end of the bar is detected by the remote's optical sensor as two bright lights. Triangulation is used to determine the distance between the remote and the sensor bar, given
the observed distance between the two bright lights and the known distance between the LED arrays.
Figure 2.3: An example of the use of the Wii Remote's motion sensing capabilities to control game input. Here, the player moves the remote as he would move a putter when playing golf. The power of the putt is determined by the magnitude of the swing [74].
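Under a simple pinhole-camera model, this triangulation reduces to similar triangles: the distance is the physical LED baseline multiplied by the focal length and divided by the observed separation of the two blobs on the sensor. The sketch below is an approximation with made-up numbers; the Wii's actual baseline, focal length, and calibration are not given here.

```python
def distance_to_sensor_bar(baseline_mm, focal_px, observed_separation_px):
    """Pinhole-camera estimate of the remote-to-bar distance.

    baseline_mm: physical distance between the bar's two LED clusters (assumed).
    focal_px:    focal length of the remote's IR camera, in pixels (assumed).
    observed_separation_px: distance between the two bright blobs on the sensor.
    """
    return baseline_mm * focal_px / observed_separation_px

# With a 200 mm baseline and a 1300 px focal length (illustrative values),
# a 130 px blob separation puts the remote 2 metres from the bar.
print(distance_to_sensor_bar(200, 1300, 130))  # 2000.0
```

The key property is the inverse relationship: as the remote moves away, the two blobs appear closer together on the sensor, so the estimated distance grows.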
The capability of the Wii to track position and motion enables the player to mimic actual game actions, such as swinging a sword or tennis racket. This capability is demonstrated by games such as Wii Sports, which was included with the games console in the first few years after its release. The remote can be used to mimic the action of bowling a ball, or swung like a tennis racket, a baseball bat, or a golf club (Figure 2.3).
2.1.1 Impact
The player still uses a controller, although in some games, like Wii Sports, it is now
the position and movement of the remote that is used to influence events in-game. This
21
-
8/13/2019 Beyond Controllers - Human Segmentation, Pose, And Depth Estimation as Game Input Mechanisms
26/166
2.2. RGB Cameras: EyeToy and Playstation Eye
(a) EyeToy [114] (b) PlayStation Eye [32]
Figure 2.4: The two webcam peripherals released by Sony.
makes playing the games more tiring than before, especially if they are played with
vigour. However, a positive effect is that the control system is more intuitive, meaning
that people who don't normally play traditional video games might still be interested in owning a Wii console [103].
2.2 RGB Cameras: EyeToy and PlayStation Eye
While Nintendo's approach uses the position and motion of the controller to enhance
gameplay, other games developers have made use of the RGB images provided by cameras.
The first camera released as a games console peripheral and used as an input device for a
computer game was the EyeToy, which was released for the PlayStation 2 (PS2) in 2003.
This was followed in 2007 by the PlayStation Eye (PS Eye) for the PlayStation 3 (PS3).
Some of Sonys recent games have used the PlayStation Move (PS Move) in addition
to the PS Eye. The Move is a handheld plastic controller which has a large, bright ball
on the top; the hue of this ball can be altered by the software. During gameplay, the ball
is easily detectable by the software, and is used as a basis for determining the position,
orientation and motion of the PS Move [113].
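The colour-based tracking of a bright, software-controlled ball can be sketched as a hue threshold followed by a centroid computation. The hypothetical `ball_centroid` helper and the pixel values below are illustrative only, not Sony's implementation:

```python
# Hue-based detection of a brightly coloured sphere, the principle behind
# tracking a glowing ball of known hue: convert pixels to HSV and average
# the positions of those close to the target hue. The frame and thresholds
# here are illustrative assumptions.

import colorsys

def ball_centroid(frame, target_hue, tol=0.05):
    """frame: 2D list of (r, g, b) tuples in 0..1; returns (x, y) or None."""
    xs, ys = [], []
    for y, row in enumerate(frame):
        for x, (r, g, b) in enumerate(row):
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            if abs(h - target_hue) < tol and s > 0.5 and v > 0.5:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return sum(xs) / len(xs), sum(ys) / len(ys)

# A 3x3 frame with a magenta "ball" pixel at (2, 1) on a grey background:
grey, magenta = (0.5, 0.5, 0.5), (1.0, 0.0, 1.0)
frame = [[grey] * 3 for _ in range(3)]
frame[1][2] = magenta
print(ball_centroid(frame, target_hue=5 / 6))  # (2.0, 1.0)
```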
The degree to which vision techniques have been applied to EyeToy games has varied
widely. Some games only use the camera to allow the user to see themselves, whereas
others require significant levels of image processing. The following sections contain descriptions of some of the games that have used image processing to enhance gameplay.

(a) Ghost Catcher [48] (b) Keep Up [117] (c) Kung Foo [42]

Figure 2.5: Screenshots of three mini-games from EyeToy: Play.
2.2.1 Early Games
The EyeToy was originally released in a bundle with EyeToy: Play, which features twelve mini-games. The gameplay is simplistic, as is common with party-oriented
video games. Many of them rely on motion detection; for instance, the object of Ghost
Catcher is to fill ghosts with air and then pop them, and this is done by repeatedly
waving your hands over them. Others, such as Keep Up, use human detection; the
player is required to keep a ball in the air. Therefore, the game needs to determine
whether there is a person in the area where the ball is.
A third use of vision in this game occurs in Kung Foo; in this mini-game, the
player stands in the middle of the cameras field of view, and is instructed to hit ninjas
that fly onto the screen from various directions. Again, motion detection can be used to
determine whether a hit has been registered, as it doesn't matter which body part was used to perform the hit.
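Motion detection of this kind can be sketched as simple frame differencing within a region of interest. The thresholds and toy frames below are illustrative assumptions, not the actual EyeToy: Play implementation:

```python
# A minimal frame-differencing motion detector: compare successive greyscale
# frames and report whether enough pixels changed inside a region of
# interest. Frames are plain nested lists standing in for camera images;
# the threshold values are illustrative.

def motion_in_region(prev, curr, region, diff_thresh=30, frac_thresh=0.2):
    (x0, y0, x1, y1) = region          # inclusive-exclusive pixel bounds
    changed = total = 0
    for y in range(y0, y1):
        for x in range(x0, x1):
            total += 1
            if abs(curr[y][x] - prev[y][x]) > diff_thresh:
                changed += 1
    return changed / total >= frac_thresh

# A 4x4 toy image pair where the top-left 2x2 block brightens sharply:
prev = [[10] * 4 for _ in range(4)]
curr = [row[:] for row in prev]
for y in range(2):
    for x in range(2):
        curr[y][x] = 200

print(motion_in_region(prev, curr, (0, 0, 2, 2)))  # True: a hit registered
print(motion_in_region(prev, curr, (2, 2, 4, 4)))  # False: no motion here
```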
Impact
As the mini-games in EyeToy: Play only require simplistic image understanding techniques, specifically the detection of motion within a small portion of the camera's field
of view, the underlying techniques seemed to work well. As with Wii Sports, the game
was aimed at casual gamers rather than traditional, or hardcore, gamers.
(a) Antigrav screenshot (b) Close-up of user display

Figure 2.6: A screenshot [91] from Antigrav, where the player has extended their right arm to grab an object (the first of three) and thus score some points. The user display shows where the game has detected the player's hands to be.
2.2.2 Antigrav
Antigrav, a PS2 game that utilises the EyeToy, is a futuristic trick-based snowboarding
game, and was brought out by Harmonix in late 2004. The player takes control of a
character in the game, and guides them down a linear track. The game uses face tracking
to control the character's movements, enabling the player to increase the character's speed by ducking, and change direction by leaning. In addition, the player's hands are tracked,
and their hand position is used to infer a pose, enabling the player to literally grab for
collectible objects on-screen. The player can see what the computer calculates their head
and hand positions to be in the form of a small diagram in the corner of the screen, as
shown in Figure 2.6. A GameSpot review [24] points out:
this is good for letting you know when the EyeToy is misreading your movements, which takes place more often than it ought to.
The review, like other reviews of PS2 EyeToy releases, hints at further technological
limitations impairing the enjoyment of the game:
Harmonix pushes the limits of what you should expect from an EyeToy
entry... unfortunately, EyeToy pushes back, and its occasional inconsistency
hobbles an otherwise bold and enjoyable experience.
Impact
The reviews above imply that the head and hand detection techniques employed by
the game were not completely effective, meaning that users are often frustrated by their
actions not being recognised by the game due to failure of the tracking system. This highlights the importance of accuracy when developing vision algorithms for video games: if your tracking algorithm fails around 5% of the time, then the 95% accuracy is, quantitatively, extremely good. However, during a 3-minute run down a track in Antigrav, this could result in failures of the tracking system, each taking several seconds to recover from. These would be clearly noticeable to gamers.
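The arithmetic behind this point can be made concrete. The 30 fps camera rate below is an assumption typical of the hardware discussed, not a figure from the review:

```python
# Back-of-envelope illustration of why 95% per-frame tracking accuracy can
# still feel unreliable in-game. The 30 fps frame rate is an illustrative
# assumption.

fps = 30
run_seconds = 3 * 60           # one 3-minute run down the track
frames = fps * run_seconds     # 5400 frames per run
failed = frames * 5 // 100     # 5% of frames mis-tracked
print(frames, failed)          # 5400 270
```

Even at an impressive-sounding 95% accuracy, hundreds of frames per run are mis-tracked, which is why recovery time after a failure matters so much.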
2.2.3 Eye of Judgment
Figure 2.7: An image [3] showing the set-up of Eye of Judgment. The camera is pointed at a cloth, on which several cards are placed. These cards are recognised by the game, and the on-screen display shows the objects or creatures that the cards represent.
In 2007, Sony released Eye of Judgment, a role-playing card-game simulation that can be
compared to the popular card game Magic: The Gathering. The PS3 game comes with
a cloth, and a set of cards with patterns on them that the computer can easily recognise
(Figure 2.7). It can recognise the orientation as well as the identity of the cards, enabling
them to have different functions when oriented differently. A review reported very
few hardware-related issues, principally because of the pattern-based card recognition
system [122].
Since then, the PS3 saw very little PS Eye-related development before the release of EyePet in October 2009; in the two years between the releases of Eye of Judgment and EyePet, the use of the PS Eye was generally limited to uploading personalised images for game characters.
2.2.4 EyePet
(a) EyePet's AR marker (b) EyePet with trampoline
Figure 2.8: An example of augmented reality being used in EyePet [4].
EyePet features a virtual pet, which interacts with people and objects in the real world
using fairly crude motion sensing. For example, if the player rolls a ball towards the pet,
it will jump out of the way. Another major feature of the game is the use of augmented
reality: a card with a specific pattern is detected in the camera's field of view, and a
magic toy (a virtual object that the pet can interact with, such as a trampoline, a
bubble-blowing monkey, or a tennis player) is shown on top of the card (see Figure 2.8).
Impact
Again, EyePet uses fairly simplistic vision techniques, with marker detection and a motion
buffer being used throughout the game. This prevents it from receiving the sort of
criticism that was associated with Antigrav.
Although it was generally well-received, even EyePet did not escape criticism for the limitations of its technology, which "can't help but creak at times" according to a review published in Eurogamer [129]. The review goes on to say that performance is "robust under strong natural light, but patchy under electric light in the evening".
This sort of comment shows how unforgiving video gamers, or at least video game reviewers, can be: for a vision technique to be useful in a game, it needs to work under a very wide variety of environments and lighting conditions.
2.2.5 Wonderbook
Figure 2.9: The Wonderbook (a) [82] is used with the PlayStation Move controller. The interior (b) [115] features AR markers, as well as markings on the border to identify the edge of the book, and ones near the edge of the page, which help to identify the page quickly.
Wonderbook: Book of Spells, released by Sony in November 2012, is the first in an upcoming series of games that will use computer vision methods to enhance gameplay. The
games will be centred upon a book whose pages contain augmented reality markers and
other patterns (Figure 2.9). These are detected by various pattern recognition techniques,
in order to determine where the book is, and which pages are currently visible. Once
Figure 2.10: After the Wonderbook is detected, gameplay objects can be overlaid on-screen. In this image [116], a 3D stage is superimposed onto the book.
this is known, augmented reality can be used to replace the image of the book with, for
example, a burning stage (Figure 2.10).
In Book of Spells, the book becomes a spell book, and through the gameplay, spells
from the Harry Potter series are introduced [53]. At various points in the game, the
player must interact with the book, for example to put out fires by patting the book.
Skin detection algorithms are used to ensure that the player's hands appear to occlude the spellbook, rather than going through it.
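As a rough sketch of how such a skin test might look, the following uses a well-known rule-based RGB heuristic; it is an illustration of the idea, not Sony's actual algorithm:

```python
# A simple rule-based skin classifier of the kind that could drive the
# hand-occlusion effect described above: pixels classified as skin are
# drawn in front of the virtual book, everything else behind it. This
# particular RGB rule is a widely used heuristic, shown here purely as an
# illustration.

def is_skin(r, g, b):
    return (r > 95 and g > 40 and b > 20 and
            max(r, g, b) - min(r, g, b) > 15 and
            abs(r - g) > 15 and r > g and r > b)

print(is_skin(220, 170, 140))  # True: a typical skin tone
print(is_skin(60, 120, 180))   # False: a blue page border
```

In practice a games studio would combine a test like this with temporal smoothing and lighting compensation, since fixed thresholds are fragile under the varying illumination the Eurogamer review complained about.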
The generality of the book enables it to be used in multiple different kinds of games.
The BBC's Walking with Dinosaurs will be made into an interactive documentary, with the
player first excavating and completing dinosaur skeletons, and then feeding the dinosaurs
using the PS Move [55]. It remains to be seen how the final versions of these games will
be appraised by reviewers and customers, and thus whether the Wonderbook franchise
will have a significant impact on the video gaming market.
2.3 Depth Sensors: Microsoft Kinect
While RGB cameras can be useful in enhancing gameplay with vision techniques, the
extra information provided by depth cameras makes it significantly easier to determine
the structure of a scene. This enables games developers to provide a new way of playing.
With Kinect, which was released in November 2010, Microsoft offer a system where you
Figure 2.11: A selection of different games available for Kinect. (a) Kinect Sports [78]: two players compete against each other at football. (b) Dance Central [7]: players perform dance moves, which are tracked and judged by the game. (c) Kinect Star Wars [99]: players swing a lightsabre by making sweeping movements with their arm.
are the controller. Using an infrared depth sensor to track 3D movement, they generate
a detailed map of the scene, significantly simplifying the task of, for example, tracking
the movement of a person.
2.3.1 Technical Details
The Kinect provides a 320 × 240 16-bit depth image, and a 640 × 480 32-bit RGB image, both running at 30 frames per second (fps); the depth sensor has an active range of 1.2 to 3.5 metres [93]. The skeletal detection system, used for detecting the human body in each frame, is based on random forest classifiers [110], and is capable of tracking a twenty-link skeleton of up to two active players in real-time.¹ The software also provides an object-specific segmentation of the people in the scene (i.e. different people are segmented separately), and further enhances the player's experience by using person recognition to provide greetings and content.
2.3.2 Games
As with the Nintendo Wii, the Kinect was launched along with a sports game, namely
Kinect Sports. The controls are intuitive: the player makes a kicking motion in order
¹ In an articulated body, a link is defined as an inflexible part of the body. For example, if the flexibility of fingers and thumbs is ignored, each arm could be treated as three links, with one link each for hand, forearm and upper arm.
Figure 2.12: The furniture removal guidelines in the Kinect instruction manual [79] advise the player to move tables etc. that might block the camera's view, which may cause a problem for some users.
to kick a football, or runs on the spot in the athletics mini-games. No controllers or
buttons are required, which makes the games very easy to adapt to, although some of
the movements need to be exaggerated in order for the game to recognise them [18].
Another intuitive game is Dance Central, which uses the Kinect's full body tracking capabilities to compare the player's dance moves to those shown by an on-screen instructor. The object of the game is to imitate these moves in time with the music. This can be compared to classic games like Dance Dance Revolution, with the difference that the player's whole body is now used, enabling a greater variety of moves and adding an element of realism [128].
Up until now, games developers have struggled to produce a game that uses the Kinect's capabilities, yet still appeals to the serious gamer. One attempt was made in the 2011 release Kinect Star Wars, in which the player uses their arms to control a lightsaber, making sweeping or chopping motions to remove obstacles and to defeat enemies. However, this game was criticised due to its inability to keep up with fast and frantic arm motions [126].
A common problem with the Kinect model of gaming is that it is necessary to stand
a reasonable distance away from the camera (2 to 3 metres is the recommended range),
which makes gaming very difficult in small rooms, especially as any furniture will need
to be moved away (Figure 2.12).
2.3.3 Vision Applications
Since its release, and the subsequent release of an open-source software development
kit [83], the Kinect has been used in a wide variety of non-gaming related work by
computer vision researchers. Oikonomidis et al. [87] developed a hand-tracking system
capable of running at 15 fps, while Izadi et al. [54] perform real-time 3D reconstruction of indoor scenes by slowly moving the Kinect camera around the room. The Kinect has also
been shown to be a useful tool for easily collecting large amounts of training data [47].
However, due to IR interference, the depth sensor does not work in direct sunlight,
making it unsuitable for outdoor applications such as pedestrian detection [39].
2.3.4 Overall Impact
The Kinect has had a huge impact, selling 19 million units worldwide in its first eighteen months. This has helped Microsoft improve sales of the Xbox 360 year-on-year, despite the console now being in its seventh year. This is the reverse of the trend shown
by competing consoles [98]. The method of controlling games using the human body
rather than a controller is revolutionary, and the technology has also had a significant
effect on vision research, as mentioned in Section 2.3.3 above.
2.4 Discussion
To date, a number of vision methods that use RGB cameras have been introduced to
the video gaming community. However, these tend to be low-level (motion detection or
marker detection) rather than high-level: if the only information given is an RGB signal,
unconstrained object detection and human pose estimation are neither accurate nor fast
enough to be useful in video games.
The depth camera used in the Microsoft Kinect has provided a huge leap forward in this area, although the cost of this peripheral (which had a recommended retail price of £129.99 at release, around four times more than the PS Eye) means that an improvement
in the RGB-based techniques would be desirable. The next chapter contains an appraisal
of related work in computer vision that might be of interest to games developers, and
provides background for this thesis.
Chapter 3
State of the Art in Selected Vision
Algorithms
While we have seen in Chapter 2 that computer vision techniques are beginning to have a
profound effect on computer games, there are a number of research areas which could be
applied to further transform the gaming industry. Accurate object segmentation would
allow actual objects, or even people, to be taken directly from the player's surroundings
and put into the virtual environment of the game. Human motion tracking could be
used to allow the player to navigate a virtual world, for instance by steering a vehicle.
Finally, human pose estimation could be used to allow the player to control an avatar in
a platform or role-playing game. In this chapter, we will discuss the current state of the
art in energy minimisation, human pose estimation, segmentation, and stereo vision.
In order for computer vision techniques like localisation and pose estimation to be
suitable for use in computer games, the algorithm that applies the technique needs to
respond in real time as well as being accurate. A fast algorithm is necessary because
the results (e.g. pose estimates) need to be used in real-time so that they can affect
the game in-play; very high accuracy is a requirement because mistakes made by the
game will undoubtedly frustrate the user (see [24] and Section 2.2.2). The problem is
to find a suitable balance between these two requirements (a faster algorithm might
involve approximate solutions, and hence could be less accurate). This may involve
tweaking existing algorithms to produce significant speed increases without any loss in accuracy, or developing novel and significantly more accurate algorithms that still have speed comparable to current state-of-the-art algorithms.
3.1 Inference on Graphs: Energy Minimisation
Many of the most popular problems in computer vision can be framed as energy minimisation problems. This requires the definition of a function, known as an energy function, which expresses the suitability of a particular solution to the problem. Solutions that are more probable should give the energy function a lower value; hence, we wish to find the solution that gives the lowest value.
3.1.1 Conditional Random Fields
Suppose we have a finite set $V$ of random variables, to which we wish to assign labels from a label set $L$. If all the variables are independent, then this problem is easily solvable: just find the best label for each variable. However, in general we have relationships between variables. Let $E$ be the set of pairs of variables $\{v_1, v_2\} \subseteq V$ which are related to one another.

We can then construct a graph $G = (V, E)$ which specifies both the set of variables, and the relationships between those variables. $G$ is a directed graph if the pairs in $E$ are ordered; this enables us to construct graphs where, for some $v_1, v_2$, $(v_1, v_2) \in E$ but $(v_2, v_1) \notin E$.
Given some observed data $X$, we can assign a set $\{y_i : v_i \in V\}$ of values to the variables in $V$. Let $f$ denote a function that assigns a label $f(v_i) = y_i$ to each $v_i \in V$. Now, suppose that we also have a probability function $p$ that gives us the probability of
a particular labelling $\{f(v_i) : v_i \in V\}$ given observed data $X$. Then:

Definition 3.1

$(X, V)$ is a conditional random field if, when conditioned on $X$, the variables $V$ obey the Markov property with respect to $G$:

$$p(f(v_i) = y_i \mid X, \{f(v_j) : j \neq i\}) = p(f(v_i) = y_i \mid X, \{f(v_j) : (v_i, v_j) \in E\}). \tag{3.1}$$

In other words, each output variable $y_i$ only depends on its neighbours [72].
3.1.2 Submodular Terms
Now we consider set functions, which are functions whose input is a set. For example, suppose we have a set $Y$ of possible variable values, and a set $V$, with size $n = |V|$, of variables $v_i$ which each take a value $y_i \in Y$. A function $f$ which takes as input an assignment of these variables $\{y_i : v_i \in V\}$ is a set function.

Energy functions are set functions $f : Y^n \to \mathbb{R}^+ \cup \{0\}$, which take as input the variable values $\{y_i : v_i \in V\}$, and output some non-negative real number. If the variable values are binary, then this $f$ is a binary set function $f : 2^V \to \mathbb{R}^+ \cup \{0\}$.
Definition 3.2

A binary set function $f : 2^V \to \mathbb{R}^+ \cup \{0\}$ is submodular if and only if for every pair of sets $S, T \subseteq V$ we have that:

$$f(S) + f(T) \geq f(S \cup T) + f(S \cap T). \tag{3.2}$$
For example, if $n = 2$, $S = [1, 0]$ and $T = [0, 1]$, a submodular function will satisfy the following inequality [104]:

$$f([1, 0]) + f([0, 1]) \geq f([1, 1]) + f([0, 0]). \tag{3.3}$$
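Definition 3.2 can be checked numerically by brute force over all pairs of subsets, which is feasible for small ground sets. The example functions below are illustrative choices, not from the text:

```python
# Brute-force check of the submodularity inequality
# f(S) + f(T) >= f(S | T) + f(S & T) over all pairs of subsets, using
# bitmasks to represent subsets of an n-element ground set.

from itertools import product

def is_submodular(f, n):
    for s, t in product(range(1 << n), repeat=2):
        if f(s) + f(t) < f(s | t) + f(s & t):  # union/intersection as bit-ops
            return False
    return True

def coverage(s):
    return bin(s).count("1")          # |S|: modular, hence submodular

def squared(s):
    return bin(s).count("1") ** 2     # |S|^2: strictly convex, violates (3.2)

print(is_submodular(coverage, 3))  # True
print(is_submodular(squared, 3))   # False
```

For `squared`, taking two disjoint singletons gives $1 + 1 = 2$ on the left of (3.2) but $4 + 0 = 4$ on the right, violating the inequality.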
From Schrijver [104], we also have the following proposition:

Proposition 3.1 The sum of submodular functions is submodular.

Proof It is sufficient to prove that, given two submodular functions $f : A \to \mathbb{R}^+ \cup \{0\}$ and $g : B \to \mathbb{R}^+ \cup \{0\}$, $h = f + g : A \cup B \to \mathbb{R}^+ \cup \{0\}$ is submodular.

$$\begin{aligned}
h(S) + h(T) &= (f+g)(S) + (f+g)(T) \\
&= f(S|_A) + g(S|_B) + f(T|_A) + g(T|_B) \\
&= \big(f(S|_A) + f(T|_A)\big) + \big(g(S|_B) + g(T|_B)\big) \\
&\geq \big(f((S \cup T)|_A) + f((S \cap T)|_A)\big) + \big(g((S \cup T)|_B) + g((S \cap T)|_B)\big) \\
&= f((S \cup T)|_A) + g((S \cup T)|_B) + f((S \cap T)|_A) + g((S \cap T)|_B) \\
&= h(S \cup T) + h(S \cap T).
\end{aligned}$$
As shown by Kolmogorov and Zabih [61], one way of minimising energy functions, particularly submodular energy functions, is via graph cuts, which we will now introduce.
3.1.3 The st-Mincut Problem
In this section, we will consider directed graphs $G = (V, E)$ that have special nodes $s, t \in V$ such that for all $v_i \in V \setminus \{s, t\}$, we have $(s, v_i) \in E$, $(v_i, t) \in E$, $(v_i, s) \notin E$, and $(t, v_i) \notin E$. We say that $s$ is the source node and $t$ is the sink node of the graph. Such a graph is also known as a flow network. Let $c$ be a function $c : E \to \mathbb{R}^+ \cup \{0\}$, where for each $(v_1, v_2) \in E$, $c(v_1, v_2)$ represents the capacity, or maximum amount of flow, of the edge.
36
-
8/13/2019 Beyond Controllers - Human Segmentation, Pose, And Depth Estimation as Game Input Mechanisms
41/166
3.1. Inference on Graphs: Energy Minimisation
Max Flow
Definition 3.3

A flow function is a function $f : E \to \mathbb{R}^+ \cup \{0\}$ which satisfies the following constraints:

1. $f(v_1, v_2) \leq c(v_1, v_2) \quad \forall (v_1, v_2) \in E$;

2. $\sum_{v_1 : (v_1, v) \in E} f(v_1, v) = \sum_{v_2 : (v, v_2) \in E} f(v, v_2) \quad \forall v \in V \setminus \{s, t\}$.
The definition given above gives us two guarantees: first, that the flow passing along a
particular edge does not exceed that edges capacity; and second, that the flow entering
a vertex is equal to the flow leaving that vertex. From this second constraint, we can
derive the following:
Definition 3.4
The flow of a flow function is the total amount passing from the source to the sink, and is equal to $\sum_{(s,v) \in E} f(s, v)$.
The objective of the max flow problem is to maximise the flow of a network, i.e. to find a flow function $f$ with the highest flow.
Min Cut
Definition 3.5
An s-t cut $C = (S, T)$ is a partition of the variables $v \in V$ into two disjoint sets $S$ and $T$, with $s \in S$ and $t \in T$.
Let $E'$ be the set of edges that connect a variable $v_1 \in S$ to a variable $v_2 \in T$. Formally:

$$E' = \{(v_1, v_2) \in E : v_1 \in S, v_2 \in T\} \tag{3.4}$$

Note that there are at least $|V| - 2$ edges in $E'$, as if $v \in S \setminus \{s\}$, then $(v, t) \in E'$, and if $v \in T \setminus \{t\}$, then $(s, v) \in E'$. Depending on the connectivity of $G$, there may be up to $(|S| - 1)(|T| - 1)$ additional edges.
Definition 3.6
The capacity of an s-t cut is the sum of the capacities of the edges connecting $S$ to $T$, and is equal to $\sum_{(v_1, v_2) \in E'} c(v_1, v_2)$.
The objective of the min cut problem is to find an s-t cut which has minimal capacity
(there may be more than one solution).
In 1956, it was shown independently by Ford and Fulkerson [41] and by Elias et al. [30] that the two problems above are equivalent. Therefore, to find a flow function that has maximal flow, one needs only to find an s-t cut with minimal capacity. Algorithms that seek to obtain such an s-t cut are known as graph cut algorithms. Submodular functions can be efficiently minimised via graph cuts [15, 61]; C++ code is available that performs this minimisation using an augmenting path algorithm [14, 58, 61]. This code is often used as a basis for image segmentation algorithms, for example [9, 16, 71, 100, 101, 130].
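The max-flow/min-cut equivalence can be demonstrated with a minimal Edmonds-Karp implementation (BFS augmenting paths). This is only a sketch of the principle on a toy network, far slower than the specialised C++ code cited above:

```python
# A minimal Edmonds-Karp max-flow implementation. When no augmenting path
# remains, the set of vertices still reachable from s in the residual graph
# is the source side of a minimum cut, and the flow value equals the cut
# capacity. The example network is illustrative.

from collections import deque

def max_flow(cap, s, t):
    """cap: dict {u: {v: capacity}}; returns (flow value, source-side set S)."""
    # Residual-capacity table, including zero-capacity reverse edges.
    res = {u: dict(nbrs) for u, nbrs in cap.items()}
    for u, nbrs in cap.items():
        for v in nbrs:
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # BFS for a shortest augmenting path from s to t.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path: reachable vertices form the min-cut set S.
            return flow, set(parent)
        # Find the bottleneck capacity along the path, then augment.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
        flow += bottleneck

cap = {"s": {"a": 3, "b": 2}, "a": {"t": 2}, "b": {"t": 3}, "t": {}}
value, source_side = max_flow(cap, "s", "t")
print(value, sorted(source_side))  # 4 ['a', 's']
```

Here the cut separating {s, a} from {b, t} crosses edges a→t (capacity 2) and s→b (capacity 2), so its capacity of 4 matches the maximum flow, as the Ford-Fulkerson theorem guarantees.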
3.1.4 Application to Image Segmentation
To illustrate the use of energy minimisation in image segmentation, consider the following example. We have an image, shown in Figure 3.1, with just 9 pixels ($3 \times 3$). To construct a graph, we create a set of vertices $V = \{s, t, v_1, v_2, \ldots, v_9\}$, and a set of edges $E$, with $(s, v_i)$ and $(v_i, t) \in E$ for $i = 1$ to $9$, and $(v_i, v_j) \in E$ if $v_i$ and $v_j$ are adjacent in the image, as shown in Figure 3.1. The vertices $v_i$ have pixel values $p_i$ between 0 and 255 inclusive (i.e. the image is 8-bit greyscale, with 0 corresponding to black, and 255 to white). Our
Figure 3.1: Image for our toy example.
objective is to separate the pixels into foreground and background sets, i.e. to define a labelling $z = \{z_1, z_2, \ldots, z_9\}$, where $z_i = 1$ if and only if $v_i$ is assigned to the foreground set.

We wish to separate the light pixels in the image from the dark ones, with the light pixels in the foreground, so we create foreground and background penalties $F$ and $B$ respectively for pixels $v_i$ as follows:

$$F(v_i) = 255 - p_i; \tag{3.5}$$

$$B(v_i) = p_i. \tag{3.6}$$

These are known as unary pixel costs. The total unary cost of a labelling $z$ is:

$$\phi(z) = \sum_{i=1}^{9} \big(z_i F(v_i) + (1 - z_i) B(v_i)\big). \tag{3.7}$$
We also want the boundary of the foreground set to align with edges in the image. Therefore, we wish to penalise cases where adjacent pixels have similar values, but different labels. This is done by including a pairwise cost:

$$\psi(z) = \sum_{(v_i, v_j) \in E} \mathbf{1}(z_i \neq z_j) \exp(-|p_i - p_j|), \tag{3.8}$$

where $\mathbf{1}$ is the indicator function, which has a value of 1 if the statement within the
(a) $\lambda = 0.1$ (b) $\lambda = 0.2$ (c) $\lambda = 1$

Figure 3.2: Segmentation results for different values of $\lambda$. A higher value punishes segmentations with large boundaries; a high enough value (as in (c)) will make the result either all foreground or all background.
brackets is true, and zero otherwise. The overall energy function is:

$$f(z) = \phi(z) + \lambda \psi(z), \tag{3.9}$$

where $\lambda$ is a weight parameter; higher values of $\lambda$ will make it more likely that adjacent pixels have similar labels.

The energy function in (3.9) is submodular, and can therefore be minimised efficiently using the max flow code available at [58]. The segmentation results obtained for different values of $\lambda$ are shown in Figure 3.2. The ratio between the unary and pairwise weights influences the segmentation result produced.
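With only nine binary variables, the toy energy can also be minimised by enumerating all $2^9 = 512$ labellings, which makes the construction easy to verify without a max-flow solver. The pixel values below are an illustrative image, not the one in Figure 3.1:

```python
# Brute-force minimisation of the toy 3x3 segmentation energy: unary costs
# F(v) = 255 - p and B(v) = p, plus a contrast-sensitive pairwise cost on
# 4-connected neighbours, as in equations (3.5)-(3.9). The image below is
# an illustrative example.

import math
from itertools import product

pixels = [[230, 220, 30],
          [240, 25, 20],
          [35, 30, 10]]   # a light corner on a dark background

edges = [((y, x), (y, x + 1)) for y in range(3) for x in range(2)] + \
        [((y, x), (y + 1, x)) for y in range(2) for x in range(3)]

def energy(z, lam):
    unary = sum(z[y][x] * (255 - pixels[y][x]) + (1 - z[y][x]) * pixels[y][x]
                for y in range(3) for x in range(3))
    pairwise = sum(math.exp(-abs(pixels[a[0]][a[1]] - pixels[b[0]][b[1]]))
                   for a, b in edges if z[a[0]][a[1]] != z[b[0]][b[1]])
    return unary + lam * pairwise

def best_labelling(lam):
    candidates = (tuple(tuple(bits[3 * y:3 * y + 3]) for y in range(3))
                  for bits in product((0, 1), repeat=9))
    return min(candidates, key=lambda z: energy(z, lam))

print(best_labelling(lam=1.0))  # the light corner is labelled 1, the rest 0
```

At this moderate $\lambda$ the unary terms dominate and the minimiser labels exactly the pixels brighter than 127.5 as foreground; increasing $\lambda$ far enough collapses the result to all-foreground or all-background, as in Figure 3.2(c).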
3.2 Inference on Trees: Belief Propagation
While vision problems such as segmentation require a large number of variables (one per
image pixel), others, such as pose estimation, only require a smaller number of variables,
and hence a smaller graph. An important type of graph that is useful for this problem is
E) in the form of a message. Here, a message can be as simple as a scalar value, or a matrix of values. This message is then combined with information relevant to the vertex itself, to form a new message for the next vertex.
Messages are passed between vertices in a series of pre-defined updates. The number
of updates required to find an overall solution depends on the complexity of the graph.
If the graph has a simple structure, such as a chain (where each vertex is connected to
at most two other vertices, and the graph is not a cycle), then only one set of updates is
required to find the optimal set of values. This set can be found by an algorithm such as
the Viterbi algorithm [125].
3.2.2 Belief Propagation
Belief propagation can be viewed as a variation of the Viterbi algorithm that is applicable to trees. To use this process to perform inference on a tree $T$, we must choose a vertex of the tree to be the root vertex, denoted $v_0$.

Since $T$ is a tree, $v_0$ is connected to each of the other vertices by exactly one path. We can therefore re-order the vertices such that, for any vertex $v_i$, the path from $v_0$ to $v_i$ proceeds via vertices with indices in ascending order.¹ Once we have done this, we can introduce the notions of parent-child relations between vertices, defined here for clarity.
Definition 3.9
We say that a vertex $v_i$ is the parent of $v_j$ if $(v_i, v_j) \in E$ and $i < j$. If this is the case, we say that $v_j$ is a child of $v_i$.
Note that the root node has no parent, and each other vertex has exactly one parent, since if a vertex $v_j$ had two parents, then there would be more than one path from $v_0$ to $v_j$, which contradicts the definition of a tree. However, a vertex may have multiple
¹ There will typically be multiple ways to do this.
children, or none at all.
Definition 3.10
A vertex $v_i$ with no children is known as a leaf vertex.
We now describe the general form of belief propagation on our tree T. The vertices
in V are considered in two passes: a down pass, where the vertices are processed in
descending order, so that each vertex is processed after its children, but before its parent,
and an up pass, where the order is reversed.
For each leaf vertex $v_i$, we have a set $L_i = \{l_i^1, l_i^2, \ldots, l_i^{K_i}\}$ of possible labels, where $K_i$ is the number of labels in the set $L_i$. Then the score associated with assigning a particular label $l_i^p$ to vertex $v_i$ is:

$$\mathrm{score}_i(l_i^p) = \phi(v_i = l_i^p). \tag{3.10}$$
This score is the message that is passed to the parent of $v_i$.
For a vertex $v_j$ with at least one child, we need to combine these messages with the unary and pairwise energies, in order to produce a message for the parent of $v_j$. Again, we have a finite set $L_j = \{l_j^1, l_j^2, \ldots, l_j^{K_j}\}$ of possible labels for $v_j$. The score associated with assigning a particular label $l_j^q$ is:

$$\mathrm{score}_j(l_j^q) = \phi(v_j = l_j^q) + \sum_{i > j : (v_i, v_j) \in E} m_i(l_j^q), \tag{3.11}$$

where:

$$m_i(l_j^q) = \max_{l_i^p} \Big( \psi(v_i = l_i^p, v_j = l_j^q) + \mathrm{score}_i(l_i^p) \Big). \tag{3.12}$$
When the root vertex is reached, the optimal label for $v_0$ can be found by maximising $\mathrm{score}_0$, defined in (3.11). Finally, the globally optimal configuration can be found by keeping track of the arg max indices, and then tracing back through the tree on the up pass
to collect them. The up pass can be avoided if the arg max indices are recorded along
with the messages during the down pass.
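The down pass and arg max bookkeeping described above can be sketched as follows. The tree, labels, and scores are an illustrative toy model; scores are maximised, matching equations (3.10)-(3.12):

```python
# Compact max-sum belief propagation on a small tree: children are
# processed before their parents, each message carries the best achievable
# score for every parent label, and the argmax indices recorded on the way
# down let us read off the global optimum without an explicit up pass.

def tree_bp(parent, unary, pairwise):
    """parent[i] is the parent index of vertex i (parent[0] is None).
    unary[i][l]: score of label l at vertex i.
    pairwise[(i, li, lj)]: score of child label li with parent label lj.
    Returns the optimal label for every vertex."""
    n, k = len(unary), len(unary[0])
    score = [list(u) for u in unary]
    argmax = {}
    # Down pass: highest index first, so children precede their parents.
    for i in range(n - 1, 0, -1):
        j = parent[i]
        for lj in range(k):
            best = max(range(k),
                       key=lambda li: pairwise[(i, li, lj)] + score[i][li])
            argmax[(i, lj)] = best
            score[j][lj] += pairwise[(i, best, lj)] + score[i][best]
    # Root label, then trace recorded argmaxes from parents to children.
    labels = [max(range(k), key=lambda l: score[0][l])] + [None] * (n - 1)
    for i in range(1, n):
        labels[i] = argmax[(i, labels[parent[i]])]
    return labels

# Three-vertex chain v0 - v1 - v2 with two labels; pairwise rewards agreement.
parent = [None, 0, 1]
unary = [[2.0, 0.0], [0.5, 0.0], [0.0, 3.0]]
pairwise = {(i, a, b): (1.0 if a == b else 0.0)
            for i in (1, 2) for a in (0, 1) for b in (0, 1)}
print(tree_bp(parent, unary, pairwise))  # [0, 0, 1]
```

In this toy instance the strong unary preferences at the two ends outweigh the agreement bonus on one edge, so the middle vertex sides with the root.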
One of the vision problems that is both interesting for computer games and suitable
for the application of belief propagation is human pose estimation, and it is this problem
that is described in the next section.
3.3 Human Pose Estimation
In layman's terms, the problem of human pose estimation can be stated as follows: given
an image containing a person, the objective is to correctly classify the person's pose. This
pose can either take the form of a selection from a constrained list, or a free estimate of
the locations of the person's limbs and the angles of their joints (for example, the location
of their left arm, and the angle at which their elbow is bent). This is often formalised
by defining a skeleton model, which is to be fitted to the image. It is quite common to
describe the human body as an articulated object, i.e. one formed of a connected set of
rigid parts. Such a formalisation gives rise to a family of parts-based models known as
pictorial structure models.
These models typically consist of six parts (if the objective is restricted to upper body
pose estimation) or ten parts (full body) [10, 35-37]. The upper body model consists of
head, torso, and upper and lower arms; to extend this to the full body, upper and lower
leg parts are added. Having divided the human body into parts, one can then learn a
separate detector for each part, taking advantage of the fact that the parts have both a
simpler shape (not being articulated), and a simpler colour distribution.
Indeed, pose estimation can be formulated as an energy minimisation problem. In
contrast to segmentation problems, which require a different variable for each pixel, the
number of variables required is equal to the number of parts in the skeleton model.
However, the number of possible values that the variables can take is large (a part can,
in theory, occupy any position in the image, with any orientation and extent).
Figure 3.3: A depiction of the ten-part skeleton model used by Felzenszwalb and Huttenlocher [35].
3.3.1 Pictorial Structures
A pictorial structure model can be expressed as a graph G = (V, E), with the vertices
V = {v_1, v_2, ..., v_n} corresponding to the parts, and the edges E specifying which pairs
of parts (v_i, v_j) are connected. A typical graph is shown in Figure 3.3.
In Felzenszwalb and Huttenlocher's pictorial structure model [35], a particular labelling of the graph is given by a configuration L = {l_1, l_2, ..., l_n}, where each l_i specifies the location (x_i, y_i) of part v_i, together with its orientation and degree of foreshortening (i.e. the degree to which the limb appears to be shorter than it actually is, due to its angle relative to the camera). The energy of this labelling is then given by:

E(L) = Σ_{i=1}^{n} φ(l_i) + Σ_{(v_i, v_j) ∈ E} ψ(l_i, l_j),   (3.13)

where, as in the previous section, φ represents the unary energy on part configuration, and ψ the pairwise energy. These energies relate to the likelihood of a configuration, given
the image data, and given prior knowledge of the parts; more realistic configurations
will have lower energy. Despite the large number of possible configurations, a globally
optimal configuration can be found efficiently. This can be done by using simple appearance models for each part, explained in the following section, and then applying belief
propagation.
Appearance Model
For each part, appearance models can be learned from training data, and can be based on
edges [95], colour-invariant features such as HOG [22,37], or the position of the part within
the image [38]. Another approach for video sequences is to apply background subtraction,
and define a unary potential based on the number of foreground pixels around the object
location [35].
Given an image, this appearance model can be evaluated over a dense grid [1]; to
speed this process up, a feature pyramid can be defined, so that a number of promising
locations are found from a coarse grid, and then higher resolution part filters produce
more precise matching scores [34,36]. In order to reduce the time taken by the inference
process, it might be desirable to reduce the set of possible part locations. Two ways to
do this are:
1. Thresholding, where part locations whose score is worse than some predefined
value, or falls outside the top N values for some N, are discarded.
2. Non-maximal suppression, which involves the removal of part locations that are
similar, but inferior, to other part locations.
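The two pruning steps above can be sketched as follows. The greedy distance-based overlap criterion and the parameter names are illustrative assumptions; implementations vary in how "similar" locations are defined.

```python
# Pruning candidate part locations: score thresholding followed by greedy
# non-maximal suppression. The radius-based similarity test is an assumed
# (but common) choice, not a specific method from the text.

def prune_locations(candidates, score_threshold, top_n, radius):
    """candidates: list of (x, y, score). Returns the surviving locations."""
    # 1. Thresholding: drop weak scores, then keep only the top N.
    kept = [c for c in candidates if c[2] >= score_threshold]
    kept.sort(key=lambda c: c[2], reverse=True)
    kept = kept[:top_n]

    # 2. Non-maximal suppression: discard a location if a better-scoring
    #    one has already been accepted nearby.
    survivors = []
    for (x, y, s) in kept:  # already sorted best-first
        if all((x - sx) ** 2 + (y - sy) ** 2 > radius ** 2
               for (sx, sy, _) in survivors):
            survivors.append((x, y, s))
    return survivors
```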
Optimisation
After applying these techniques, we now have a small set of possible locations {l_i^1, l_i^2, ..., l_i^k} for each vertex v_i. For a leaf vertex v_i, the score of each location l_i^p is:

score_i(l_i^p) = φ(l_i^p).   (3.14)
Now, for a vertex v_j with at least one child, the score is defined in terms of the children:

score_j(l_j^p) = φ(l_j^p) + Σ_{v_i : (v_i, v_j) ∈ E} m_i(l_j^p),   (3.15)

where:

m_i(l_j^p) = max_{l_i^q} [ ψ(l_i^q, l_j^p) + score_i(l_i^q) ].   (3.16)
Finally, the top-scoring part configuration is found by finding the root location with the
highest score, and then tracing back through the tree, keeping track of the arg max indices.
Multiple detections can be generated by thresholding this score and using non-maximal
suppression.
3.3.2 Flexible Mixtures of Parts
Yang and Ramanan [131] extend these approaches by introducing a flexible mixture of
parts model, allowing for greater intra-limb variation.
Rather than using a classical articulated limb model such as that of Marr and Nishihara [75], they introduce a new representation: a mixture of non-orientable pictorial structures. Instead of having ten rigid parts, as the methods described in Section 3.3.1 do, their model has twenty-six rigid parts, which can be combined to form limbs and produce an estimate for the ten parts, as shown in Figure 3.4. Each part has a number T of possible types, learned from training data. Types may include orientations of a part (e.g. horizontal or vertical hand), and may also span semantic classes (e.g. open versus closed hand).
Model
Let us denote an image by I, the location of part i by p_i = (x, y), and its mixture component by t_i, with i ∈ {1, ..., K}, p_i ∈ {1, ..., L}, and t_i ∈ {1, ..., T}, where K is the number of parts, L is the number of possible part locations, and T is the number of mixture components per part.
where dx and dy represent the relative location of part i with respect to part j. The parameter w_{(i,j)}^{(t_i, t_j)} encodes the expected values for dx and dy, tailored for types t_i and t_j. So if i is the elbow and j is the forearm, with t_i and t_j specifying vertically-oriented parts (i.e. the arm is at the person's side), we would expect p_j to be below p_i in the image.
Inference
To perform inference on this model, Yang and Ramanan maximise S(I, p, t) over p and t.
Since the graph G in Figure 3.5 is a tree, belief propagation (see Section 3.2.2) can again
be used. The score of a particular leaf node p_i with mixture t_i is:
score_i(t_i, p_i) = b_i^{t_i} + w_i^{t_i} · φ(I, p_i),   (3.19)

and for all other nodes, we take into account the messages passed from the node's children:

score_i(t_i, p_i) = b_i^{t_i} + w_i^{t_i} · φ(I, p_i) + Σ_{c ∈ children(i)} m_c(t_i, p_i),   (3.20)

where:

m_c(t_i, p_i) = max_{t_c} [ b_{(i,c)}^{(t_i, t_c)} + max_{p_c} ( w_{(i,c)}^{(t_i, t_c)} · ψ(p_i − p_c) + score_c(t_c, p_c) ) ].   (3.21)

Once the messages reach the root part (i = 1), score_1(t_1, p_1) contains the best-scoring skeleton model given the root part location p_1. Multiple detections can be generated by thresholding this score and using non-maximal suppression.
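The type-augmented message of equation (3.21) can be sketched by brute force. Yang and Ramanan accelerate the inner max over p_c with distance transforms; this O(L²) version is for exposition only, and the quadratic deformation feature ψ is an assumed (but common) choice rather than a value taken from the text.

```python
import numpy as np

# Brute-force computation of the message m_c(t_i, p_i) in Eq. (3.21).
# The quadratic deformation feature below is an illustrative assumption.

def deformation(dp):
    dx, dy = dp
    return np.array([dx, dy, dx * dx, dy * dy], dtype=float)

def message(locations, score_c, b, w):
    """locations: list of (x, y) candidate positions (shared by parent and child).
    score_c[t_c][p_c]: child scores from Eq. (3.20);
    b[t_i][t_c]: type co-occurrence bias; w[t_i][t_c]: deformation weight vector.
    Returns m with m[t_i, p_i] = value of Eq. (3.21)."""
    T, L = len(b), len(locations)
    m = np.full((T, L), -np.inf)
    for ti in range(T):
        for pi, (xi, yi) in enumerate(locations):
            for tc in range(len(b[ti])):
                # inner max over child locations p_c for this child type t_c
                best = max(
                    w[ti][tc] @ deformation((xi - xc, yi - yc)) + score_c[tc][pc]
                    for pc, (xc, yc) in enumerate(locations)
                )
                m[ti, pi] = max(m[ti, pi], b[ti][tc] + best)
    return m
```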
3.3.3 Unifying Segmentation and Pose Estimation
So far, a number of methods for solving either human segmentation or pose estimation
have been discussed. Some recent work has also been done that attempts to solve both
tasks together. In this section, we discuss PoseCut [16].
PoseCut
Bray et al. [16] tackle the segmentation problem by introducing a pose-specific Markov
random field (MRF), which encourages the segmentation result to look human-like.
This prior differs from image to image, as it depends on which pose the human is in.
Given an image, they find the best pose prior Θ_opt by solving:

Θ_opt = arg min_Θ ( min_x Ψ_3(x, Θ) ),   (3.22)

where x specifies the segmentation result, and Ψ_3 is the Object Category Specific MRF from [65], which defines how well a pose prior Θ fits a segmentation result x. It is defined as follows:

Ψ_3(x, Θ) = Σ_i ( φ(I|x_i) + φ(x_i|Θ) + Σ_j ( φ(I|x_i, x_j) + ψ(x_i, x_j) ) ),   (3.23)

where I is the observed (image) data, φ(I|x_i) is the unary segmentation energy, φ(x_i|Θ) is the cost of the segmentation given the pose prior (penalising pixels near to the shape being labelled background, and pixels far from the shape being labelled foreground), and the ψ term is a pairwise energy. Finally, φ(I|x_i, x_j) is a contrast-sensitive term, defined as:
φ(I|x_i, x_j) = γ(i, j) if x_i ≠ x_j, and 0 if x_i = x_j,   (3.24)

where γ(i, j) decreases as the difference in RGB values between pixels i and j grows; pixels with similar values will have a high value for γ(i, j), since we wish to encourage these pixels to take the same label.
Given a particular pose prior Θ, the optimal configuration x* = arg min_x Ψ_3(x, Θ) can be found using a single graph cut. The final solution arg min_x Ψ_3(x, Θ_opt) is found using the Powell minimisation algorithm [94].
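One common form for the contrast-sensitive weight γ(i, j) in equation (3.24) is an exponentiated colour difference, so that similar pixels receive a high weight. The exponential form and the σ parameter below are assumptions for illustration, not necessarily the exact values used in PoseCut.

```python
import numpy as np

# A common contrast-sensitive weight: gamma(i, j) = exp(-||I_i - I_j||^2 / (2 sigma^2)).
# Identical pixels get weight 1; very different pixels get a weight near 0,
# so cutting between them in Eq. (3.24) is cheap. The form and sigma are
# illustrative assumptions.

def gamma(rgb_i, rgb_j, sigma=10.0):
    diff = np.asarray(rgb_i, dtype=float) - np.asarray(rgb_j, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))
```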
3.4 Stereo Vision
Stereo correspondence algorithms typically denote one image as the reference image and
the other as the target image. A dense set of patches is extracted from the reference
image, and for each of these patches, the best match is found in the target image. The
displacement between the two patches is known as the disparity; the disparity for each
pixel in the reference image is stored in a disparity map. It can easily be shown that
the disparity of a pixel is inversely proportional to its distance from the camera, or its
depth [105]. A typical disparity map is shown in Figure 3.6.
A plethora of stereo correspondence algorithms have been developed over the years.
Scharstein and Szeliski [102] note that earlier methods can typically be divided into four
stages: (i) matching cost computation; (ii) cost aggregation; (iii) disparity computation;
and (iv) disparity refinement; later methods can be described in a similar fashion [20, 60, 76, 92, 97].
It is quite common to use the sum of absolute differences (SAD) measure when finding the matching cost for each pixel. A patch with a height and width of 2n + 1 pixels, for some n ≥ 0, is extracted from the reference image. Then, for each disparity value d, a patch is extracted from the target image, and the pixelwise intensity values of the two patches are compared.
With L and R representing the reference (left) and target (right) images respectively, the cost of assigning disparity d to a pixel (x, y) in L is as follows:

C(x, y, d) = Σ_{Δx=−n}^{n} Σ_{Δy=−n}^{n} |L(x + Δx, y + Δy) − R(x + Δx − d, y + Δy)|.   (3.25)

Evaluating this cost over all pixels and disparities provides a cost volume, on which aggregation methods such as smoothing can be applied in order to reduce noise. Disparity values for each pixel can then be computed. The simplest method for doing this is just to find, for each pixel (x, y), the disparity value d which minimises C(x, y, d).
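This winner-take-all scheme over the SAD cost of equation (3.25) can be sketched as follows. The brute-force loops are for exposition only; practical implementations vectorise the cost volume or use integral images, and the border handling here is an assumption.

```python
import numpy as np

# Winner-take-all disparity from the SAD cost of Eq. (3.25).
# left/right are 2D greyscale arrays; the matching window is (2n+1) x (2n+1).
# Borders and pixels without a full disparity range are left at 0 (an
# assumed convention). Brute force, for exposition only.

def sad_disparity(left, right, n, max_disp):
    h, w = left.shape
    disp = np.zeros((h, w), dtype=int)
    for y in range(n, h - n):
        for x in range(n + max_disp, w - n):
            patch_l = left[y - n:y + n + 1, x - n:x + n + 1].astype(float)
            costs = [
                np.abs(patch_l -
                       right[y - n:y + n + 1, x - d - n:x - d + n + 1]).sum()
                for d in range(max_disp + 1)
            ]
            disp[y, x] = int(np.argmin(costs))  # d minimising C(x, y, d)
    return disp
```

On a synthetic pair where the left image is the right image shifted two pixels, every valid pixel recovers disparity 2; as the text notes next, real image pairs produce much noisier results.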
However, such a method is likely to result in a high degree of noise. Additionally, pixels
immediately outside a foreground object are often given disparities that are higher than
Figure 3.6: An example disparity map produced by a stereo correspondence algorithm. (a) Left image; (b) right image; (c) disparity map. Pixels that are closer to the camera appear brighter in colour. Note the dark gaps produced in areas with little or no texture.
the possibility of controlling video games with the human body. To this end, in the next
chapter, we will begin exploring the possibility of providing a human scene understanding
framework, combining pose estimation with segmentation and depth estimation.
Chapter 4

3D Human Pose Estimation in a Stereo Pair of Images
The problem of human pose estimation has been widely studied in the computer vision
literature; a survey of recent work is provided in Section 3.3. Despite the large body of
research focussing on 2D human pose estimation, relatively little work has been done to
estimate pose in 3D, and in particular, annotated datasets featuring frontoparallel stereo
views of humans are non-existent.
In recent years, some research has focussed on combining segmentation and pose
estimation to produce a richer understanding of a scene [10, 16, 66, 89]. Many of these
approaches simply put the algorithms into a pipeline, where the result of one algorithm
is used to drive the other [10, 16, 89]. The problem with this is that it often proves
impossible to recover from errors made in the early stages of the process. Therefore,
a joint inference framework, as proposed by Wang and Koller [127] for 2D human pose
estimation, is desired.
This chapter describes a new algorithm for estimating human pose in 3D, while simultaneously solving the problems of stereo matching and human segmentation. The algorithm uses an optimisation method known as dual decomposition, of which we give an overview in Section 4.1.
Following that, a new dataset for two-view human segmentation and pose estimation,